mad_ady wrote: ↑
Tue Jul 16, 2019 2:25 am
I'm not familiar with infiniband (is it mostly found in the datacenter?)
InfiniBand is a non-Ethernet protocol. The standard is defined by an organization, the IBTA (where Mellanox is obviously the 800-pound gorilla, having been the dominant implementer from the start). Mellanox has been/is enormously successful and was bought for big $$$ by NVIDIA. Intel could not buy them, probably a matter of $$$ but also of likely governmental rejection, because they had already bought QLogic and others in the same industry. NVIDIA is happy: they now have a fast interconnect (using RDMA, continue reading) for their GPU racks and farms.
The protocol can run over passive DAC copper cables or active optical cables. The latter allow long distances (several hundred meters): think universities, large organizations, or very big data centers.
InfiniBand is NOT Ethernet: it does not use the IP stack. It uses RDMA (Remote Direct Memory Access) instead. It is based on the notion of a "client" app and a "server" app. The RDMA concept is very simple: the "client" app issues an RDMA request and the NIC copies the data directly from the app's buffers (which the app registered for that purpose). The NIC then takes care of sending it to the "server" NIC. Upon reception, the latter copies the data directly into the "server" app's memory (which that app registered for that purpose). It is fast, has low latency, and the CPUs on the "client" and "server" machines are not involved in the network request-response between the two apps.
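You can see this client/server model from the command line with the perftest tools that ship with MLNX_OFED (or with the perftest package). A minimal sketch, once both cards are up (all the setup steps are further down); the device name mlx4_0 and the address are examples, and the address is only used for the initial TCP handshake, the data itself flows over IB via RDMA:
# on the "server" PC: wait for a client and let it RDMA-write into registered memory
ib_write_bw -d mlx4_0 --report_gbits
# on the "client" PC: connect to the server's regular (or IPoIB) IP and measure RDMA write bandwidth
ib_write_bw -d mlx4_0 --report_gbits 192.168.1.20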
From there, Mellanox implemented IPoIB (IP over InfiniBand) to transport IP-based network communication over its RDMA protocol. Quite successful: data centers and large organizations love it because they can leverage the IB speed with their IP-based apps with no code change. Downside: there is overhead in putting the IP packets on top of RDMA packets and... the CPUs are back in the loop managing the IP stack. Let's say you lose 30% of bandwidth, but it is still very fast compared to 1GbE or 10GbE Ethernet.
From there, under the pressure of its customers and competitors (e.g. QLogic, Chelsio, etc.), Mellanox implemented RoCE (RDMA over Converged Ethernet). A big term just to say that you can do RDMA over Ethernet (basically the reverse of IPoIB). Again, data centers and large organizations love it because if they already have good-quality Ethernet cabling, they don't have to lay new cables.
Competitors to Mellanox (e.g. Chelsio) will say that InfiniBand has been useless since the introduction of RoCE (in their dreams; meanwhile Mellanox makes money).
Intel is impotently furious: they were number one for Gigabit Ethernet, but since then they have never been able to catch up with Mellanox. Their 100GbE cards suck and their OmniPath network thingy does not really move the populace. Intel is working hard on catching the 100GbE wagon, but Mellanox is already selling 200Gbps solutions (the NICs are two cards using two PCIe 3 x16 slots). Both Mellanox and Broadcom already have 400Gbps sample products.
Mellanox designs and sells network cards, switches and cables. All the Big Guys like Dell, EMC, HP, etc. distributed/distribute network products made by Mellanox, just putting their own name on them. Given the torrent of hardware sold in the last ten years, and the data centers and organizations updating their hardware for even faster speed, the used products end up on eBay at prices you can't resist. You can get a used dual-port 40/56Gbps Mellanox card for less than $75 (sometimes less than $50), meaning less than the cheapest 10GbE Ethernet cards on sale (as of this writing). You just need to be patient and wait for a (reputable) guy selling at a low price. Risk? Not too much: you just need to check the seller's feedback percentage, read the description carefully and refer to the Mellanox technical archives to find out what the product exactly is. In other words, do your homework.
Note that the MSRP for these cards and cables is in the $300-$400 range. Although these are "enterprise prices", I doubt anybody bought them at these prices. As they say, nothing sells better than a 25, 30 or 50% discount, so let's start with very high prices.
Regarding the 'do your homework' part, it failed for me twice. Once I received two optical cables that were bent but seemed to work; only later did I find out that they triggered too many retries (IB is a lossless protocol: in theory it won't drop a packet). Cost: $100. Then the 18-port 56Gbps switch: I checked it when I received it, turned it on, connected a PC to it, started the subnet manager: all OK! It was only last weekend that I found out this is basically the only thing that works: connecting another PC to ports 2 through 18 never brings up the link. Cost: $175. So my advice is simple: try everything when you receive the hardware. For me it was too late, and the return shipping (on me) was not worth the trouble. I'll buy another one when I'm tired of 40Gbps.
Mellanox cards can run as IB or Ethernet cards. With a dual-port card you can use port 1 for IB and do the same for port 2, but you can also use Ethernet on port 2 while still using IB on port 1.
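For what it is worth, a sketch of how the port type is switched with mlxconfig (part of the MFT package mentioned below); the device path is an example, get yours from mst status, and a reboot is needed for the change to take effect:
mst start
mst status                                    # lists the devices, e.g. /dev/mst/mt4099_pci_cr0
mlxconfig -d /dev/mst/mt4099_pci_cr0 query    # shows the current LINK_TYPE_P1 / LINK_TYPE_P2
mlxconfig -d /dev/mst/mt4099_pci_cr0 set LINK_TYPE_P1=1 LINK_TYPE_P2=2   # 1=IB, 2=ETH, 3=auto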
As an initial try, you do not need a switch. Just buy two cards and connect them directly, start the subnet manager (see below) and you have a 40 or 56Gbps link between your two PCs (depending on the cable). Cost? Let's say $75 per card (if you are impatient) and $50 for a 3-meter optical cable (40Gbps) or $60 (56Gbps). All the software you need is downloadable (free of charge) from the Mellanox web site.
Card (you need two of them)
https://www.ebay.com/sch/i.html?_from=R ... BT&_sop=15
https://www.ebay.com/sch/i.html?_from=R ... BT&_sop=15
Note: these used cards come with a low-profile bracket, so buy the full-height bracket too ($5 on eBay).
Optical cable (you need one)
https://www.ebay.com/sch/i.html?_from=R ... 10&_sop=15
https://www.ebay.com/sch/i.html?_from=R ... 1V&_sop=15
WARNING: as mentioned earlier in this thread, these cards expect a PCIe 3 x8 slot to work at max speed. Fewer than 8 lanes and/or less than PCIe 3 (i.e. PCIe 2) will significantly degrade the speed (you could end up at ~12Gbps with IPoIB on "old" hardware, as shown earlier in this thread).
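A quick way to check what the card actually negotiated (the PCI address is an example, take it from the first command):
lspci | grep -i mellanox                              # find the card's PCI address, e.g. 03:00.0
sudo lspci -vv -s 03:00.0 | grep -E 'LnkCap|LnkSta'   # LnkSta should show Speed 8GT/s (PCIe 3), Width x8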
Software you'll need (*):
- Linux: http://www.mellanox.com/page/products_d ... _family=26
(different versions for each distrib, download the one you need)
For linux, you are looking for MLNX_OFED_LINUX-4.6-1.0.1.1-rhel7.6-x86_64.tgz, or MLNX_OFED_LINUX-4.6-1.0.1.1-ubuntu18.04-x86_64.tgz, etc.
- Windows: https://www.mellanox.com/page/products_ ... sw_drivers
For Windows, you are looking for MLNX_VPI_WinOF-5_50_52000_All_win2019_x64.exe or later.
- Linux tools for updating card firmware (package is called MFT, app is called flint): http://www.mellanox.com/page/management_tools
- Most recent firmware for the two cards mentioned above: http://www.mellanox.com/page/firmware_table_ConnectX3IB
- Files are:
MCX354A-FCBT MT_1090120019 fw-ConnectX3-rel-2_42_5000-MCX354A-FCB_A2-A5-FlexBoot-3.4.752 044a3e082f9dc6ec0ac458d3ad0274be
MCX353A-FCBT MT_1100120019 fw-ConnectX3-rel-2_42_5000-MCX353A-FCB_A2-A5-FlexBoot-3.4.752 d18b52f5464dbff50b88271e9a86de66
(*) Sometimes the Mellanox website is broken (missing CSS?). If you see pages that look like mid-90s style, try again later until the site looks modern. Each download page has a lot of blah blah blah with, at the bottom, a clickable, expandable multi-column list. Usually you also find the links to the corresponding manuals there. These manuals do not spend much time in "tutorial mode", so have fun with them.
WARNING: firmware 2.42.5000 is required by the most recent versions of MLNX_OFED, so you'll probably have to update the cards' firmware. Most of the cards on eBay are OEM cards from Dell, HP, etc., not updated with the latest firmware. The current Mellanox firmware tool will NOT let you override the OEM firmware. The workaround is to use an older version of MFT that does. After trial and error going through the older versions... but not too old (!)... I found that this version of MFT does the job: mft-4.0.0-53.tgz. You'll have to use the flint option --allow-psid-override (**). It's from memory but I think it is this option.
(**) This means that you KNOW what you're doing: burning the wrong firmware on a card will make it useless.
Because this version of MFT is old, it will be overridden when you then install MLNX_OFED. Simple process trick: have an old PC able to recognize the Mellanox PCIe cards and install only mft-4.0.0-53.tgz on it. Burn the firmware on that old PC, then move the card to the PC you'll actually use, where MLNX_OFED is/will be installed. In doing so, you do not spend your time installing MFT, installing MLNX_OFED, uninstalling MLNX_OFED and reinstalling MFT for each card you buy. That is, if you buy many cards over months while waiting for the good deal on eBay.
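For reference, a sketch of the flashing session on that old PC. The device path and firmware file name are examples (take the path from mst status and the file from the firmware table above), and double-check the exact override option with flint -h, per the (**) note:
mst start
mst status                                # e.g. /dev/mst/mt4099_pci_cr0 for a ConnectX-3
flint -d /dev/mst/mt4099_pci_cr0 query    # shows the current firmware version and the OEM PSID
flint -d /dev/mst/mt4099_pci_cr0 -i fw-ConnectX3-rel-2_42_5000-MCX354A-FCB_A2-A5-FlexBoot-3.4.752.bin --allow-psid-override burn
flint -d /dev/mst/mt4099_pci_cr0 query    # verify, then power-cycle the machine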
Note: the cards mentioned here are ConnectX-3. Do not buy ConnectX-2: too old. You will quickly see that used ConnectX-4 and ConnectX-5 cards are horribly expensive (that's because they are pretty recent and allow 100Gbps). ConnectX-3 "only" does 40 and 56Gbps.
If you do not want to update the cards' firmware (most of the recent features of MLNX_OFED are for ConnectX-4 and ConnectX-5 after all), do not use MLNX_OFED. Most of the InfiniBand daemons, drivers and tools you need are built into Linux. The kernel must be RDMA-enabled (usually the case on recent distributions); then look for the InfiniBand packages available for your distribution via apt or yum. The MLNX_OFED installer for Linux installs very similar files plus niceties (meaning the most recent bug fixes and additional tools), that's it. Mellanox even has manuals telling you which packages to install if you are not using MLNX_OFED (examples: Ubuntu_18_04_Inbox_Driver_User_Manual.pdf, Ubuntu_19_04_Inbox_Driver_User_Manual.pdf).
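As a sketch, on Ubuntu 18.04/19.04 the inbox route boils down to something like this (adjust the package list to your distribution):
sudo apt install rdma-core infiniband-diags ibverbs-utils perftest opensm
# rdma-core: userspace RDMA stack, infiniband-diags: ibhosts/iblinkinfo/ibping and friends,
# ibverbs-utils: ibv_devinfo, perftest: ib_write_bw and friends, opensm: the subnet manager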
Tip: with Ubuntu 19.04, clone https://github.com/linux-rdma/rdma-core (or use pip3) to get the new Python PyVerbs package. It exposes the InfiniBand entrails to Python. Have fun: documentation is scarce at this point (tip: read the source code and refer to the RDMA C/C++ documentation). Useful only if you want to write a Python app that uses RDMA directly, instead of simply communicating, with no change, over IPoIB.
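A build sketch, assuming the build dependencies listed in the rdma-core README are installed (Cython is needed for pyverbs):
git clone https://github.com/linux-rdma/rdma-core
cd rdma-core
./build.sh      # quick build into ./build, pyverbs included if Cython was found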
InfiniBand on ARM? Possible, but you'll have to rebuild the kernel with RDMA enabled and then build the GitHub project mentioned in the previous paragraph. That's for the adventurous, and you have to find an affordable ARM board with a PCIe slot that has enough oomph to handle the card.
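The kernel options to look for are roughly these (a sketch for a ConnectX-3/mlx4 card; exact symbols vary a bit by kernel version):
CONFIG_INFINIBAND=m
CONFIG_INFINIBAND_USER_ACCESS=m
CONFIG_INFINIBAND_IPOIB=m
CONFIG_MLX4_CORE=m
CONFIG_MLX4_INFINIBAND=m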
So after reading this, you precipitously spent about $75 + $75 + $50/$60 on eBay. You got the cards and the cable. You passed with flying colors updating the cards' firmware and you installed MLNX_OFED. What do you do from there? Let's say your 2 PCs run Linux (RedHat, CentOS, Ubuntu, SuSE).
Your two cards will probably have a green light on, which means they detected that there is something at the other end of the cable.
Run (as root or sudo) on one machine:
service opensmd start
This will start the subnet manager, opensm (note: you also do that with unmanaged switches; the subnet manager runs on one PC and manages the fabric through the IB card).
Note: you can start opensmd on several PCs, it does not hurt (they elect a master).
Both cards should now have a shining yellow light on. If not, rinse and repeat using the documentation (or just continue, maybe the LED is dead).
Tip: use ibv_devinfo, which displays the current card configuration and status (including port state and port_lid).
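Two more quick checks (the commands come with the packages mentioned above):
ibstat     # per-port state: should show State: Active and a non-zero LID once the subnet manager runs
sminfo     # shows which node currently acts as the (master) subnet manager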
On each PC:
- Load the IPoIB driver: modprobe ib_ipoib.
- Using NetworkManager's nmtui, create a new connection (InfiniBand will be in the menu) and assign a fixed IP address to the card (using a different subnet than your current one is OK, and preferable for a first experiment). On dual-port cards, the first port shows up as ib0 (the second as ib1). Activate the connection. Get out of nmtui. (A non-interactive nmcli equivalent is sketched below.)
- Check the result with ip addr show ib0 (the address you assigned should be listed).
At this point you should be able to ping the other PC over IPoIB. If not, rinse and repeat using the documentation.
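If you prefer a non-interactive route, a minimal nmcli sketch doing the same as the nmtui step (connection name and address are examples):
nmcli connection add type infiniband ifname ib0 con-name ib40 ipv4.method manual ipv4.addresses 10.10.10.1/24
nmcli connection up ib40
# optionally add infiniband.transport-mode connected for the big 64K MTU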
A few "pure" IB commands as first steps:
List hosts present on the local net
List the switches present on the local net (if you finally buy one)
List the links present on the local net (each PC should see each other)
ibping -S -d -v
Starts IB ping server on one hosts
Ping first host on the local net, the "1" is the LID of the PC acting as ping server. The LIDs are given in iblinkinfo.
Display current card configuration and status
For testing IPoIB, use iperf3 over the IP addresses you assigned to the IB cards. If you have "good" machines you should get 30+ Gbps of transfer over a 40Gbps link, and more over 56Gbps.
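A minimal iperf3 sketch (the address is the one you assigned to ib0 on the other PC; a few parallel streams help saturate the link):
iperf3 -s                          # on one PC
iperf3 -c 10.10.10.1 -P 4 -t 30    # on the other PC: 4 parallel streams for 30 seconds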
Thanks to IPoIB, you can use ssh, Samba, or whatever IP-based app over the fast link. For example, compare copying files (via Samba) over your regular gigabit network and then over IPoIB to feel the difference (mostly apparent when you copy a giant multi-GB file; with smaller files the overall speed degrades, as it does over 1GbE).
On Windows, you may have to use some obscure incantation in PowerShell to tell the SMB client to use all the NICs, and most importantly the correct one, when accessing the same machine via different names or IPs. I forgot what it was and forgot to write it down, so you'll have to look for it.
Microsoft SMB Direct does not work with Samba (because Samba does not implement it yet). So you have to use a recent Windows Server and Windows 10 if you want to evaluate SMB Direct.
If you have 3 PCs, you can still do it without a switch: you can daisy-chain using dual-port cards (speed can suffer between the two PCs communicating through the third one in the middle).
Sources of information:
- Mellanox manuals (registration is free)
- Enterprise-oriented forums (e.g. https://forums.servethehome.com/index.php, free registration)
Just be aware that this technology has been around for 10+ years, so there is plenty of obsolete information on the Net. Practice good judgement in your reading.
THE INFORMATION IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE INFORMATION OR THE USE OR OTHER DEALINGS IN THE
INFORMATION.
I did not write this post, my chimpanzee did.