Hardware

Developing a New Beowulf Architecture?

Peter Gant asks: "By day I'm the sys admin of a mixed Windows / Linux environment, but in my spare time I work on a small Beowulf system comprising eight nodes and a small rack. It's only small, but it's mine. The system is put together using 100Mb network cards and a switch, and this is the part that's been bugging me: I have a system where the slowest CPU runs at 600MHz but the network links run at (assuming a single start and stop bit) ten million bytes a second. There are of course better ways of doing this. I could rip out all of the 100Mb cards and fit gigabit ethernet running over copper or fiber, but most switches only have a single gigabit port and multiple-port gigabit switches are damned expensive. There's also the possibility of using Myrinet, but unless I mortgage the house and sell my girlfriend into slavery this isn't a realistic option." It gets more detailed in the article. If you are interested in Beowulf discussions, maybe this question will provide some grist for the grey matter.

"Both gigabit ethernet and Myrinet still have one fundamental weakness, a weakness that goes back to the original days of networking, they are a SERIAL medium. Even if you use the fastest technology possible you are still sending bits one at a time down a single pipe. it's like having a single lane highway between L.A. and San Francisco with each car running at 10,000 mph so that you can cope with the bandwidth, it might work but it's a damn silly solution. I therefore propose a new networking solution for use in cluster systems, parallel networking. This isn't as silly as it sounds because we use this solution at work to link two switches, two 100Mb network connections are concatenated together to form a single 200Mb link, but what I propose goes further.

The new system takes advantage of the seven-layer OSI model and separates the new hardware from the operating system. So far as the system is concerned each node has a single network card; the interface is where I propose the change. Every network card today includes one or more shift registers which take the parallel information off the PCI bus and convert it to a serial bit stream so that it can be sent along the network cable; when data is received the hardware operates in reverse, converting serial to parallel. The new cards replace these shift registers with thirty-two (or maybe sixteen) bit latches, and the network connector at the back of the card has (say) forty pins. This would allow the use of thirty-two pins for data and eight for handshaking, and if the new eighty-core IDE cables are used then crosstalk would not be a problem. It's a similar approach to the Digital Video Out connector on some high-end video cards that lets you connect a flat-screen monitor without going through the D-to-A convertors. Each node has its own cable connecting into the network switch which (as the connections are now thirty-two bits wide) would be a 32 x n switch, where 'n' is the number of nodes in the cluster.

Assuming that the idea can fly we would need to develop the following:

1) The new network cards. This isn't as difficult as it seems, as a lot of the work has already been done by every network card vendor. With modern ASICs the task of appearing to the system as a NIC whilst presenting the data to the port thirty-two bits at a time could be dealt with by a single chip; all it needs is someone to design the chip. If we use standard forty-pin connectors then users can buy the cables off the shelf. To keep things on track we would need to implement all of the NIC functions, including giving it a MAC address, so that a TCP/IP stack could be implemented.

2) The network switch. A network switch handling data thirty-two bits at a time is not a trivial item, but I am sure that it can be done. A number of IC manufacturers have crosspoint switches as part of their catalogue, and all that needs to be done is to scale the approach up. Given the nature of the task it might be possible to carry out the switching using a hardware-only solution, which would reduce latency even further.

3) The software. Assuming that the new cards appear on the PCI bus as an ordinary NIC, drivers should not be much of a problem. These would probably have to be developed at the same time as the network card. The drivers should include everything required for the NIC to work with the Linux kernel, but Windows drivers as well would be nice.

One final thought: this solution could also be applied to other fields. Want to build a SAN PC and wire it to a pair of servers running MySQL? Well, you now have a nice fast communication medium.

So, there you have it. Assuming this idea works, we now have a way to increase the speed of a network by reducing the latency rather than throwing more or faster CPUs at the problem. In the spirit of Open Source I do not propose to patent this idea; I want everyone to take the ideas presented here and play around with them, and if a university student is looking for his (or her) final-year project they are welcome to give this a try. Should any of you have comments regarding this idea then post away. I should however point out that I'm a great fan of practical criticism: feel free to say that the idea sucks, but if you do, say WHY it sucks and HOW it can be improved."
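For a rough sense of the numbers behind the proposal, here is a small back-of-envelope sketch (not from the submission; the 33 MHz clock assumed for the parallel link and the function names are my own) comparing the existing serial link to a 32-bit-wide one:

```python
# Back-of-envelope throughput comparison for the proposal above.
# Assumptions (not from the original post): the parallel link is clocked
# at the 33 MHz PCI rate, and the serial link frames each byte with one
# start and one stop bit, as the submitter describes.

MBIT = 1_000_000

def serial_bytes_per_sec(bit_rate, bits_per_byte=10):
    """Bytes/s on a serial link that spends `bits_per_byte` bits per payload byte."""
    return bit_rate / bits_per_byte

def parallel_bytes_per_sec(width_bits, clock_hz):
    """Bytes/s on a parallel link transferring `width_bits` per clock edge."""
    return width_bits / 8 * clock_hz

fast_ethernet = serial_bytes_per_sec(100 * MBIT)          # ~10 MB/s, as in the post
proposed_link = parallel_bytes_per_sec(32, 33 * MBIT)     # ~132 MB/s (assumed clock)

print(f"100 Mbit serial link : {fast_ethernet / 1e6:.0f} MB/s")
print(f"32-bit @ 33 MHz link : {proposed_link / 1e6:.0f} MB/s")
```

Whether that raw figure survives contact with the PCI bus and the switching hardware is exactly what the discussion below turns on.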

Comments:
  • Oh my word! (Score:5, Funny)

    by Anonymous Coward on Thursday November 14, 2002 @05:26PM (#4672068)
    Imagine a Beowulf cluster of these!
    • I think that this article [slashdot.org] might be useful to you. It describes a commercially available NIC that does much of the network processing on the NIC.

      I think that the price is a bit out of your range for now. My understanding is that they are running about $1000 each. Wait, and the price will drop, tho.

    • http://www.networkcomputing.com/1318/1318ibg12.html
  • Mmmm... (Score:5, Funny)

    by darkov ( 261309 ) on Thursday November 14, 2002 @05:29PM (#4672091)
    sell my girlfriend into slavery

    I'll give you ten bucks ... if she's cute ... and a goer.
    • Why would you want to pay for a girlfriend? Haven't you seen the Girlfriend Open Source project on SourceForge?
    • sell my girlfriend into slavery

      I'll give you ten bucks ... if she's cute ... and a goer.

      Whoa, whoa, whoa. The fact that this guy even has a girlfriend is astounding. Now you want her to be cute and, as you so delicately put it, 'a goer'?

      I think you should thank your lucky stars that she's not inflatable and leave it at that.

  • Firewire or USB2.0 (Score:3, Insightful)

    by Speedy8 ( 594486 ) on Thursday November 14, 2002 @05:33PM (#4672125) Journal
    You might want to look into a custom solution using USB 2.0 or Firewire. These can theoretically get you 300+ Megabytes per second (the limit of the PCI bus). It won't be an easy solution to pull off, but it is definitely doable.
    • The PCI Bus on consumer systems is 32-bit @ 33MHz, so peak bandwidth is around 133 Megabytes/sec (not including bus switching latency). Just enough to fit about a gigabit of bandwidth in.
      • by vipw ( 228 )
        Not to mention that neither FireWire nor USB 2.0 approaches that speed. They are 400 megabit/s and 480 megabit/s respectively. I really don't know what the original poster was on. Maybe it was deliberate misinformation; I just don't see the point of that though.
    • by Ashran ( 107876 )
      USB2: 480 MBit/s -> 60 Megabyte/sec
      Firewire: 400 MBit/s -> 50 Megabytes/sec

      which is still a lot faster than 100mbit cards
  • by reinard ( 105934 )
    With 8 nodes, I don't think you even need a switch. Just put a couple of Intel 4-port 100Mbit cards in there and link each node to every other node.
    That should give you lots more bandwidth and eliminate an expensive switch, and a few nanoseconds of latency.
    • Good idea, and great on a small number of nodes. Unfortunately, it doesn't scale well (you can only have so many 4-port cards in a system...), not to mention being a horror to maintain (adding a new node? You have to run one cable to each old node... and hope that every other node has a free port. Otherwise you have to take every node down, add a new 4-port card, and power back up!)

      Besides, I don't know the price of a 4-port NIC, but I'd imagine they're pretty pricey... considering you'd need 2 for each system in a 8-node cluster, you could probably buy a nice 24-port switch for the same cost.

      This guy is talking about a new 'standard' for a cluster of arbitrary size, not some hobbyist's hack in a basement. You need to think in terms of scalability here.

      - Jester
      • A 4-port NIC is not expensive. I bought mine for a little less than 3X the price of a regular NIC with the same chipset.
        So it's much less expensive than a hub for a small network.
        Since I've got pretty crappy hardware everywhere I can't say for sure it's the reason for the bad latency in my three-machine network, but it wouldn't be honest not to mention it.
      • If you organize your nodes in a de Bruijn graph, you get log2(N) hops max between N nodes.

        To do this, you only need four interfaces per node (actually, you need two transmit and two receive links, so you could get away with two NICs, but let's stick to four so we don't have to hack network-layer code).

        Assuming you have 2^n nodes, you number them from 0 to 2^n-1. Node i is configured to transmit to nodes 2i mod 2^n and (2i+1) mod 2^n. By extension, it receives from nodes floor(i/2) and floor(i/2) + 2^(n-1).

        This lets you get a message from one node to another in at most n hops with 2^n nodes and four (or two) NICs per node.
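To make the wiring above concrete, here is an illustrative Python sketch (mine, not the commenter's; the function names are invented) that computes each node's transmit and receive neighbours and the standard n-hop de Bruijn route:

```python
# Illustrative sketch of the de Bruijn wiring described above: with 2**n
# nodes, node i sends to (2i) mod 2**n and (2i+1) mod 2**n, and any
# message needs at most n hops.

def debruijn_links(n):
    """Return {node: (transmit_neighbours, receive_neighbours)} for 2**n nodes."""
    size = 2 ** n
    links = {}
    for i in range(size):
        tx = ((2 * i) % size, (2 * i + 1) % size)
        rx = (i // 2, i // 2 + size // 2)
        links[i] = (tx, rx)
    return links

def route(src, dst, n):
    """Standard de Bruijn route: shift dst's bits into src, one hop per bit
    (always exactly n hops, so at most n)."""
    size = 2 ** n
    path = [src]
    for bit in range(n - 1, -1, -1):              # feed dst in, MSB first
        src = ((2 * src) | ((dst >> bit) & 1)) % size
        path.append(src)
    return path

if __name__ == "__main__":
    print(debruijn_links(3)[5])   # ((2, 3), (2, 6)): node 5 sends to 2, 3; receives from 2, 6
    print(route(5, 2, 3))         # [5, 2, 5, 2]: a 3-hop route from node 5 to node 2
```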

    • Actually, I had the pleasure of meeting the fathers of the original Beowulf (Don Becker and Tom Sterling), and their story goes that it was originally built exactly like this: a bunch of 4-port cards connected in a hypercube configuration. By the way, in this case you can scale it to 2^4=16 nodes with 3 hops worst-case latency.

      The reason for this was that Don at the time was writing a Linux driver for that particular card and needed some justification for that activity... :)

      The problem with this approach is that you do message routing in software running on your computational nodes, which is not very efficient compared to dedicated hardware in a switch. Thus, switches were used ever since...

      Paul B.
    • Lots of crossover cables, eh? That would indeed make for point-to-point connections, but imagine having eth0-eth7 to manage in your startup scripts... What would /etc/hosts look like? Can these NICs all get the same IP?
  • Load balancing (Score:4, Insightful)

    by OrangeSpyderMan ( 589635 ) on Thursday November 14, 2002 @05:35PM (#4672142)
    I don't know if this is a practical possibility, but IIRC Linux 2.4.x can load-balance a single network connection over several physical NICs - could this not be a "quick and dirty" fix for your problem? This [linuxguruz.org] could be a starting point.
  • Serial is faster (Score:4, Informative)

    by darkov ( 261309 ) on Thursday November 14, 2002 @05:37PM (#4672156)
    Both gigabit ethernet and Myrinet still have one fundamental weakness, a weakness that goes back to the original days of networking, they are a SERIAL medium. Even if you use the fastest technology possible you are still sending bits one at a time down a single pipe

    As it happens, parallel interconnects' days are numbered because they are fundamentally limited as transmission speed increases. As the speed goes up you increasingly have problems with things like interactions between data lines and getting the data to arrive at the same time on each line. So, ironically, fewer lines means you can go faster and provide more bandwidth.
    • I'm sorry, but that doesn't make much sense. Interactions between data lines can easily be avoided by simple spacing or shielding. Handling data on each line is trivial as well.

      Up the serial speed as much as you want, and you can still up it with a parallel interconnect.
      • What the author of this article is talking about more than anything else are the advantages of parallel vs. serial communication. But his primary gripe is not a valid one. He states that the problem with these serial connections is one of latency. Well, I don't know where he got the crack he's smoking, but I myself would like some. If something is running at 1Gbps on either serial or parallel hardware, the serial link would actually have lower latency and therefore more potential to deliver more immediate responses. The real problem lies with the PCI bus itself being a bus and not just something that delivers a continuous stream of data. Maybe they should make a Serial ATA NIC, and then his latency issue should go out the door (assuming the Serial ATA interface is not being driven off the PCI bus). Also, it sounds like the software he's using isn't performing in a very parallel nature, unless he's trying to develop a distributed neural network (but even then sorting could be done to keep nodes with higher-weighted interconnects on the same machine, thereby reducing some relevant latency). Why does he need such a reduction in latency anyway?
      • The reason why it's harder to go fast with parallel connections compared to serial is that you need to ensure that all of the information arrives at the same time. If you have a 40-pin bus then at the other end you must be sure that all 40 pins are stable before you "grab" the data from the bus. Naturally the problem becomes more and more difficult as the speed of the bus increases (mainly because you can't wait as long for the connections to stabilize).

        This problem is avoided with serial connections.

        And as someone already pointed out, this is why Serial ATA is on the rise now.
        • The other guy who responded to me actually clarified my point.

          It's multi-serial: multiple serial connections running in parallel. A datagram gets chunked at the processor, the chunks cross the parallel serial lines, and are reconstructed at the receiving end. If you're using TCP/IP, just go by packet.
      • The problem with parallel busses is not so much crosstalk, which, as you say, can be handled by shielding, as skew. The tiny displacements caused by one conductor passing closer to, say, a large conductive mass such as a mounting screw, or having slightly less capacitance to the ground plane because of nearby vias, cause tiny changes in the transit time of the signals. As data rates get higher, these tiny skews begin to approach the length of a bit. And this problem gets worse as the bus gets longer, whereas with serial lines you can just bung on an optical transceiver and they run (nearly) for ever.

        If you look at all the high-speed "busses" being developed now, they are all going serial. Sometimes they may be multi-serial: I think 10-gigabit Ethernet is actually 4 x 2.5GHz, and InfiniBand gangs up to 12 x 2.0GHz channels. But each of these channels is actually a stand-alone serial channel with its own self-clocking data, not a classic parallel port.
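Some rough numbers behind the skew argument (my own assumptions, not the commenter's; the ~15 cm/ns propagation speed and the 3 cm length mismatch are illustrative):

```python
# Compare one bit period to the skew caused by a small length mismatch
# between two conductors in a parallel cable. All figures are assumptions
# chosen to illustrate the trend, not measurements.

SIGNAL_SPEED_CM_PER_NS = 15.0   # assumed ~0.5c propagation in a typical cable

def skew_ns(length_mismatch_cm):
    """Extra transit time of the longer conductor, in nanoseconds."""
    return length_mismatch_cm / SIGNAL_SPEED_CM_PER_NS

def bit_period_ns(per_line_rate_hz):
    """Duration of one bit on a single line, in nanoseconds."""
    return 1e9 / per_line_rate_hz

for rate in (33e6, 133e6, 1e9):            # per-line clock/data rates
    period = bit_period_ns(rate)
    skew = skew_ns(3.0)                    # a 3 cm mismatch between two lines
    print(f"{rate/1e6:6.0f} MHz: bit = {period:6.2f} ns, "
          f"3 cm mismatch = {skew:.2f} ns ({100*skew/period:.0f}% of a bit)")
```

At 33 MHz the mismatch is noise; at gigabit-per-line rates it eats a sizeable fraction of the bit, which is the point being made above.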

    • Ah, so that's why all this talk of Serial ATA. Makes sense now...thanks for taking the time to explain.

      My practical experience is that most bottlenecks I see these days are related to latency more than bandwidth.

  • Try channel bonding: add a 2nd NIC to each computer and another switch. This can double your bandwidth simply and rather cheaply.
    • Every switch vendor has something like this that interops fairly well. Cisco calls it Etherchannel. I think Extreme calls it channel bonding.

      If you are only going for 200meg, there is no point in building new switches and so forth when you can already go up to 400meg with channel bonding. I could be wrong there though, as I think you can bond up to 8 ports with Etherchannel.

      Just hit Ebay and pick up some 2900's or 3500's and link them together.
  • The bigger issue here is the length of the cable. To get decent bandwidth, parallel communication has to cut down on cable length (see SCSI, IDE). With serial communication you can go for hundreds of feet (ethernet). For a small rack, such as the one you describe, this might be possible - but cost prohibitive. Think about the cost of an external Ultra-160 cable - $100? For 1m? My prices are dated, probably, but still. Remember too that it's not bandwidth that's the problem in clusters - it's latency. That's why Myrinet is so damn costly. Even current cluster interconnects don't need more than 10Mb/s bandwidth - but they need as little latency as possible.
    • According to pricewatch.com [pricewatch.com] the current going price is $30 - $49
    • Remember too that it's not bandwidth that's the problem in clusters - it's latency. That's why Myrinet is so damn costly. Even current cluster interconnects don't need more than 10Mb/s bandwidth - but they need as little latency as is possible.

      I agree that latency is very important for some problems. But for others, bandwidth is just as important. One application my lab uses needs to transfer interface information between nodes for each timestep. So every 60 seconds of computation we need to transfer roughly 100MB of data. GigE makes this transfer take about 2s instead of the ~15s of FastE. This is significant.
  • by 3-State Bit ( 225583 ) on Thursday November 14, 2002 @05:47PM (#4672266)
    Usually one runs highly parallelizable things on clusters like this, which means that the computation can be split across nodes easily, without having to constantly share much data between nodes. If your problem isn't highly parallelizable, then 12.5 megabytes a second (because that's what 100bt is) is going to slow you down less than having a slow front-side bus. (100 MHz? The point is, if /that/ is what limits your computation, versus your processor speeds, because you aren't parallelized, then maybe a cluster isn't your best bet.)

    Consider:
    If your nodes need to share more than 12.5 megs of data a second, then you might as well be running 100 megahertz processors.

    Of course, I could just be talking out my ass.
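A quick way to sanity-check that intuition (all figures below are made-up assumptions, not from the comment) is to estimate what fraction of each timestep a node spends waiting on a 100 Mbit link:

```python
# Rough estimate of how communication-bound a node is on Fast Ethernet.
# The per-step data volumes and compute times below are invented examples.

LINK_BYTES_PER_SEC = 12.5e6        # ~100 Mbit/s Fast Ethernet, as in the comment

def comm_fraction(bytes_per_step, compute_seconds_per_step):
    """Fraction of wall-clock time spent moving data instead of computing."""
    comm = bytes_per_step / LINK_BYTES_PER_SEC
    return comm / (comm + compute_seconds_per_step)

# A node that exchanges 1 MB per step and computes for 1 s barely notices
# the network; one that exchanges 50 MB per 0.5 s of compute is mostly idle.
print(f"{comm_fraction(1e6, 1.0):.1%} of time spent communicating")
print(f"{comm_fraction(50e6, 0.5):.1%} of time spent communicating")
```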
  • by Outland Traveller ( 12138 ) on Thursday November 14, 2002 @05:49PM (#4672282)
    I've been experimenting with Gigabit Ethernet lately.

    The good news is that it's less expensive than you think. Decent cards are only marginally more expensive than good 100bT cards, and Netgear now makes a reasonably priced 8-port gigabit switch. It doesn't support jumbo frames, but it's quite usable for small networks.

    The bad news is that I'm finding that gigabit ethernet doesn't deliver the performance you might expect using traditional network protocols. NFS in particular sees only modest gains, even when using nfsv3 and increasing the block sizes and tuning the kernel buffers/TCP options. I'm still showing bandwidth bottlenecks on the network when I should be seeing bandwidth bottlenecks on the disk array.

    It would appear that something isn't scaling. Given that network benchmarking tools do show gigabit ethernet performing at a reasonable speed, it would appear that most "legacy" protocols are not architected to take advantage of it.
    • What's your ttcp raw TCP throughputs?
    • You just said it: your switch doesn't support jumbo frames. This is critical to getting decent performance out of GigE. The explanation is simple: GigE still uses CSMA/CD, even though I don't know of any GigE hub. Every time a NIC needs to send out a frame, it needs to listen to see if anyone is already transmitting. This is defined as a fixed length of time that is the same from 10Mbps to GigE, to allow compatibility. So the longer your frames are, the smaller the relative time wasted waiting. Also, the ethernet frame header is the same no matter what the frame size is, so the overhead cost gets less important with larger frames.
    • Regarding cheap gigabit ethernet controllers, Intel has a rather new one (I think), the "INTEL PRO/1000 MT DESKTOP ADAPTER". Here in Sweden it's only ~$10 more than a comparable 10/100 card.
    • It doesn't support jumbo frames

      The bad news is that I'm finding that gigabit ethernet doesn't deliver the performance you might expect using traditional network protocols

      I think you answered your own question here. My experience is that jumbo frames are absolutely necessary to get the full performance out of Gig Ethernet (we get 600 Mbit/sec of user data between two points on GbE). Ordinary packets are just too damned small to be able to fill the pipe.

  • current limitations? (Score:4, Informative)

    by call -151 ( 230520 ) on Thursday November 14, 2002 @05:51PM (#4672298) Homepage
    One thing that people don't always realize with Beowulf clusters is the need to decide what the bottleneck is. There are basically three possible limitations:
    • Processor-bound clusters- these have adequate network and storage support and are held back by the number of CPUs.
    • Network-bound clusters- these have inadequate network capability and much of the time the processors are waiting for information over the network.
    • Storage-bound clusters- these can have adequate processor-to-processor network capabilities, but share a slow hard drive, so time is spent waiting for IO.

    Of course, the same cluster can be bound in different ways depending upon the applications that are being run. It is important to realize what the limitations are for your desired tasks and focus your improvements there. I have seen several clusters where they spent an ungodly amount of money on Myrinet and a massive amount of time getting it working when they were running easily-parallelizable tasks that were really bound just by the number of CPUs.
  • by tolldog ( 1571 ) on Thursday November 14, 2002 @05:57PM (#4672365) Homepage Journal
    This is why a beowulf cluster is not always the answer.

    Depending on what you are trying to solve, the problem may need to be split up differently. The algorithm to solve the problem and the system you are using need to match well.

    Beowulf is great for CPU-intensive tasks with low network usage. Other forms of clustering are good for problems that use shared memory (but this starts to nail the network). Some tasks split up so that a simple queuing system is all that's needed to do the work: all you have to do is have a manager job determine who does what and whether it was done.

    -Tim

    • Indeed - and it depends on what your cluster is supposed to be good for. Ideal for "number crunch" clustering are tasks that require low bandwidth and high CPU performance - like movie rendering or testing alternative simulation parameters. For the latter, projects like
      SETI@home [berkeley.edu],
      distributed.net [distributed.net] or
      Folding@home [stanford.edu] have become famous. Mostly CPU work, negligible network load. For SETI@home I have an average network throughput of ~50 bit/second. To saturate a 100Mbit/s network (not even switched) with SETI@home you'd need approx. one million (1,000,000) PCs.

      As for the network - do you need throughput or low latency? Depending on your problem, small changes in the algorithm can do wonders. E.g. for film rendering you might choose a few NAS boxes and a horde of dumb/diskless rendering slaves. If you copy the model libraries (for the included figures, textures, etc.) onto a local disk at the beginning of a scene render run, you will decrease net load a great deal (I've done that with POV-Ray [povray.org] rendering myself).

      If you don't have the resources to buy e.g. Myrinet, try alternative architectures that might fit your problem, e.g. hypercubes (see other posts) or models like Flat Neighbourhood [aggregate.org].

      • I couldn't agree more. I have done work with hypercubes... sure it was only a stupid sort algorithm, but it was order N when there were N Procs. (ok... it was 2N, but we know the multiplier is ignored).

        Too often people jump on the "beowulf mentality" without really knowing what is out there and how they differ.

        When we render, we copy the scene file local and render the images local. Then the images get copied back "online" so that the NFS servers do not get killed with open files for long periods of time. When it takes 20 some hours to write 1.8 mb of data over an NFS connection (with over 500 machines doing the exact same thing) the system is less than optimized.

        I was just thinking about creating cheap hypercubes with networking over firewire. As long as the routing is set up, you could have 4 connections on a machine plus the 10/100/1000 ethernet connection. A hypercube only needs 3, and one could add multiple cards. I know that PVM works well in such a situation as long as the procs initialize in the right order (proc 1 is always proc 1). Maybe this would be a good research project.

        I know when I was at school we worked on an 8-node system set up as a hypercube (it had 2 groups of 4 procs, so some of the "hypercube connections" were over a SCSI bus instead of a true connection between the procs). The procs were slow and the machine was expensive. Now something faster (with a slower backplane) can be built for less.

        -Tim
  • by QuietRiot ( 16908 ) <cyrus@80[ ]rg ['d.o' in gap]> on Thursday November 14, 2002 @06:24PM (#4672594) Homepage Journal
    Write a driver that passes data between machines via the SCSI interface. Put each host controller in the chain on its own ID, tie the networking part of the kernel into the SCSI part of the kernel, wave your magic wand and - **Presto!** Fast, parallel communications (with a lot of the headaches of the communication protocol taken care of by the SCSI command set, which allows for "concurrent" connections between multiple "devices" easily).

    To scale, put multiple controllers in a high-bandwidth machine, moving data between chains. With 8 machines, there'd be no need because they could all fit on one chain.
  • DUDE! (Score:4, Funny)

    by floydigus ( 415917 ) on Thursday November 14, 2002 @06:27PM (#4672633)
    Imagine one node of this thing on its own!
  • Why don't you just go out and buy a Cray???
  • I remember a while ago an article (which I believe was on /.) about a Beowulf-like system that had two or three 100Base-T ethernet cards per box instead of Gigabit. IIRC the author claimed that this was actually faster than a gigabit network due to various things (overhead with Gigabit, some other stuff). I also recall that one of the interesting aspects of the system was the topology of the boxen and switches. Apparently he had done some analysis to determine how everything should be wired for maximum throughput / minimum switch saturation.

  • With my beowulf underperforming, I was considering this idea:

    The nodes are kept REALLY close together... maybe stacked. Custom-made PCI cards have all 32 of their data pins ganged together... however, a simple and really fast (66MHz?) 32-bit latch or multiplexer is needed for each PCI card. The 'bus' of 32 wires runs across all the nodes... practically this should be no more than 2 nodes. The CPU sends PCI data to a specific address.

    Now the PCI card has one address it listens on from the outside: the 32 wires. On the inside it simply blasts out all PCI data to the other PCI cards, a bit like ethernets connected to a hub. Unless we can build a switching mechanism, 2 nodes will be easy to start with. (Update: we will need a simple buffer to counter the lack of synchronization between the PCI clocks of the nodes.) So when PCI data comes in from the CPU, the card blasts it out onto the 32 wires... and the other card listens; if it hears its address, it puts the data in its buffer, which on the next clock interrupts the CPU and the data goes to the CPU.

    Other variations include using DMA to put the data straight into a software-allocated region of memory. Please note I'm thinking of multiprocessing systems rather than this gentleman's beowulf. With shared-memory IPC, we can have a ball.

    Using PCI directly has a distance limitation... not more than 2 feet I think. But I don't remember if this was for 33MHz or 66. I am also running the risk of not understanding the workings (and other unknown limitations) of PCI.

    I hereby patent these Ideas under the BSD license.
  • Oracle has just released [slashdot.org] some firewire drivers/patches that they claim can be used to create firewire clusters. We use it at my work for networking on W2K machines using this [unibrain.com] and it's really fast, with lower latency than a 100Mbit network.
  • Wouldn't a parallel network have higher latency? As you increase the speed of a serial connection, it takes less time for data to get to its destination. If you have a parallel network, no matter how many paths you add, it still takes the same amount of time for data to get to its destination.

    Am I missing something? I'm not an expert on networking.
  • Would it be possible to add a few firewire cards to the master node, and connect each node to the master pc?

    I only mention this, because IEEE 1394 was created with networking in mind, as it (should) work fundamentally the same as ethernet. It runs at 400mbps (which is really what most copper gigabits are limited to).

    In addition, once you pass the 400mbps mark, you also have to factor in the bus speed limitation of the nodes in the cluster. THE 32-BIT PCI BUS CANNOT TRANSFER HUGE AMOUNTS OF DATA. 64-bit PCI is available on newer, expensive workstation boards, but as far as costs are concerned, it's almost as impractical as Myrinet (for which a NIC will set you back about $1500, and requires PCI-64 as well).

    A final suggestion: why not put 3-4 100mbps ethernet cards in each node? I believe this is possible, and has been done before... Since 100mbps cards cost under $10, this seems like a very attractive solution. For even more speed, it might be possible to put each NIC on a different subnet, and use 3-4 switches - but I don't know if this would be possible or even help.

  • Picture the following scenario (we'll use 4x parallelism as an example; you could go with whatever number you fancy):

    You have an alternate protocol, let's call it PIP (Parallel IP), and you code for it in your network stack. It acts like IP, and normal TCP/UDP will run on top of it. On a per-interface basis you decide that your public card talks IP and the 4 cards on the private beowulf network talk PIP.

    In some config file, you define that eth1-eth4 are part of a PIP group with a single IP address and four MAC addresses. You probably have to define all the beowulf nodes in a central file that you distribute to all the nodes (so they know the mac/ipaddrs of the remote nodes).

    The PIP stack takes each would-be IP frame and, instead of encapsulating it in an ethernet frame bound for the single MAC address of the recipient IP address, it splits the packet into four chunks with a unique serial number (say 34534568) and sequence numbers 1-4 in an extremely short PIP header that goes on top (inside the ethernet framing). You look up the 4 MACs for the destination IP (PIP) address, and you send chunk #1 out your MAC #1 destined for their MAC #1, etc...

    On the receiving side, the PIP stack looks for all four sequence numbers to reconstruct packet 34534568. When and if it receives all four, they are reconstructed and passed upwards. After some minimal timeout value (a few milliseconds at most?), if only some of the sequence numbers have been received, toss the packet out and let TCP (or the app in the case of UDP) deal with the lost packet; after all, IP isn't guaranteed delivery.

    I'm of course leaving out lots of little implementation problems, but that's what you get in a 5-minute slashdot idea :) The nice thing about a plan like this is that it's transparent to your Layer-2 switches (layer-3 switching wouldn't understand the packets though), and it's transparent to UDP and TCP applications.

    You might also want to handle ICMP in some custom fashion, like sending a ping down all four mac-pair-paths and telling the ping program you got a valid echo when/if you see all four replies.

    The downside of course is that it's not really guaranteed to give any latency decrease, due to the problem of lost packets and waiting for all four to sync up and whatnot - although it would almost definitely result in an overall bandwidth increase for a TCP stream.

    Anyways, food for thought...
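As a purely illustrative toy, here is roughly what the splitting and reassembly described above could look like (the header layout, field sizes, and function names are my own invention, not part of any real protocol):

```python
# Toy sketch of the "PIP" striping idea: split one would-be IP payload into
# N chunks tagged with a serial number and a sequence number, then
# reassemble on the far side (or give up and let TCP retransmit).

import struct
from itertools import count

PIP_HDR = struct.Struct("!IHH")        # serial, sequence, total chunks
_serials = count(1)

def pip_split(payload: bytes, lanes: int = 4):
    """Return `lanes` framed chunks; chunk i would be sent out NIC/MAC pair i."""
    serial = next(_serials)
    size = -(-len(payload) // lanes)                   # ceiling division
    return [PIP_HDR.pack(serial, i, lanes) + payload[i * size:(i + 1) * size]
            for i in range(lanes)]

def pip_reassemble(frames):
    """Rebuild payloads whose chunks all arrived; incomplete ones are dropped."""
    groups = {}
    for frame in frames:
        serial, seq, total = PIP_HDR.unpack_from(frame)
        groups.setdefault(serial, (total, {}))[1][seq] = frame[PIP_HDR.size:]
    complete = {}
    for serial, (total, parts) in groups.items():
        if len(parts) == total:
            complete[serial] = b"".join(parts[i] for i in range(total))
    return complete

if __name__ == "__main__":
    frames = pip_split(b"example datagram payload " * 10)
    print(pip_reassemble(frames))      # one reassembled payload keyed by serial
```

The real work, of course, is doing this inside the network stack per interface group rather than in user space, which is exactly the part the comment hand-waves.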

    • Another thought on the pros and cons

      Pro: You could hook eth1 of all the machines to switch1, and eth2 all to switch2, etc... and use independent switches for each parallel communication path; it would give better performance than running it all into one switch and would still work.

      Con: N-way parallelism in this fashion multiplies your chance of total network failure by N (4-way means you've now got 1/4 the normal MTBF on your network cards/switches/etc). Of course the biggest con is that going 8-way would require a lot of PCI slots. I think some PCI card vendors make dual ethernet cards, but I'm not sure if any are linuxable. Sun makes a quad fast ethernet PCI card for their boxes; maybe the driver from sparclinux could work with a few changes, putting the card in an x86 box?
  • ...as there's a built-in /.() function. :-)
  • You don't say why you need optimum performance, but I'm guessing that since you are a sysadmin, you had some parts and some curiosity.

    Just what is it you do with this hobby, other than annoy your gf?
  • The problem in most beowulf-type clusters is not bandwidth, it's _latency_. That's what you pay for in Myrinet cards: an on-card CPU which gives ridiculously low latencies. And that's what Big Iron has: backplanes which allow not only high throughput, but especially low latency between CPUs.

    The area where it's useful to use a cluster is in problems where you can split the calculation in small pieces. Typically you don't need to send tons of data across processors, but you need to do it often and quickly. That's where latency kills you.

    So you have an idea which is a solution looking for a problem. Now, since others have already pointed out that high-bandwidth parallel systems are ultimately a bad idea, it turns out you have a _bad_ solution looking for a problem.

    I'm not saying that there aren't ways of improving life in the clustering world. But yours isn't one of them. Sorry.
    • The area where it's useful to use a cluster is in problems where you can split the calculation in small pieces.

      hmm, not to sound too much like a purist, but it is useful to use a cluster when you can split the calculation in LARGE pieces. When pieces are SMALL you need communication often and get bit by latency...

      Paul B.
    • Absolutely wrong,[well... not absolutely :)]

      Latency is nowhere near as important as bandwidth. If you can successfully overlap communication with computation in say an MPI application you achieve much better wall clock times for the runs of big jobs.

      I have no idea why people are so obsessed with latency in clusters... perhaps the apps are coded with all blocking communication semantics and cannot do the overlap I described above. Basically if you can stream data to the processors and keep them busy ... that's all you need... latency may or may not be important in that situation.

      I actually develop a commercial MPI implementation and we have seen real evidence of my claims above. If you are going to SuperComputing 2002 in Baltimore Maryland next week stop by the MPI Software Technology Inc. booth [I won't actually be there... but some of my demos will be]... ask some questions and try to talk to one of our technical guys... they may be better able to explain this since I am not a senior engineer. :)

      [This should be taken not as a representative statement on behalf of the company I work for but as my own humble interpretation of the real issues].
  • One of the things with beowulf clusters is that it's not necessarily the bandwidth that you need to worry about, but rather the latency of the network. For instance, if I ping a machine on my 100mbps network, I get a round trip time of .350 ms. Myrinet, on the other hand, has latencies in the single digit microseconds. Those interfaces are there less for the amount of data that they can push, and more for their response time, giving the cluster a more "whole" feel.

    If your goal is to increase bandwidth, I would use 4-port 100Mb cards and make a hypercube out of the thing. You won't be limited by the backplane of your switch (which is what? if it's a $100 switch, don't expect to be pushing 3 machines at full keel down it), since the nodes will have dedicated 100mbps connections to each other. Relatively cheap, and there are many examples out there.

    You might also take a look at SCALI; although it comes into the price range of Myrinet, I think it does exactly what you want.

    Gigabit switches with all ports being gigabit are about $450, and are quite speedy.

    One option is for you to look at VIA over gigabit, or to use PM (GM is for Myrinet; this is from the Real World Computing something-or-other in Japan). Essentially, the issue with doing regular TCP/IP communications is that TCP/IP is slow, but pretty fault tolerant. These other protocols are not nearly so fault tolerant, but since they are on dedicated networks, they don't need to be.
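For anyone wanting to measure the kind of application-level round-trip latency being compared here (0.350 ms ping vs. single-digit microseconds), a minimal sketch (mine, not the commenter's; host and port values are placeholders) is to bounce a small UDP datagram off an echo server and time it:

```python
# Minimal UDP round-trip timer: run echo() on one node and rtt("<that
# node's address>") on another. This measures what an application sees,
# including the whole OS network stack, not just the wire.

import socket
import time

def echo(port=9000):
    """Trivial UDP echo server."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", port))
    while True:
        data, addr = sock.recvfrom(64)
        sock.sendto(data, addr)

def rtt(host, port=9000, samples=1000):
    """Median round-trip time, in microseconds, for a 16-byte datagram."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(1.0)
    times = []
    for _ in range(samples):
        start = time.perf_counter()
        sock.sendto(b"x" * 16, (host, port))
        sock.recvfrom(64)
        times.append(time.perf_counter() - start)
    times.sort()
    return times[len(times) // 2] * 1e6

# Example: print(rtt("192.168.1.2"))   # address is a placeholder
```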
  • HIPPI [hippi.org] is a high speed parallel interconnect.

    You'd need to design a switch for it yourself however as it's a point to point system.
    It currently tops out at 1600Mb/s which is greater than what 33 MHz PCI can do anyway.

    I'm pretty sure the 1394b (Firewire) standard was designed to go up to 1600Mb/s as well.
  • You notice those squiggly lines on your motherboard? They're not just there to look pretty; they actually slow the signal down so that the 8, 16, 32, 64, etc. parallel bits all arrive at their destination at the exact moment before the clock signal latches in the data.

    Parallel solutions on flexible cable will not be viable anywhere near the average front-side bus speed, never mind faster interfaces. Sorry.

    A not too recent slashdot article pointed out a unique way to manage large data transfers among clusters though:

    As a simple example, each computer has N 100Mb interfaces, and there are a total of J computers. You use K hubs (or, ideally, switches), and connect them together such that each computer has direct hub access to a portion (or all) of the cluster. You can hook the hubs together using a managed switch or router if you want to enable them to contact all the computers in under two hops, without routing through other computers. This is very flexible - you could have it set up so each computer has more than one line going to other computers. Using managed switches you could even reconfigure it on the fly, though your latency goes up.

    The idea here is that each computer has a possible bandwidth of multiple times the network card speed. Furthermore, collisions are down, and latency is down. Throughput can reach higher than the practical 80% of ethernet bandwidth.

    The real problem with ethernet is latency and collisions. The reason it's used is because it's dirt cheap. Dirt Cheap. Commit that to memory. It's often better to buy a lot of slow devices than one expensive really fast device. You can overcome most of the limitations of the slow devices with other tricks.

    So, while your idea has some merit, the reality is much different than the ideality. :-)

    -Adam

  • Use multiple firewire ports. Many newer chipsets are coming with 2 firewire ports built in, NOT on the PCI bus, so you wouldn't have the PCI bus limit of ~132MByte/second. That's 2 built in, plus whatever you wanted to put on the PCI bus.

    Also, 66MHz/64-bit PCI has 528MByte/second available on the bus, so firewire on 64/66 PCI would give you ~11 ports at full speed.

    Many machines that have 64/66 PCI have multiple PCI buses, so you have 528MByte/second per bus.
  • Really, the thing that needs to be done is to improve the *protocol* that the communication rides upon. If you're using gigabit, usually the limiting factor is the processor - you max out the processor (doing the overhead for TCP/IP) before you even come close to filling up the gigabit pipe. There are technologies like Infiniband which try to address that, but progress has been very slow.

    There is a big push for TCP offload - to take the effort for TCP and put it into an ASIC - check out this paper [10gea.org]. TCP offload is absolutely necessary if we want to go beyond 1 gigabit ethernet any time soon, because the rate at which we can communicate is beating the rate of increase in processing. IIRC Moore's law is an 18-month doubling time; there is a similar law for communications speed that says it doubles at a 12-month rate.

    In addition, the engineering hassle of making a device with so many parallel channels is considerable - when you start doing high-speed communications, you have to take a *lot* of care to make sure that the signals get there as you want them. This is why Intel appears to be in the process of moving away from PCI (a parallel bus) and moving to PCI-Express/3GIO, a combination parallel/serial bus with a variable number of serial channels. PCI-Express is designed to carry on 16 wires what PCI-X carries on 80+ wires. When you actually have to make hardware, this makes a *HUGE* amount of difference.
  • Hmmmm - use lots of parallel wires at a lower speed rather than higher-speed serial links. Sounds like you just rediscovered the idea that drove HIPPI (the High-Performance Parallel Interface) [everything2.com] back in the late 80's.

    Actually - there are a lot of other posts here that talk about why parallel isn't necessarily better. At high speeds with long cables it can be a pain in the ass to keep all the bits lined up. Sometimes it is easier to develop hardware that does serial faster than it is to add more wires and do it in parallel.

  • You've got an interesting idea here but, frankly, I think you have forgotten a couple of steps along the way.

    Whenever you increase the performance of one area in a system, another area becomes the bottleneck. Right now, your bottleneck is assumed to be the network.

    As you said, you could possibly do bonding on the interfaces to multiply the bandwidth. However, to accomplish this you will need a better switch that is capable of this bonding. Nortel offers a small switch that can do this; the BayStack 450 is the lowest-end switch capable of what they call multi-link trunking. Cisco offers several switches that are also capable of this. Cisco calls it EtherChannel, and I think the Catalyst 3500 is their entry-point switch with this capability. These switches can bond up to 4 links, I believe, giving you up to a 400Mbps trunk per switch. To do this with multiple systems you would really need a larger and significantly more expensive switch.

    You dismissed Gigabit saying that you will suffer the same limitations due to serial communication. However, I believe that Gigabit will be the best solution in this case. Your gigabit performance will not be limited by serial communication as you think. Gigabit communication will be limited by the PCI bus in the system. In fact, your system will not be able to drive gigabit cards beyond 600Mbps because of the PCI bus speed. If you are plugging a disk controller into this PCI bus and further sharing the PCI bandwidth you will further limit the performance that you can expect from your gigabit cards.

    There are only two ways around this limitation, and only by combining them will you be able to truly maximize gigabit performance. The first method is to have a system with dual peer PCI buses. This allows you to run disk controllers on one bus and network controllers on another, giving each controller the full bandwidth of its bus. This still limits you to around 600Mbps though. To go beyond that, you need to have a 64-bit PCI bus, preferably dual peer. The drawback here is that you will need to acquire a true high-end server to get these features. Desktop systems and low-end servers do not have 64-bit PCI buses and rarely have dual peer PCI buses.

    So, basically gigabit ethernet will provide the highest possible performance on your existing systems. If you want to go beyond that you need to replace your systems with high-end servers with multiple 64 bit PCI buses, running multiple gigabit NICs, multi-link trunked to a high-end gigabit switch.

    One final note about your parallel processing idea: it's a good idea. It's also been done, more or less. 3Com NICs have had a somewhat similar feature that they call "Parallel Tasking" for years now. The Parallel Tasking cards offload network operations to a processor on the NIC, leaving the main CPU free for other operations. I'm sure there are other NIC manufacturers that do similar things, but 3Com has always hyped up this feature, so they are the first to come to mind. But no matter how fast your processor and your NIC card, you still have to cross a PCI bus, and at 32 bits wide it is the PCI bus that will be your bottleneck.
  • For the record, a quick and easy way to increase your network performance is to use quality network cards. Once upon a time I believed that 100Mb cards all moved data at 12.5 megabytes per second (the theoretical max throughput at 100 megabits per second), but testing showed that good cards (Intel, SMC, 3Com, etc...) moved data upwards of TWICE as fast as the cheap cards (Addtron, Eagle, NE2000 clones, etc...). I didn't test latency but envision similar results.
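In the same spirit, a crude way to compare cards is to stream a fixed amount of data over a single TCP connection and report the achieved rate. This sketch is my own illustration of such a test (the port and transfer size are arbitrary assumptions):

```python
# Rough single-stream throughput check: run sink() on one node, then
# source("<that node's address>") on the other and read off the MB/s.

import socket
import time

TOTAL_BYTES = 100 * 1024 * 1024          # 100 MB test transfer
CHUNK = 64 * 1024

def sink(port=9001):
    """Accept one connection and drain TOTAL_BYTES from it."""
    srv = socket.socket()
    srv.bind(("", port))
    srv.listen(1)
    conn, _ = srv.accept()
    received = 0
    while received < TOTAL_BYTES:
        data = conn.recv(CHUNK)
        if not data:
            break
        received += len(data)
    conn.close()

def source(host, port=9001):
    """Send TOTAL_BYTES of zeros and print the achieved rate."""
    sock = socket.create_connection((host, port))
    buf = b"\0" * CHUNK
    start = time.perf_counter()
    sent = 0
    while sent < TOTAL_BYTES:
        sock.sendall(buf)
        sent += len(buf)
    sock.close()
    elapsed = time.perf_counter() - start
    print(f"{sent / elapsed / 1e6:.1f} MB/s")
```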
  • Thanks So Far (Score:2, Informative)

    by Anonymous Coward
    Thanks to all the posters so far; there are a couple of good ideas here. To all of you who suggest using Gigabit or Firewire: you've missed the point, because what I'm talking about here is a way of sending packets across a parallel connection and (possibly) getting a 16x increase in speed. The poster who suggested linking the PCI bus of each node using latches, please get in touch - this is EXACTLY what I am looking for.

    Finally, a few of you have suggested that bit skew would be a problem due to cable length at high speeds. I don't think this is so, otherwise ATA 100/133 hard drives wouldn't work.

    As I suggested in the original post: is there a Computer Engineering student who would like to take this up as a final-year project?

    Peter Gant

  • It sounds like you want a NIC with two 100mbit links that acts as one 200mbit link. OK, but how is that really a new architecture compared with using some sort of software channel bonding? It is built into Linux, and was developed (in Linux at least) for the original Beowulf cluster. With it, one can easily use one or two large switches and a couple of quad fast ethernet cards to get 400 or 800 mbits of bandwidth in and out of nodes.

    And the idea is also firmly established in the NAS market. I don't think it is in use in the SAN market, but that is probably because most SAN vendors are using either GigE or FC-AL. However, once iSCSI is more commonly available (I believe it is in progress for NetBSD; I don't know what the Linux status might be), I'd imagine running it over channel-bonded links would also be no trouble.
