Developing a New Beowulf Architecture? 86
"Both gigabit Ethernet and Myrinet still have one fundamental weakness, one that goes back to the original days of networking: they are a SERIAL medium. Even if you use the fastest technology possible you are still sending bits one at a time down a single pipe. It's like having a single-lane highway between L.A. and San Francisco with each car running at 10,000 mph so that you can cope with the bandwidth; it might work, but it's a damn silly solution. I therefore propose a new networking solution for use in cluster systems: parallel networking. This isn't as silly as it sounds, because we use this approach at work to link two switches, where two 100Mb network connections are concatenated to form a single 200Mb link, but what I propose goes further.
The new system takes advantage of the seven-layer OSI model and separates the new hardware from the operating system. As far as the system is concerned each node has a single network card; the interface is where I propose the change. Every network card today includes one or more shift registers which take the parallel data off the PCI bus and convert it to a serial bit stream for the network cable; when data is received the hardware operates in reverse, converting serial to parallel. The new cards replace these shift registers with thirty-two-bit (or maybe sixteen-bit) latches, and the network connector at the back of the card has (say) forty pins. This would allow thirty-two pins for data and eight for handshaking, and if the new eighty-conductor IDE cables are used then crosstalk would not be a problem. It's a similar approach to the Digital Video Out connector on some high-end video cards, which lets you connect a flat-screen monitor without going through the D-to-A converters. Each node has its own cable into the network switch which (as the connections are now thirty-two bits wide) would be a 32 x n switch, where 'n' is the number of nodes in the cluster.
Assuming that the idea can fly we would need to develop the following:
1) The new network cards. This isn't as difficult as it seems, as a lot of the work has already been done by every network card vendor. With modern ASICs, the task of appearing to the system as a NIC while presenting the data to the port thirty-two bits at a time could be handled by a single chip; all it needs is someone to design it. If we use standard forty-pin connectors then users can buy the cables off the shelf. To keep things on track we would need to implement all of the NIC functions, including giving it a MAC address, so that a TCP/IP stack could be implemented.
2) The network switch. A network switch handling data thirty-two bits at a time is not a trivial item, but I am sure it can be done. A number of IC manufacturers have crosspoint switches in their catalogues, and the approach just needs to be scaled up. Given the nature of the task it might be possible to carry out the switching with a hardware-only solution, which would reduce latency even further.
3) The software. Assuming the new cards appear on the PCI bus as an ordinary NIC, drivers should not be much of a problem. These would probably have to be developed at the same time as the network card. The drivers should include everything needed for the NIC to work with the kernel, but Windows drivers as well would be nice.
One final thought: this solution could also be applied to other fields. Want to build a SAN PC and wire it to a pair of servers running MySQL? Well, you now have a nice fast communication medium.
So, there you have it. Assuming this idea works, we now have a way to increase the speed of a network by reducing the latency rather than throwing more or faster CPUs at the problem. In the spirit of Open Source I do not propose to patent this idea; I want everyone to take the ideas presented here and play around with them, and if a university student is looking for his (or her) final year project they are welcome to give this a try. Should any of you have comments regarding this idea then post away. I should point out, however, that I'm a great fan of practical criticism: feel free to say that the idea sucks, but if you do, say WHY it sucks and HOW it can be improved."
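For what it's worth, the claimed win of a 32-bit-wide link can be put in numbers with a toy timing model (the clock rate and frame size below are illustrative assumptions, not measurements, and the model deliberately ignores skew and signalling issues):

```python
# Back-of-the-envelope model of the proposal above: a link that moves
# 32 bits per clock versus a serial link at the same clock rate.
# All numbers here are illustrative assumptions, not measurements.

def transfer_time(payload_bits, clock_hz, lanes):
    """Seconds needed to move the payload at one lane-width per clock."""
    clocks = (payload_bits + lanes - 1) // lanes   # ceil division
    return clocks / clock_hz

PAYLOAD = 1500 * 8          # one Ethernet-sized frame, in bits
CLOCK = 33_000_000          # assume a 33 MHz PCI-style clock

serial = transfer_time(PAYLOAD, CLOCK, lanes=1)
wide = transfer_time(PAYLOAD, CLOCK, lanes=32)

print(f"serial: {serial * 1e6:.1f} us, 32-bit wide: {wide * 1e6:.2f} us")
print(f"speedup: {serial / wide:.1f}x")
```

On this naive model the wide link is simply 32x faster at the same clock; several replies in the thread explain why real hardware doesn't let you keep that clock rate as you add lines.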
Oh my word! (Score:5, Funny)
TOE's (Score:1)
I think that this article [slashdot.org] might be useful to you. It describes a commercially available NIC that does much of the network processing on the NIC.
I think that the price is a bit out of your range for now. My understanding is that they are running about $1000 each. Wait, though, and the price will drop.
TOES (Score:1)
Mmmm... (Score:5, Funny)
I'll give you ten bucks
Re:Mmmm... (Score:2)
Re:Mmmm... (Score:1)
I'll give you ten bucks ... if she's cute ... and a goer.
Whoa, whoa, whoa. The fact that this guy even has a girlfriend is astounding. Now you want her to be cute and, as you so delicately put it, 'a goer'?
I think you should thank your lucky stars that she's not inflatable and leave it at that.
Firewire or USB2.0 (Score:3, Insightful)
Re:Firewire or USB2.0 (Score:1)
Re:Firewire or USB2.0 (Score:2, Informative)
Re:Firewire or USB2.0 (Score:3, Informative)
Firewire: 400 MBit/s -> 50 Megabytes/sec
which is still a lot faster than 100mbit cards
Re:Firewire or USB2.0 (Score:2)
Mind you, it isn't cheap yet.
Eliminate the Switch (Score:2, Insightful)
That should give you lots more bandwidth and eliminate an expensive switch, and a few nanoseconds of latency.
Re:Eliminate the Switch (Score:2)
Besides, I don't know the price of a 4-port NIC, but I'd imagine they're pretty pricey... considering you'd need 2 for each system in a 8-node cluster, you could probably buy a nice 24-port switch for the same cost.
This guy is talking about a new 'standard' for a cluster of arbitrary size, not some hobbyist's hack in a basement. You need to think in terms of scalability here.
- Jester
Re:Eliminate the Switch (Score:2)
So it's much less expensive than a hub for a small network.
Since I've got pretty crappy hardware everywhere I can't say for sure it's the reason for the bad latency in my three-machine network, but it wouldn't be honest not to mention it.
deBruijn graph (Score:2)
To do this, you only need four interfaces per node (actually, you need two transmit and two receive, so you could get away with two, but let's stick with four so we don't have to hack network-layer code).
Assuming you have 2^n nodes, you number them from 0 to 2^n - 1. Node i is configured to transmit to nodes 2i mod 2^n and (2i+1) mod 2^n. By extension, it receives from nodes floor(i/2) and floor(i/2) + 2^(n-1).
This lets you get a message from one node to another in at most n hops, with 2^n nodes and four (or two) NICs per node.
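As a sanity check on the arithmetic above, here is a small sketch of the de Bruijn wiring and a greedy route over it (the function names are mine; `n` is the dimension, giving 2**n nodes):

```python
# Sketch of the binary de Bruijn wiring described above, for 2**n nodes.
# Routing works by shifting the destination's bits in from the top:
# after n shifts the current node number IS the destination.

def tx_neighbors(i, n):
    """Nodes that node i transmits to."""
    size = 1 << n
    return ((2 * i) % size, (2 * i + 1) % size)

def route(src, dst, n):
    """Greedy de Bruijn route; every step lands on a tx neighbor."""
    size = 1 << n
    path, cur = [src], src
    for k in range(n - 1, -1, -1):
        bit = (dst >> k) & 1
        cur = ((2 * cur) % size) | bit   # shift left, inject next bit of dst
        path.append(cur)
    return path

n = 4  # 16 nodes
worst = max(len(route(s, d, n)) - 1 for s in range(16) for d in range(16))
print("worst-case hops:", worst)   # never more than n = 4
```

This naive route always takes exactly n hops; a smarter router can shortcut when source and destination numbers overlap, which is where the "at most" in the claim comes from.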
This is how it was done! (Score:3, Informative)
of the original Beowulf (Don Becker and Tom Sterling), and their story goes that it was originally built exactly like this: a bunch of 4-port cards connected in hypercube configuration. By the way, in this case you can scale it to 2^4 = 16 nodes with a worst case of 4 hops.
The reason for this was that Don at the time was writing a Linux driver for that particular card and needed some justification for that activity...
The problem with this approach is that you do message routing in software running on your computational nodes, which is not very efficient compared to dedicated hardware in a switch. Thus, switches have been used ever since...
Paul B.
Re:Eliminate the Switch (Score:2)
Load balancing (Score:4, Insightful)
Serial is faster (Score:4, Informative)
As it happens, parallel interconnects' days are numbered because they are fundamentally limited as transmission speed increases. As the speed goes up you increasingly have problems with things like interactions between data lines and getting the data to arrive at the same time on each line. So, ironically, fewer lines means you can go faster and provide more bandwidth.
Re:Serial is faster (Score:1)
Up the serial speed as much as you want, and you can still up it with a parallel interconnect.
Re:Serial is faster (Score:1)
Re:Serial is faster (Score:2)
This problem is avoided with serial connections.
And as someone already pointed out, this is why Serial ATA is on the rise now.
Re:Serial is faster (Score:1)
It's multi-serial: multiple serial connections running in parallel. A datagram gets chunked at the processor, the chunks cross the parallel serial lines, and are reconstructed at the receiving end. If you're using TCP/IP, just go by packet.
Re:Serial is faster (Score:2)
If you look at all the high-speed "busses" being developed now, they are all going serial. Sometimes they may be multi-serial: I think 10-gigabit Ethernet is actually 4 x 2.5 Gbit/s, and InfiniBand gangs up to 12 x 2.0 Gbit/s channels. But each of these channels is a stand-alone serial channel with its own self-clocking data, not a classic parallel port.
Re:Serial is faster (Score:2)
Ah, so that's why all this talk of Serial ATA. Makes sense now...thanks for taking the time to explain.
My practical experience is that most bottlenecks I see these days are related to latency more than bandwidth.
Channel bonding. (Score:2)
Re:Channel bonding. (Score:2)
If you are only going to 200 meg, there is no point in building new switches and so forth when you can already go up to 400 meg with channel bonding. I could be wrong there, though, as I think you can bond up to 8 ports with EtherChannel.
Just hit Ebay and pick up some 2900's or 3500's and link them together.
Re:Utilize the AGP port... (Score:1)
First, the P in AGP already stands for port, so AGP port is as annoying as LCD display. Secondly the AGP interface is rather one-sided. Data can get from system memory to the card very quickly, but not the other way around.
Re:Utilize the AGP port... (Score:1)
You made my day, AC. Nothing cracks me up like infuriating a
serial v. parallel, distance, latency (Score:2)
Price of an external ultra-160 cable: (Score:1)
Re:Price of an external ultra-160 cable: (Score:1)
Compared to $3 for an ethernet cable, the point is still valid.
Re:serial v. parallel, distance, latency (Score:1)
I agree that latency is very important for some problems. But for others, bandwidth is just as important. One application my lab uses needs to transfer interface information between nodes for each timestep. So every 60 seconds of computation we need to transfer roughly 100MB of data. GigE makes this transfer take about 2s instead of the ~15s of FastE. This is significant.
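The arithmetic behind those numbers checks out at wire rate; protocol overhead accounts for the gap between the theoretical and observed times:

```python
# Wire-rate transfer times for the ~100 MB-per-timestep example above.
# Real protocols (TCP, NFS, MPI) add overhead, so observed times are higher.

def seconds(megabytes, link_mbit):
    """Time to move `megabytes` over a link of `link_mbit` Mbit/s."""
    return megabytes * 8 / link_mbit

data_mb = 100
print(f"FastE: {seconds(data_mb, 100):.1f} s theoretical (observed ~15 s)")
print(f"GigE : {seconds(data_mb, 1000):.1f} s theoretical (observed ~2 s)")
```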
Well, what are you using it for? (Score:5, Interesting)
Consider:
If your nodes need to share more than 12.5 megs of data per second, then you might as well be running 100 megahertz processors.
Of course, I could just be talking out my ass.
Working on a similar problem (Score:3, Interesting)
The good news is that it's less expensive than you think. Decent cards are only marginally more expensive than good 100bT cards, and Netgear now makes a reasonably priced 8-port gigabit switch. It doesn't support jumbo frames, but it's quite usable for small networks.
The bad news is that I'm finding that gigabit ethernet doesn't deliver the performance you might expect using traditional network protocols. NFS in particular sees only modest gains, even when using nfsv3 and increasing the block sizes and tuning the kernel buffers/TCP options. I'm still showing bandwidth bottlenecks on the network when I should be seeing bandwidth bottlenecks on the disk array.
It would appear that something isn't scaling. Given that network benchmarking tools do show gigabit ethernet performing at a reasonable speed, it would appear that most "legacy" protocols are not architected to take advantage of it.
Re:Working on a similar problem (Score:1)
Importance of Jumbo frames (Score:1)
Re:Working on a similar problem (Score:1)
Re:Working on a similar problem (Score:2)
The bad news is that I'm finding that gigabit ethernet doesn't deliver the performance you might expect using traditional network protocols
I think you answered your own question here. My experience is that jumbo frames are absolutely necessary to get the full performance out of gigabit Ethernet (we get 600 Mbit/sec of user data between two points on GbE). Ordinary packets are just too damned small to be able to fill the pipe.
current limitations? (Score:4, Informative)
Of course, the same cluster can be bound in different ways depending upon the applications that are being run. It is important to realize what the limitations are for your desired tasks and focus your improvements there. I have seen several clusters where they spent an ungodly amount of money on Myrinet and a massive amount of time getting it working, when they were running easily-parallelizable tasks that were really bound just by the number of CPUs.
Architecture matching Algorithm (Score:5, Insightful)
Depending on what you are trying to solve, the problem may need to be split up differently. The algorithm to solve the problem and the system you are using need to match well.
Beowulf is great for CPU-intensive tasks with low network usage. Other forms of clustering are good for problems that use shared memory (but this starts to hammer the network). Some tasks split up so that a simple queuing system is all that is needed: all you need is a manager job to determine who does what and whether it was done.
-Tim
Re:Architecture matching Algorithm (Score:1)
SETI@home [berkeley.edu],
distributed.net [distributed.net] or
Folding@home [stanford.edu] have become famous. Mostly CPU work, negligible network load. For SETI@home I have an average network throughput of ~50 bit/second. To saturate a 100Mbit/s network (not even switched) with SETI@home you'd need on the order of a million (1,000,000) PCs.
As for the network - do you need throughput or low latency? Depending on your problem, small changes in the algorithm can do wonders. E.g. for film rendering you might choose a few NAS boxes and a horde of dumb/diskless rendering slaves. If you copy the model libraries (for the included figures, textures, etc.) onto a local disk at the beginning of a scene render run, you will decrease net load a great deal (I've done that with Povray [povray.org] rendering myself).
If you don't have the resources to buy e.g. Myrinet, try alternative architectures that might fit your problem, e.g. hypercubes (see other posts) or models like Flat Neighbourhood [aggregate.org].
Re:Architecture matching Algorithm (Score:2)
Too often people jump on the "beowulf mentality" without really knowing what is out there and how they differ.
When we render, we copy the scene file local and render the images local. Then the images get copied back "online" so that the NFS servers do not get killed with open files for long periods of time. When it takes 20 some hours to write 1.8 mb of data over an NFS connection (with over 500 machines doing the exact same thing) the system is less than optimized.
I was just thinking about creating cheap hypercubes with networking over FireWire. As long as the routing is set up, you could have 4 connections on a machine plus the 10/100/1000 Ethernet connection, and one could add multiple cards; a hypercube of 8 nodes only needs 3. I know that PVM works well in such a situation as long as the procs initialize in the right order (proc 1 is always proc 1). Maybe this would be a good research project.
I know when I was at school we worked on an 8-node system set up as a hypercube (it had 2 groups of 4 procs, so some of the "hypercube connections" were over a SCSI bus instead of a true connection between the procs). The procs were slow and the machine was expensive. Now something faster (with a slower backplane) can be built for less.
-Tim
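For readers trying this at home, the hypercube wiring being discussed is easy to compute: node numbers are binary strings, each link flips one bit, and routing fixes one differing bit per hop. A quick sketch (function names are mine):

```python
# A d-dimensional hypercube: node i connects to the d nodes whose
# number differs from i in exactly one bit. Routing fixes one bit
# per hop, so the worst case is d hops (the Hamming distance).

def neighbors(i, d):
    """The d direct neighbors of node i in a d-cube."""
    return [i ^ (1 << b) for b in range(d)]

def hops(src, dst):
    """Minimum hops = number of bits in which src and dst differ."""
    return bin(src ^ dst).count("1")

d = 3                       # 8 nodes, 3 links per node
print(neighbors(0, d))      # [1, 2, 4]
worst = max(hops(s, t) for s in range(2**d) for t in range(2**d))
print("worst-case hops:", worst)   # 3
```

This is why 8 nodes need only 3 ports per machine, leaving a spare FireWire or Ethernet port free on a 4-port setup.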
Use an existing solution - SCSI (Score:3, Insightful)
To scale, put multiple controllers in a high-bandwidth machine, moving data between chains. With 8 machines there'd be no need, because they could all fit on one.
Re:Use and existing solution - SCSI (Score:2, Informative)
was implemented back in pre-1.0 kernel days! I doubt it would be easier to port it to 2.5.xx than to write it from scratch, though...
Paul B.
DUDE! (Score:4, Funny)
Cray (Score:2)
Lazy reply, but might point to right direction... (Score:2)
Flat Neighborhood Network (Score:1)
Klatz 2 (Score:1)
Remember kids, preview your posts.
Re:Lazy reply, but might point to right direction. (Score:1)
Good stuff.
Another variation on the theme. (Score:1)
With my Beowulf underperforming, I was considering this idea:
The nodes are kept REALLY close together... maybe stacked. Custom-made PCI cards have their entire 32-bit data pins ganged together; however, a simple and really fast (66MHz?) 32-bit latch or multiplexer is needed for each PCI card. The 'bus' of 32 wires runs across all the nodes... practically this should be no more than 2 nodes. The CPU sends PCI data to a specific address.
Now the PCI card has one address where it listens on the outside... the 32 wires. On the inside it simply blasts out all PCI data to the other PCI cards, a bit like Ethernet cards connected to a hub. Unless we can build a switching mechanism, 2 nodes will be easy to start with. (Update: we'll need a simple buffer to counter the lack of synchronization between the PCI clocks of the nodes.) So when PCI data comes in from the CPU, the card blasts it out to the 32 wires... and the other card listens; if it hears its address, it puts the data in its buffer, which on the next clock interrupts the CPU and hands the data over.
Other variations include using just DMA to put the data straight into a software-allocated region of memory. Please note I'm thinking of multiprocessing systems rather than this gentleman's Beowulf. With shared-memory IPC, we can have a ball.
Using PCI directly has the limitation of distance... not more than 2 feet, I think. But I don't remember if this was for 33MHz or 66. I am also running the risk of not understanding the workings (and other unknown limitations) of PCI.
I hereby patent these Ideas under the BSD license.
Why not use firewire? (Score:2)
Might be wrong, but... (Score:2)
Am I missing something? I'm not an expert on networking.
IEEE 1394? (Score:2)
I only mention this because IEEE 1394 was created with networking in mind, and it (should) work fundamentally the same as Ethernet. It runs at 400Mbps (which is really what most copper gigabit cards are limited to).
In addition, once you pass the 400Mbps mark, you also have to factor in the bus speed limitation of the nodes in the cluster. The 32-bit PCI bus cannot transfer huge amounts of data. 64-bit PCI is available on newer, expensive workstation boards, but as far as costs are concerned it's almost as impractical as Myrinet (for which a NIC will set you back about $1500, and require 64-bit PCI as well).
A final suggestion: why not put 3-4 100Mbps Ethernet cards in each node? I believe this is possible, and it has been done before... Since 100Mbps cards cost under $10, this seems like a very attractive solution. For even more speed, it might be possible to put each NIC on a different subnet and use 3-4 switches - but I don't know if this would be possible or even help.
Can you do it in software? (Score:2)
Picture the following scenario (we'll use 4x parallelism as an example; you could go with whatever number you fancy):
You have an alternate protocol, let's call it PIP (Parallel IP), and you code for it in your network stack. It acts like IP, and normal TCP/UDP will run on top of it. On a per-interface basis you decide that your public card talks IP and the 4 cards on the private Beowulf network talk PIP.
In some config file, you define that eth1-eth4 are part of a PIP group with a single IP address and four MAC addresses. You probably have to define all the beowulf nodes in a central file that you distribute to all the nodes (so they know the mac/ipaddrs of the remote nodes).
The PIP stack takes each would-be IP frame and, instead of encapsulating it in an Ethernet frame bound for the single MAC address of the recipient IP address, splits the packet into four chunks with a unique serial number (say 34534568) and sequence numbers 1-4 in an extremely short PIP header that goes on top (inside the Ethernet framing). You look up the 4 MACs for the destination IP (PIP) address, and you send chunk #1 out your MAC#1 destined for their MAC#1, etc...
On the receiving side, the PIP stack looks for all four sequence numbers to reconstruct packet 34534568. When and if it receives all four, they are reconstructed and passed upwards. After some minimal timeout (a few milliseconds at most?) if only some of the sequence numbers have been received, toss the packet out and let TCP (or the app, in the case of UDP) deal with the lost packet; after all, IP isn't guaranteed delivery.
I'm of course leaving out lots of little implementation problems, but that's what you get in a 5-minute Slashdot idea.
You might also want to handle ICMP in some custom fashion, like sending a ping down all four mac-pair-paths and telling the ping program you got a valid echo when/if you see all four replies.
The downside of course is that it's not really guaranteed to give any latency decrease, due to the problem of lost packets and waiting for all four lanes to sync up and whatnot - although it would almost certainly result in an overall bandwidth increase for a TCP stream.
Anyways, food for thought...
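A minimal user-space sketch of the striping/reassembly logic described above (the 8-byte header layout, the field names, and the `LANES` constant are all invented for illustration; a real implementation would live in the network stack):

```python
# Toy model of "PIP": stripe a payload across LANES chunks, each tagged
# with (serial, sequence, length), and reassemble on the far side.
# If any lane's chunk is missing, drop the packet and let TCP retransmit.

import struct

LANES = 4
HEADER = "!IHH"            # serial (32-bit), sequence, chunk length

def stripe(serial, payload):
    """Split payload into LANES chunks, each with a PIP header prepended."""
    size = -(-len(payload) // LANES)                 # ceil division
    frames = []
    for seq in range(LANES):
        chunk = payload[seq * size:(seq + 1) * size]
        frames.append(struct.pack(HEADER, serial, seq, len(chunk)) + chunk)
    return frames

def reassemble(frames):
    """Rebuild the payload if all LANES sequence numbers arrived, else None."""
    chunks = {}
    for f in frames:
        serial, seq, length = struct.unpack(HEADER, f[:8])
        chunks[seq] = f[8:8 + length]
    if len(chunks) < LANES:
        return None                                   # lost a lane: drop
    return b"".join(chunks[s] for s in range(LANES))

frames = stripe(34534568, b"x" * 1500)
assert reassemble(frames) == b"x" * 1500
assert reassemble(frames[:3]) is None                 # one lane lost -> drop
print("striped into", len(frames), "frames")
```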
Re:Can you do it in software? (Score:2)
Another thought on the pros and cons
Pro: You could hook eth1 of all the machines to switch1, eth2 all to switch2, etc., using independent switches for each parallel communication path; it would give better performance than running it all into one switch and would still work.
Con: N-way parallelism in this fashion multiplies your chance of total network failure by N (4-way means you've now got 1/4 the normal MTBF on your network cards/switches/etc.). Of course the biggest con is that going 8-way would require a lot of PCI slots. I think some PCI card vendors make dual Ethernet cards, but I'm not sure if any are linuxable. Sun makes a quad fast Ethernet PCI card for their boxes; maybe the driver from sparclinux could work, with a few changes, putting the card in an x86 box?
You describe a "Flat Network" (Score:1)
So long... (Score:1)
I don't think you should be playing with Beowulf (Score:1)
Just what is it you do with this hobby, other than annoy your gf?
*Latency* is the issue (Score:1)
The area where it's useful to use a cluster is in problems where you can split the calculation in small pieces. Typically you don't need to send tons of data across processors, but you need to do it often and quickly. That's where latency kills you.
So you have an idea which is a solution looking for a problem. And since others have already pointed out that high-bandwidth parallel systems are ultimately a bad idea, it turns out you have a _bad_ solution looking for a problem.
I'm not saying that there aren't ways of improving life in the clustering world. But yours isn't one of them. Sorry.
Re:*Latency* is the issue (Score:1)
Hmm, not to sound too much like a purist, but it is useful to use a cluster when you can split the calculation into LARGE pieces. When the pieces are SMALL you need communication often and get bitten by latency...
Paul B.
*Latency* is not really the issue (Score:2)
Latency is nowhere near as important as bandwidth. If you can successfully overlap communication with computation in, say, an MPI application, you achieve much better wall-clock times for runs of big jobs.
I have no idea why people are so obsessed with latency in clusters... perhaps the apps are coded with all-blocking communication semantics and cannot do the overlap I described above. Basically, if you can stream data to the processors and keep them busy, latency stops being the limiting factor.
I actually develop a commercial MPI implementation and we have seen real evidence of my claims above. If you are going to SuperComputing 2002 in Baltimore Maryland next week stop by the MPI Software Technology Inc. booth [I won't actually be there... but some of my demos will be]... ask some questions and try to talk to one of our technical guys... they may be better able to explain this since I am not a senior engineer.
[This should be taken not as a representative statement on behalf of the company I work for but as my own humble interpretation of the real issues].
What's your goal?? (Score:2)
If your goal is to increase bandwidth, I would use 4-port 100Mb cards and make a hypercube out of the thing. You won't be limited by the backplane of your switch (which is what? if it's a $100 switch, don't expect to be pushing 3 machines at full keel down it), since the nodes will have dedicated 100Mbps connections to each other. Relatively cheap, and there are many examples out there.
You might also take a look at SCALI, although it comes into the range of myrinet, but I think it does exactly what you want.
Gigabit switches with all ports being gigabit are about $450, and quite speedy.
One option is to look at VIA over gigabit, or to use PM (GM is for Myrinet; PM is from the Real World Computing partnership in Japan). Essentially, the issue with doing regular TCP/IP communications is that TCP/IP is slow but pretty fault-tolerant. These other protocols are not nearly so fault-tolerant, but since they run on dedicated networks, they don't need to be.
HIPPI (Score:1)
You'd need to design a switch for it yourself however as it's a point to point system.
It currently tops out at 1600Mb/s which is greater than what 33 MHz PCI can do anyway.
I'm pretty sure the 1394b (Firewire) standard was designed to go up to 1600Mb/s as well.
Lots of good replies here, but cost is your issue. (Score:2)
Parallel solutions on flexible cable will not be viable anywhere near the average front-side bus speed, never mind faster interfaces. Sorry.
A not too recent slashdot article pointed out a unique way to manage large data transfers among clusters though:
As a simple example, each computer has N 100Mb interfaces. There are a total of J computers. You use K hubs (or, ideally, switches) and connect them such that each computer has direct hub access to a portion (or all) of the cluster. You can hook the hubs together using a managed switch or router if you want to enable them to reach all the computers in under two hops without routing through other computers. This is very flexible - you could have it set up so each computer has more than one line going to other computers. Using managed switches you could even reconfigure it on the fly, though your latency goes up.
The idea here is that each computer has a possible bandwidth of multiple times the network card speed. Furthermore, collisions are down, and latency is down. Throughput can reach higher than the practical 80% of ethernet bandwidth.
The real problem with ethernet is latency and collisions. The reason it's used is because it's dirt cheap. Dirt Cheap. Commit that to memory. It's often better to buy a lot of slow devices than one expensive really fast device. You can overcome most of the limitations of the slow devices with other tricks.
So, while your idea has some merit, the reality is much different than the ideality.
-Adam
Parallel, kinda (Score:1)
Use multiple FireWire ports. Many newer chipsets are coming with 2 FireWire ports built in, NOT on the PCI bus, so you wouldn't hit the PCI bus limit of ~132MByte/second: 2 built in, plus whatever you wanted to put on the PCI bus.
Also, 66MHz/64-bit PCI has 528MByte/second available on the bus, so FireWire on 64/66 PCI would give you ~10 ports at full speed.
Many machines that have 64/66 PCI have multiple PCI buses, so you have 528MByte/s per bus.
The network isn't the problem, TCP/IP is (Score:2)
There is a big push for TCP offload - to take the effort of TCP and put it into an ASIC - check out this paper [10gea.org]. TCP offload is absolutely necessary if we want to go beyond 1-gigabit Ethernet any time soon, because the rate at which we can communicate is beating the rate of increase in processing. IIRC Moore's law is an 18-month doubling time; there is a similar law for communications speed that says it doubles at a 12-month rate.
In addition, the engineering hassle of making a device with so many parallel channels is considerable. When you start doing high-speed communications, you have to take a *lot* of care to make sure that the signals get there as you want them. This is why Intel appears to be in the process of moving away from PCI (a parallel bus) and moving to PCI-Express/3GIO, a combination parallel/serial bus with a variable number of serial channels. PCI-Express is designed to carry on 16 wires what PCI-X carries on 80+ wires. When you actually have to make hardware, this makes a *HUGE* amount of difference.
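Taking the two doubling times at face value, the gap compounds quickly, which is the whole case for TCP offload:

```python
# If link speed doubles every 12 months and processing every 18,
# the ratio of link speed to processing power doubles every 3 years.
# Doubling times are the rough rules of thumb quoted above.

def growth(years, doubling_months):
    """Growth factor after `years` at the given doubling period."""
    return 2 ** (years * 12 / doubling_months)

for years in (3, 6, 9):
    ratio = growth(years, 12) / growth(years, 18)
    print(f"after {years} years, links outgrow CPUs by {ratio:.0f}x")
```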
Re-inventing.... again.... (Score:2)
Actually, there are a lot of other posts here that talk about why parallel isn't necessarily better. At high speeds with long cables it can be a pain in the ass to keep all the bits lined up. Sometimes it is easier to develop hardware that does serial faster than it is to add more wires and do it in parallel.
Hold on a second. (Score:2)
Whenever you increase the performance of one area in a system, another area becomes the bottleneck. Right now, your bottleneck is assumed to be the network.
As you said, you could possibly do bonding on the interfaces to multiply the bandwidth. However, to accomplish this you will need a better switch that is capable of this bonding. Nortel offers a small switch that can do this; the Baystack 450 is the lowest-end model capable of what they call multi-link trunking. Cisco offers several switches that are also capable of this. Cisco calls it EtherChannel, and I think the Catalyst 3500 is their entry-point switch with this capability. These switches can bond up to 4 links, I believe, giving you up to a 400Mbps trunk per switch. To do this with multiple systems you would really need a larger and significantly more expensive switch.
You dismissed gigabit, saying that you will suffer the same limitations due to serial communication. However, I believe that gigabit will be the best solution in this case. Your gigabit performance will not be limited by serial communication as you think; it will be limited by the PCI bus in the system. In fact, your system will not be able to drive gigabit cards beyond about 600Mbps because of the PCI bus speed. If you are plugging a disk controller into this PCI bus, further sharing the PCI bandwidth, you will further limit the performance you can expect from your gigabit cards.
There are only two ways around this limitation, and only by combining them will you be able to truly maximize gigabit performance. The first is a system with dual peer PCI buses. This allows you to run disk controllers on one bus and network controllers on another, giving each controller the full bandwidth of its bus; it still limits you to around 600Mbps, though. To go beyond that, you need a 64-bit PCI bus, preferably dual peer. The drawback here is that you will need to acquire a true high-end server to get these features. Desktop systems and low-end servers do not have 64-bit PCI buses and rarely have dual peer PCI buses.
So, basically gigabit ethernet will provide the highest possible performance on your existing systems. If you want to go beyond that you need to replace your systems with high-end servers with multiple 64 bit PCI buses, running multiple gigabit NICs, multi-link trunked to a high-end gigabit switch.
One final note about your parallel processing idea: it's a good idea. It's also been done, more or less. 3Com NICs have had a somewhat similar feature that they call "Parallel Tasking" for years now. Parallel Tasking cards offload network operations to a processor on the NIC, leaving the main CPU free for other operations. I'm sure there are other NIC manufacturers that do similar things, but 3Com has always hyped this feature so they are the first to come to mind. No matter how fast your processor and your NIC, though, you still have to cross a PCI bus, and at 32 bits wide it is the PCI bus that will be your bottleneck.
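The PCI numbers underlying this advice are easy to reproduce (theoretical bus peaks; real-world throughput is lower, which is where the ~600Mbps figure comes from):

```python
# Theoretical peak bandwidth of a PCI bus: width in bits times clock.
# bits/clock * millions of clocks/s = Mbit/s.

def pci_peak_mbit(width_bits, clock_mhz):
    return width_bits * clock_mhz

print("32-bit/33 MHz PCI:", pci_peak_mbit(32, 33), "Mbit/s peak")   # 1056
print("64-bit/66 MHz PCI:", pci_peak_mbit(64, 66), "Mbit/s peak")   # 4224
print("one gigabit NIC wants ~1000 Mbit/s each way, plus disk traffic")
```

At 1056 Mbit/s theoretical, a single shared 32-bit/33MHz bus has barely one gigabit NIC's worth of headroom before any disk I/O, which is why the dual-peer 64-bit buses matter.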
All NIC's are not created equal. (Score:1)
Thanks So Far (Score:2, Informative)
Finally, a few of you have suggested that bit skew would be a problem due to cable length at high speeds. I don't think this is so; otherwise ATA 100/133 hard drives wouldn't work.
As I suggested in the original post: is there a Computer Engineering student who would like to take this up as a final year project?
Peter Gant
Not new (Score:1)
And the idea is also firmly established in the NAS market. I don't think it is in use in the SAN market, but that is probably because most SAN vendors are using either GigE or FC-AL. However, once iSCSI is more commonly available (I believe it is in progress for NetBSD; I don't know what the Linux status might be), I'd imagine running it over channel-bonded links would be no trouble either.