What Do You Need To Watch For In A Linux SMP System?
thefin asks: "My research group has finally received funding (~$200K) for a single SMP box. I've looked over the offerings from Sun, IBM and Compaq. I was wondering what others think of the SMP offerings and, in particular, opinions of Linux as an SMP OS. What should we know before purchasing such a machine? What should we look for and what should we avoid? We will be using the box for large, individual-based ecological modeling efforts. These models are CPU-intensive and make heavy use of inter-node communications. In particular, inter-node communication can be a serious bottleneck in our models."
distributed (Score:3)
Ho-hum (Score:3)
Secondly, if internode communication is as high as you say and you can't parallelise your code/algorithms, you made a good choice with SMP (as opposed to Beowulf). Avoid x86, go for Alpha. Avoid Linux or *BSD, go for the native OS, which scales to multiple CPUs much better.
If you can parallelise your code, the obvious thang is Beowulf and cheap x86 boxes.
-- Eat your greens or I'll hit you!
Re:Ho-hum (Score:3)
-- Eat your greens or I'll hit you!
I love linux... (Score:1)
If you mean to consider a Beowulf or MOSIX cluster of Linux machines, then Linux is a decent choice, but a cluster does not function the same way as a shared-memory supercomputer, and you will have to write programs differently to be able to use the clustering. This might be a problem, or it might not be.
*Not a Sermon, Just a Thought
*/
If you do use Linux, check drivers and the kernel (Score:2)
Also, check your drivers for SMP compatibility, some are better than others and have reputations for it. On the extremely rare occasions that you can get a manufacturer-supplied driver, you should be ok if it's hardware meant for big boxes like this.
Lastly, don't forget clusters. (warning: a long rant on clustering follows, if you aren't interested, this is the time to hit 'Back')
Linux is probably the ideal solution for clusters because of its efficiency and reliability, and the fact it hoses just about everything that is benchmarked against it on lower-end (i.e. two processors or fewer) machines. If you have a lot of clients on the network that perform mundane tasks (word processing, most spreadsheets, etc. that don't fully utilize system resources), utilizing idle cycles à la SETI@Home, combined with a network renovation to 100BASE-TX or better, could yield a lot of horsepower that, at the very least, could supplement your big box (especially in the off hours, provided the computers are left on). The network upgrade is not very expensive these days, but you'd want to eliminate hubs for any machines involved, since bandwidth is at a premium.
The money invested in this may give you more bang for the buck than buying new equipment, and if you want to cluster smaller machines instead of buying one big box, you can use the remaining money to buy a set of dedicated machines, string them together on gigabit fiber-optic or something, and then bridge them to the rest of the network.
x86 machines are great for clustering. PPC is a little overpriced for the power it gives you in many standard configurations (especially compared to Athlon processors if you do floating-point-intensive stuff) unless you use Apple's AltiVec, where PPC tends to kick butt. As for SPARC, Alpha, MIPS, and anything I forgot to list here, you're on your own, since I've never fooled with any of these. If you want to cluster, I'd stick to whatever the rest of the network uses so you only have to code for one processor.
How to parallelize (Score:2)
Three things are critical when parallelizing work:
SMP machines usually offer better throughput and latency rather than raw CPU power, whereas clusters have more CPU power but weaker throughput and even worse latency.
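To make the throughput/latency trade-off concrete, here is a minimal sketch of the usual "latency plus size over bandwidth" cost model for moving one message. The function name and all of the numbers are made up for illustration; they are not measurements of any real SMP box or cluster.

```python
def transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Estimated time to move one message: fixed latency plus size / bandwidth."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


# Hypothetical figures, chosen only to show the shape of the trade-off.
SMP_LATENCY, SMP_BANDWIDTH = 1e-6, 800e6             # 1 microsecond, 800 MB/s
CLUSTER_LATENCY, CLUSTER_BANDWIDTH = 100e-6, 12.5e6  # 100 microseconds, 100 Mbit/s

for size in (64, 1_000_000):
    smp = transfer_time(size, SMP_LATENCY, SMP_BANDWIDTH)
    cluster = transfer_time(size, CLUSTER_LATENCY, CLUSTER_BANDWIDTH)
    print(f"{size:>9} bytes: SMP {smp:.6f} s, cluster {cluster:.6f} s")
```

With many small messages the latency term dominates, and with a few big ones the bandwidth term does, so it is worth knowing which kind of traffic your model generates before comparing machines.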
The major advantage of a cluster is that you can re-use it in multiple configurations (tree, (hyper)cube, pipeline, flat neighbourhood), whichever meets your needs best.
As for cheap clustering: maybe you have enough clients/workstations that can be (mis)used as a cluster during nighttime (and NICEd down during working hours) for a first test run?
It all boils down to how well you can parallelize your simulation. If you cannot, multiple processors, in either configuration, won't help you.
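That last point is essentially Amdahl's law: the serial part of a program caps the speedup no matter how many CPUs you add. A minimal sketch follows; the function name and the example fractions are illustrative assumptions, not figures from the poster's simulation.

```python
def amdahl_speedup(parallel_fraction, n_cpus):
    """Upper bound on speedup when only `parallel_fraction` of the work scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_cpus < 1:
        raise ValueError("n_cpus must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cpus)


# Illustrative fractions only: nothing parallel, half, 90%, and everything.
for fraction in (0.0, 0.5, 0.9, 1.0):
    print(f"parallel fraction {fraction:.1f}: "
          f"8 CPUs give at most {amdahl_speedup(fraction, 8):.2f}x")
```

A code that is only half parallel cannot even double its speed on 8 CPUs, which is why it pays to profile the simulation before spending the money.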
I've got some smp boxes... (Score:2)
So far I've tried Linux, FreeBSD, BeOS, and Win2k on them. I usually use Linux; I like it, and it runs well, for the most part, under SMP. There are a few issues with Linux SMP, though. For starters, before the 2.4 kernel FreeBSD was much faster. I'm not sure if the 2.4 kernel is faster than FreeBSD; I haven't run numbers yet.
From my experience I'd say, if you're going to use Linux, make sure the drivers for your hardware all work under SMP. I've had some SCSI cards and some UltraDMA 66 IDE cards whose drivers are terribly unstable under SMP. (You might have to read the code for the drivers and look at the comments.)
Of course EVERYTHING I've done with SMP has been on Intel hardware, and while I'm sure you can pick up 200K SMP Xeon boxes, they're probably not the most bang for the buck. If you really want to run Linux SMP on a machine of this size, you'll probably have to go Sun, SGI or Alpha. Linux is pretty good on all three platforms there, but I'd stay away from RedHat on anything but Intel hardware (OK, me personally, even on Intel).
So maybe the steps you should take are:
1 Decide if you need 32 or 64 bit machines.
2 Pick up a dual-processor workstation with similar processors to develop and test on. Let's say you think you want an E450; then pick up a used dual-processor SPARCstation from one of the many used-hardware dealers that cater to ISPs.
3 Make SURE all the drivers and code work well on it.
4 Buy the HUGE machine.
Re:Don't get Linux (Score:1)
On the other hand, I use HP-UX machinery at work that is very similar to what is called for here, and can assure you that:
--
NUMA from compaq (Score:1)
SUN (Score:1)
Consider particularly Alpha and SGI (Score:3)
SGI's systems are really designed for this. The NUMA architecture is a way to mix what you're doing (i.e. lots of cpus, lots of memory) with the ability to make some things "closer" to others. If you're able to think of your app at least slightly NUMA like, it works well.
Otherwise, let me recommend the Alpha boxes. Tru64 UNIX is a phenomenal operating system, and it scales beyond your imagining. It really is that good.
The other thing to think about is what kind of CPU usage you have. Assuming that you're using floating point computation, you need to immediately discount the Intel architecture. It's STILL hobbled by the terrible FPU that it's had for years, which is why the Athlon kicks it around on this stuff. Alpha and MIPS have the best FPUs implemented (SPARC is okay, but definitely not as good as the Alpha).
you need the right questions (Score:2)
First, is it FPU or integer intensive? If integer, Intel can make sense. If FPU, can you use SIMD (MMX, 3DNow!, AltiVec, etc.)? If you can, Intel can still work, but if not, RISC is the only choice.
Second, how much memory bus bandwidth does it really need? OK, a LAN will not cut it, but could you live with everything (CPU-CPU or CPU-memory) going over one 100 MHz 64-bit bus (OK, I think the 8-way boxes have two buses, 4 CPUs on each)? If so, an 8-way Intel box may be the right choice. Test to see if you need 2MB cache; if everything is good with 1MB or 512k, you can save some money. I don't know how the RISC boxes are with regards to bus, but they are better by a huge margin than Intel. If you are going to wait a bit, the AMD 8-way boxes may be available; you'd get stunning integer performance, OK FPU, and good memory bus performance, since it is the same bus the Alpha uses.
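To put the "one 100 MHz 64-bit bus" question in numbers, here is a rough back-of-the-envelope sketch. It assumes one transfer per clock and an even split between CPUs, which real chipsets will not deliver exactly; the function names are just for illustration.

```python
def peak_bus_bandwidth(clock_hz, width_bits, transfers_per_clock=1):
    """Theoretical peak of a shared bus in bytes per second."""
    return clock_hz * (width_bits // 8) * transfers_per_clock


def fair_share_per_cpu(bus_bandwidth, cpus_on_bus):
    """Bandwidth per CPU if every CPU hammers the bus equally (an assumption)."""
    return bus_bandwidth / cpus_on_bus


bus = peak_bus_bandwidth(100_000_000, 64)  # 100 MHz, 64 bits wide
print(f"peak: {bus / 1e6:.0f} MB/s")
print(f"4 CPUs on one bus: {fair_share_per_cpu(bus, 4) / 1e6:.0f} MB/s each")
print(f"8 CPUs on one bus: {fair_share_per_cpu(bus, 8) / 1e6:.0f} MB/s each")
```

If a single process of your model already moves more data per second than its share here, the extra CPUs will mostly sit waiting on memory.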
Disk I/O, how much? Are you loading from disk and just chewing on it for a few days? If so, disk I/O doesn't much matter, but if it does, Linux may not be the best choice.
I'm guessing that network io does not matter.
For this kind of task, I bet that Linux will work just fine (splitting the multi-threaded or multi-process app over multiple CPUs). I'm guessing that if a big Intel box is not suitable, a big Alpha box is the best choice.
How much ram does the box need? If memory speed is the limiting factor, RISC usually has better memory designs.
$200k is not much money, really. I'm guessing it will buy you a fully tricked-out 8-way Xeon Intel box, or somewhat less RISC. Of course, if you get more $, you can add more CPUs to your box, but that is not true for Intel.
Hmmm, spend $100,000 on an intel box now, and $100,000 when you can get an 8 way AMD Athlon? Might be best of all worlds for your money.
Better questions people!
SMP summary (Score:1)
Skip the Celeron SMP machines. You have the money, you can afford better SMPness. Look at the Intel Xeon CPUs, which are the best for SMP at the moment. You cannot use P4s in an SMP configuration but VERY SOON, you will be able to use AMD Athlons to provide SMP. These will likely be your best option, I would think, for x86 platforms. The AMD SMP bus seems better designed than Intel's, though perhaps AMD won't allow more than two CPUs at once?
Make sure if you use Linux, you are using the latest stable 2.4 kernel. The difference on an SMP machine compared to a 2.2 kernel is astounding.
You are going to have some hard questions to answer about memory. DDR SDRAM is almost always the clear winner over RDRAM but perhaps things work differently in an SMP environment, I'm not sure. RDRAM, of course, is Intel's favourite while DDR SDRAM will be what Athlon SMP machines can use.
Anyway, I hope this helps. Remember, though, that you have the money to spend. Do some more research. Alpha architecture may well be the far superior platform for you, though it will cost lots more money.
The Bus a Bus (Score:2)
The most significant and compelling reason I can think of to not go with x86 architecture is that every PC MP setup besides AMD 760 MP (which isn't out yet, and will only support two processors to begin with anyway) has a truly pathetic bus architecture. If you're doing something where bandwidth makes a difference, which you would seem to be if you're getting a single SMP box rather than contemplating clustering, the bus bandwidth is going to figure in, too.
As you probably know, in an intel-based solution, all CPUs sit on the same bus to the so-called "memory masters [anandtech.com]";
Meanwhile, the AMD CPU bus architecture is more closely based on the Alpha EV6 bus, and each CPU has its own bus to sit upon.
Since you can't go with an AMD 760MP solution, you should get a real live Unix box: something slick from Compaq/DEC, Sun, IBM, or even HP (though I don't recommend it). If you're looking for raw power and the ability to handle large numbers quickly, Alpha is probably your daddy.
Side note: Apologies to Busta for the title of my comment. I couldn't resist.
Re:Ho-hum (Score:2)
Don't get Linux (Score:2)
Re:NUMA from compaq (Score:1)
Re:Ho-hum (Score:2)
In my measurements they do compare. It's about a 5% hit for Linux vs. Tru64, and another 5% hit for the Compaq Tru64 compiler being better integrated into the OS than the Compaq Linux/Alpha compilers. So that's about 10% slower overall, which isn't bad at all... It's true that gcc/egcs and its math library are pretty lame on Alpha compared to the Compaq compilers, but the Compaq compilers are free-ish so it's not a problem.
This is running MPI codes on 4-way Alpha/ev67/667 es40's with 8G mem and 2.2.16 Linux - so 2.4 Linux would be better - leaving only about a 5% difference between Linux and Tru64. Most of that is in page colouring (people guess) which Linux doesn't support.
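As a quick check on the "5% plus 5% is about 10%" arithmetic, here is a small sketch that combines independent hits multiplicatively. Treating each hit as a fraction of throughput lost, and treating the two hits as independent, are assumptions on my part; the function name is illustrative.

```python
def combined_slowdown(*hits):
    """Combine independent fractional throughput hits into one overall hit."""
    remaining = 1.0
    for hit in hits:
        remaining *= 1.0 - hit  # each hit keeps (1 - hit) of what is left
    return 1.0 - remaining


both = combined_slowdown(0.05, 0.05)   # OS hit plus compiler hit
print(f"OS + compiler: {both * 100:.2f}% slower overall")
print(f"OS only:       {combined_slowdown(0.05) * 100:.2f}% slower overall")
```

The compounded figure comes out slightly under 10%, so rounding to "about 10%" is fair.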
The big thing that the Linux/Alpha version of the Compaq compilers doesn't support is OpenMP style SMP programming - so you'd have to do this by hand with threads or MPI or buy a different compiler.
My advice is to go with SMP Athlons as these are pretty much a match for Alpha in Floating Point these days - primarily 'cos Alpha MHz's haven't grown much for years. Athlons are also way cheaper so you can buy many more of them :-)
Re:The Bus a Bus (Score:2)
I know that the REALLY big Unisys/Compaq boxen (32-way P-III Xeons) use what's known as CMP, and I think that it's the basis in a scaled down form for most 8-way intel boxen. The acronym is for Cellular Multi-Processing, and the basic idea is that you have NUMA without calling it NUMA. Each "cell" is some number of processors (up to 4), some number of DIMMs, and a connector to a crossbar which connects everything.
I know that this is a very common architecture for cheaper SMP-type boxes. Crossbar complexity grows roughly with the square of the number of things attached to it, so you take some buses and do a crossbar between THOSE, rather than just a huge crossbar. And it works as long as you're not able to max out any particular bus.
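To see why a crossbar between buses is so much cheaper than one huge crossbar, here is a minimal sketch that counts crosspoints, assuming a full N-by-N crossbar needs N squared of them. The device counts are illustrative and are not taken from any real CMP box.

```python
def crossbar_crosspoints(ports):
    """Crosspoints in a full crossbar connecting `ports` things to each other."""
    return ports * ports


def bused_crossbar_crosspoints(devices, devices_per_bus):
    """Crosspoints when devices first share buses and only the buses are crossbarred."""
    if devices_per_bus < 1:
        raise ValueError("devices_per_bus must be at least 1")
    buses = -(-devices // devices_per_bus)  # ceiling division
    return crossbar_crosspoints(buses)


devices = 32  # hypothetical count of CPUs and memory banks
print("flat crossbar:     ", crossbar_crosspoints(devices))
print("4 devices per bus: ", bused_crossbar_crosspoints(devices, 4))
```

The saving is paid for with contention: everything on one bus now shares that bus, which is the "as long as you're not able to max out any particular bus" caveat.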
The core difference between something like CMP and something like the SGI NUMA boxen is that the SGI boxen have crossbars within each unit. So if you have a processing unit of 4 CPUs and 4 DIMMs, there's a crossbar between each of those, and a whole separate crossbar connecting each processing unit.
Connecting more than 4 of anything on a bus is worthless. I can max out the bus on a dual-P-III pretty easily. Give me two more and you'll start to see declining performance as everything waits for memory to be copied.
Re:Ho-hum (what is official) (Score:1)
Of course Alpha's official OS is OpenVMS! VMS Clustering puts all others to shame IMHO.
Has anyone else heard of Compaq's "Galaxy" technology? This must be the best-kept secret around. Basically a Galaxy is a set of OS instances running on a single SMP box, with resources able to be redeployed from one instance to another _on_the_fly_!
Sysadmin: "hmm, that database server is chugging a bit, might just steal a CPU from the webserver for a bit"...*clickety-click*..."that's better."
/Perthling
2-4 CPUs: Linux >4 CPUs: UNIX (TM) (Score:1)
Of course, I'd be very keen to boot 2.4 on our 16 CPU machine and see how well it performs in comparison.
Xix.
Re:Ho-hum (Score:1)
I'm running C + MPI scientific codes. The MPI (lam mostly) calls map through to shared mem operations. My codes aren't super-heavy on the communications though - maybe 5% or 10% of the time is in comms in a mix of many small operations and some big bandwidth limited ops - so yeah - it's possible that an app which used GB/s of bandwidth between CPUs would hit more limitations in the Linux SMP implementation compared to Tru64.
As always it's a good idea to try your specific app and different OS's before buying the machine.
look closely at your calculations (Score:1)
if you're going to end up doing a lot of number crunching, you should really look into some form of multiple-cpu alpha. the alphas absolutely fly when it comes to floating point calculations. by the sounds of it, you would almost definitely benefit from having some form of alpha solution. you will be better served by compaq's tru64 unix than linux; i'm pretty sure that any alpha with 16+ cpus would definitely laugh at linux.
.brad