Hardware Hacking / Supercomputing / The Almighty Buck

Linux Clustering Hardware? (201 comments)

Kanagawa asks: "The last few years have seen a slew of new Linux clustering and blade-server hardware solutions; they're being offered by the likes of HP, IBM, and smaller companies like Penguin Computing. We've been using the HP gear for awhile with mixed results and have decided to re-evaluate other solutions. We can't help but notice that the Google gear in our co-lo appears to be off-the-shelf motherboards screwed to aluminum shelves. So, it's making us curious. What have Slashdot's famed readers found to be reliable and cost effective for clustering? Do you prefer blade server forms, white-box rack mount units, or high-end multi-CPU servers? And, most importantly, what do you look for when making a choice?"
This discussion has been archived. No new comments can be posted.
  • Check out Xserve (Score:5, Informative)

    by Twid ( 67847 ) on Saturday May 21, 2005 @04:50PM (#12600798) Homepage
    At Apple we sell the Xserve Cluster node [apple.com] which has been used for clusters as large as the 1,566 node COLSA [apple.com] cluster. We also sell it in small turn-key [apple.com] configurations.

    Probably the most interesting news lately for OS X in HPC is the inclusion of Xgrid with Tiger. Xgrid is a low-end job manager that comes built into Tiger Client. Tiger Server can then control up to 128 nodes in a folding@home-style job management model. I've seen a lot of interest from customers in using this instead of tools like Sun Grid Engine for small clusters.

    You can find some good technical info on running clustered code on OS X here [sourceforge.net].

    The advantage of the Xserve is that it runs cooler and uses less power than either Itanium or Xeon, and it usually beats Opteron as well, depending on the system. In my experience almost all C or Fortran code comes straight over from Linux and runs fine on OS X with minimal tweaking. The disadvantage is that you only have one choice: a dual-CPU 1U box - no blades, no 8-CPU boxes, just the one server model. So if your clustered app needs lots of CPU power per node it might not be a good fit. For most sci-tech apps, though, it works fine.

    If you're against OS X but still like the Xserve, Yellow Dog makes an HPC-specific [terrasoftsolutions.com] Linux distro for the Xserve.
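
A rough sketch of what driving Xgrid from a script could look like, wrapping the Tiger-era xgrid(1) command-line client from Python. The controller hostname, password, and exact flags below are assumptions; check the xgrid man page on your Tiger install before relying on them.

    # Sketch: run a command through an Xgrid controller via the xgrid CLI.
    # Hostname and password are placeholders; flags assumed from the Tiger-era client.
    import subprocess

    CONTROLLER = "xgrid-controller.example.com"  # hypothetical controller host
    PASSWORD = "secret"                          # client password, if one is set

    def xgrid(*args):
        """Run the xgrid CLI against the controller and return its stdout."""
        cmd = ["xgrid", "-h", CONTROLLER, "-p", PASSWORD] + list(args)
        return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

    if __name__ == "__main__":
        # Run a trivial command; the controller farms it out to an available agent.
        print(xgrid("-job", "run", "/usr/bin/cal", "2005"))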

  • well... (Score:5, Informative)

    by croddy ( 659025 ) on Saturday May 21, 2005 @04:53PM (#12600820)
    At the moment we have a rack of Dell PowerEdge 1750s. They're very nice for our OpenSSI cluster, with the exception of the disk controller. Despite assurances from Dell that the MegaRAID unit is "Linux supported", we're now stuck with what's got to be the worst SCSI RAID controller in the history of computing.

    We're hoping that upgrading to OpenSSI 1.9 (which uses a 2.6 kernel instead of the 2.4 kernel in the current stable release) will show better disk performance... but... yeah.

  • by FuturePastNow ( 836765 ) on Saturday May 21, 2005 @04:54PM (#12600822)
    It's no dual Opteron server, but this Dan's Data [dansdata.com] article reviews what are probably the cheapest 1U servers you can buy. Definitely something to consider if you're going for cheap.
  • by imsabbel ( 611519 ) on Saturday May 21, 2005 @05:10PM (#12600894)
    Not to mention that if you need more power, you can drop in the new dual-core Opterons without needing anything more than, possibly, a BIOS upgrade.
    (And the dual-core Opterons actually consume less power than some of the early steppings of the single-core ones.)
  • by Anonymous Coward on Saturday May 21, 2005 @05:11PM (#12600899)
    My $.02 worth...

    I build Linux and Apple clusters for biotech, pharma, and academic clients. I mention this because clusters designed for life-science work tend to have different architecture priorities than, say, clusters used for CFD or weather prediction :) Suffice it to say that bioclusters are rate-limited by file I/O and are tuned for compute-farm-style batch computing rather than full-on Beowulf-style parallel processing.

    I've used *many* different platforms to address different requirements, scale out plans and physical/environmental constraints.

    The best whitebox vendor I have used is Rackable Systems (http://www.rackable.com/ [rackable.com]). They truly understand cooling and airflow issues, have great 1U half-depth chassis that let you get near-blade density with inexpensive mass-market server mainboards, and they have great DC power distribution kit for larger deployments.

    For general purpose 1U "pizza box" style rackmounts I tend to use the Sun V20z's when Opterons are called for but IBM and HP both have great dual-Xeon and dual-AMD 1U platforms. For me the Sun Opterons have tended to have the best price/performance numbers from a "big name" vendor.

    Two years ago I was building tons of clusters out of Dell hardware. Now nobody I know is even considering Dell; they are no longer on my radar -- their endless pretend games of "considering" AMD-based solutions are getting tired, and until they start shipping some Opteron-based products they are not going to be a player of any significant merit.

    The best blade systems I have seen are no longer made -- they were the systems from RLX.

    What you need to understand about blade servers is that the biggest real savings you get for the added price come from the reduction in administrative burden and ease of operation. The physical form factor and environmental savings are nice, but often not as important as the operational/admin/IT savings.

    Because of this, people evaluating blade systems should place a huge priority on the quality of the management, monitoring, and provisioning software provided by the blade vendor. This is why RLX blades were better than those of any other vendor, even big players like HP, IBM, and Dell.

    That said, the quality of whitebox blade systems is usually pretty bad -- especially concerning how they handle cooling and airflow. I've seen one bad deployment where the blade rack needed 12-inch ducting brought into the base just to force enough cool air into the rack to keep the mainboards from tripping their emergency temperature shutdown probes. If forced to choose a blade solution, I'd grade first on the quality of the management software and then on the quality of the vendor. I am very comfortable purchasing 1U rackmounts from whitebox vendors, but I'd probably not purchase a blade system from one. Interestingly enough, I just got a Penguin blade chassis installed and will be playing with it next week to see how it does.

    If you don't have a datacenter, special air conditioning, or a dedicated IT staff, then I highly recommend checking out Orion Multisystems. They sell 12-node desktop and 96-node deskside clusters that ship from the factory fully integrated, and best of all they run off a single 110V electrical outlet. They may not win on pure performance when going head to head against dedicated 1U servers, but Orion by far wins the prize for "most compute power you can squeeze out of a single electrical outlet."

    I've written a lot about clustering for bioinformatics and life science. All of my work can be seen online here: http://bioteam.net/dag/ [bioteam.net] -- apologies for the plug but I figure this is pretty darn on-topic.

    -chris

  • Re:Check out Xserve (Score:3, Informative)

    by Twid ( 67847 ) on Saturday May 21, 2005 @05:21PM (#12600932) Homepage
    That's simply not true. We stock parts for all Apple equipment for at least seven years. The drive caddies for the old Xserve G4 are exactly the same as what we ship in the current Xserve RAID, so not only is the drive caddy still available, it's also still in use [apple.com].

    Actually, the cheapest way to go would be to buy the 250GB ADM, Apple part number M9356G/A, and pull out the 250GB drive and use it somewhere else (or just use that drive). Service parts tend to be pricey (like everyone else's).
  • by chris.dag ( 22141 ) on Saturday May 21, 2005 @06:18PM (#12601209) Homepage

    It depends on the specific code whether it meets your criterion of "twice as fast"... some apps will be more than twice as fast; some will be slightly faster, equal, or in some cases slower.

    For more general use cases (at least in my field) I can give a qualified answer of "dual Opteron currently represents the best price/performance ratio for small SMP (2-4 CPUs)".

    I've also seen cheaper pricing from Sun than what you mentioned. You are right, though, that there is a price difference between Xeon and Opteron - whenever I consider a more expensive alternative I tend to have fresh app benchmark data handy to back up the justification.

    -Chris
  • SunFire Servers (Score:5, Informative)

    by PhunkySchtuff ( 208108 ) <kai&automatica,com,au> on Saturday May 21, 2005 @06:26PM (#12601292) Homepage
    SunFire v20z or v40z Servers.
    http://www.sun.com/servers/entry/v20z/index.jsp [sun.com]
    http://www.sun.com/servers/entry/v40z/index.jsp [sun.com]
    They're the entry-level servers from Sun, so they have great support. They're on the WHQL List, so Windows XP, 2003 Server and the forthcoming 64-bit versions all run fine.
    They also run Linux quite well, and as if that wasn't enough, they all scream along with Solaris installed.
    The v20z is a one- or two-way Opteron box in a 1RU case. The v40z is a two- or four-CPU box that is available with single- or dual-core Opterons.
    Plus, they're one of the cheapest, if not the cheapest, Tier 1 Opteron servers on the market.
  • by Anonymous Coward on Saturday May 21, 2005 @06:53PM (#12601523)
    You are wrong. In *most* cases, cheap is the way to go (but so few people realize it).

    Short argument: Google chose the cheapest way; they are happy with it and don't plan to change their mind.

    Long argument: What you don't realize is that cheap hardware is so DAMN cheap that you CAN afford to employ the manpower required to maintain your cluster. In other words: instead of spending a lot of money on expensive hardware, buy the cheapest gear (where the price/performance ratio is optimal), and then spend a fraction of the money you saved to employ technicians who will maintain/replace defective hardware. Moreover, with such cheap hardware you don't even care about the warranty/support; you could trash it directly without even bothering to contact the manufacturer for replacement parts! Nowadays $300 is all you need to get a mobo + AMD Sempron 2800 + 512 MB RAM + 80 GB hard disk + 100 Mbps NIC + PSU. So why spend $3000 on a dual-Opteron server when you could buy 10, TEN, fscking white boxes for the same price? The only reason would be that the software cannot be parallelized, or that you have constraints on the space occupied by your clusters (that's why I said "in *most* cases cheap is the way to go"). But even in those cases you could maybe spend the money to rent more racks/space, and then you would still save money by buying those white boxes.
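
A back-of-the-envelope version of that argument, with entirely made-up prices, throughput scores, and failure rates, might look like this in Python:

    # Sketch: compare aggregate throughput per dollar of cheap whiteboxes vs.
    # one dual-Opteron server. Every number here is hypothetical.
    BUDGET = 3000.0  # dollars to spend

    def aggregate_throughput(node_cost, node_speed, annual_failure_rate):
        """Nodes affordable within the budget, derated by expected downtime."""
        nodes = int(BUDGET // node_cost)
        return nodes * node_speed * (1.0 - annual_failure_rate)

    whitebox = aggregate_throughput(node_cost=300, node_speed=1.0, annual_failure_rate=0.10)
    opteron = aggregate_throughput(node_cost=3000, node_speed=4.0, annual_failure_rate=0.02)
    print(f"whitebox farm: {whitebox:.1f}  dual Opteron: {opteron:.1f}")
    # With these made-up numbers the ten whiteboxes win on raw throughput,
    # but only if the workload parallelizes and rack space is cheap.
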
  • Re:SunFire Servers (Score:3, Informative)

    by eryanv ( 719447 ) on Saturday May 21, 2005 @07:24PM (#12601737)
    The Sun 1U servers are great products for clusters. Last year our research group purchased around 72 v60x's (dual Xeons) and they've never given us a problem. Sun doesn't sell this model anymore, as they've pretty much dropped Intel for AMD, but they run Linux with ClusterMatic just fine. Not only that, but Sun has the best technical support I've ever encountered; I don't mind when we do need to call them up to fix something.
    Besides that, we also have several clusters of midsized towers from a local company, which have the benefit of being able to be used as desktop machines when the cluster gets replaced.
  • It Depends... (Score:4, Informative)

    by xanthan ( 83225 ) on Saturday May 21, 2005 @07:35PM (#12601814)
    I think most of my posts to Slashdot begin with "It Depends..." =)

    Answering this... It Depends...

    What is your cluster's tolerance to failure? If a node can fail, then you have the option of buying a lot of cheap hardware and replacing it as necessary. This is the way that most big web farms work.

    What are your per-machine requirements? Do you have heavy I/O? Does cache memory matter? Do you need a beefy FSB and 64G of RAM per node? You may find that spending $3000/node ends up being cheaper than buying three $1000 nodes, because the $3000 node is capable of processing more per unit time than the three $1000 units combined.

    What is your power/rack cost constraint? Google is an invalid comparison simply because of their size. They boomed when a lot of people were busting and co-los were hungry for business. I'd bet they got a sweetheart of a deal in terms of space, power, and pipe. You are not Google, and I doubt you have a similar deal. Thus, you may find that there is a middle ground where it is better to get a more powerful machine in order to use less rack space/power.

    In the end, you have to optimize between these three variables. You'll probably find that the solution, for you specifically, is going to be unique. For example, you may find that node failure is acceptable since the software will recover, power/rack costs are high enough that you have to limit yourself, and CPU power with a good cache is crucial, but I/O isn't. That means a cheaper Athlon-based motherboard with so-so I/O and cheesy video is a good choice, since it frees your budget for a fast CPU. Combine that with the cheapest power supply the systems can tolerate and PXE boot, and you have your ideal system.

    Best of luck.
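
One way to make that trade-off concrete is to rank candidate nodes by total cost per unit of throughput over a few years, folding in rack space and power. All of the figures in this sketch are placeholders to replace with your own quotes, benchmarks, and co-lo rates:

    # Sketch: rank node configurations by three-year cost per unit of sustained
    # throughput, including rack space and power. Every number is hypothetical.
    RACK_COST_PER_U_YEAR = 250.0    # assumed co-lo charge per U per year
    POWER_COST_PER_WATT_YEAR = 1.0  # assumed $/watt-year for power + cooling
    YEARS = 3

    candidates = [
        # (name, purchase price $, throughput score, rack units, watts)
        ("cheap Athlon whitebox", 1000, 1.0, 2, 250),
        ("dual-Opteron 1U", 3000, 3.5, 1, 350),
    ]

    for name, price, score, units, watts in candidates:
        total = (price
                 + units * RACK_COST_PER_U_YEAR * YEARS
                 + watts * POWER_COST_PER_WATT_YEAR * YEARS)
        print(f"{name}: ${total / score:.0f} per unit of throughput over {YEARS} years")
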
  • by algae ( 2196 ) on Saturday May 21, 2005 @08:21PM (#12602086)
    Agreed, that's what my company is using. As an added bonus, their reliability is amazing as long as you go with tier 1 hardware manufacturers. We discovered after I'd installed lm_sensors on our cluster that one of the machines had a dead heatsink-fan and dead exhaust fans, and yet the CPU was only at 77 degrees C and was still running under full load. I'd like to see a Xeon do that.
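
For what it's worth, the kind of check that catches a dead fan like that can be scripted against lm_sensors. The label patterns below are assumptions to adapt, since `sensors` output varies by chip and driver:

    # Sketch: flag dead fans and hot CPUs by parsing `sensors` (lm_sensors) output.
    # Output formats differ between chipsets, so these regexes are assumptions,
    # not a universal parser.
    import re
    import subprocess

    TEMP_LIMIT_C = 70.0

    def check_sensors():
        out = subprocess.run(["sensors"], capture_output=True, text=True).stdout
        for label, rpm in re.findall(r"^(fan\d+):\s*(\d+)\s*RPM", out, re.MULTILINE):
            if int(rpm) == 0:
                print(f"WARNING: {label} reads 0 RPM (dead or disconnected fan?)")
        for label, temp in re.findall(r"^([^:\n]*temp[^:\n]*):\s*\+?([\d.]+)", out,
                                      re.MULTILINE | re.IGNORECASE):
            if float(temp) > TEMP_LIMIT_C:
                print(f"WARNING: {label} at {temp} C exceeds {TEMP_LIMIT_C} C")

    if __name__ == "__main__":
        check_sensors()
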
  • by Glonoinha ( 587375 ) on Saturday May 21, 2005 @08:49PM (#12602235) Journal
    That's a cool looking rig, but it lists at about $850 (Australian, like $700 US) with no CPU, RAM, or drives.

    Here [dell.com] is a Dell PowerEdge 750 1U server with a P4 2.8GHz HyperThreaded, 256M and an 80G SATA (with room to add another drive and up to 4G of memory) for $499 shipped to your door. Yea, I'm a Dell fanboy, but even if I wasn't I would still see a pretty good price point in that box.

    Note that this specific box is pretty low-end and could use some upgrades, but it is a complete machine ready to run, especially if you want to go on the cheap.
  • by sirsnork ( 530512 ) on Saturday May 21, 2005 @10:33PM (#12602731)
    Not sure why you think Xeons have better I/O than Opterons, but I would spend some serious time looking into the on-chip memory controllers on Opterons, and also do the maths to establish the maximum amount of data a Xeon-based system's front-side bus can move.
  • Re:Check out Xserve (Score:3, Informative)

    by Twid ( 67847 ) on Saturday May 21, 2005 @11:33PM (#12602936) Homepage
    Well, you can think what you want, but I'm not a paid Apple shill surfing Slashdot for posting opportunities. I saw the story go up, and since I do cluster pre-sales at work, I thought I had some topical comments. I even mentioned running Linux on the Xserves. :)

  • by jimshep ( 30670 ) on Sunday May 22, 2005 @09:09AM (#12604437)
    We recently purchased a 13-node dual-Opteron cluster from Penguin Computing after evaluating clusters from them, Dell, IBM, and HP, and single memory image machines from SGI. Penguin provided the solution with the best price/performance as well as ease of use. They let us benchmark our codes on some of their test clusters to determine whether the Opteron- or Xeon-based clusters would be better suited to us.

    My favorite aspect of their system, besides price, was the plug-and-play setup. The cluster was shipped to our site fully assembled, configured, and tested. All we had to do was roll it out of the crate, plug in the power and network connections, configure the network settings in the OS, and start running our simulations. All of the solutions from the other vendors would have required significant setup time on our behalf unless we spent a large amount of money on services.

    I also really like the Scyld operating system that was included with the cluster. It makes the cluster work almost like a single memory image machine. Scyld on the compute nodes is set up to download the kernel image and necessary libraries from the master node at boot-up, so any changes made to the master node automatically propagate to the compute nodes.

    After several months of running simulations, the cluster has not given us any problems. It has been very reliable (never needed a reboot), and their technical support has been very responsive about answering the questions we had during the initial startup.
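
To make the "almost a single machine" point concrete, here is a small sketch of the kind of check that style of cluster allows from the master node. It shells out to bpsh, the bproc remote-execution tool Scyld ships; the node range and exact invocation are assumptions to adjust for a given install:

    # Sketch: from the master node, confirm each compute node reports the same
    # kernel as the master. Uses bpsh (Scyld/bproc remote execution); the node
    # range below is an assumption for a 13-node cluster like the one described.
    import subprocess

    def remote_kernel(node):
        """Run `uname -r` on a compute node via bpsh and return the result."""
        out = subprocess.run(["bpsh", str(node), "uname", "-r"],
                             capture_output=True, text=True, check=True)
        return out.stdout.strip()

    master = subprocess.run(["uname", "-r"], capture_output=True, text=True).stdout.strip()
    for node in range(12):  # compute nodes 0-11 on a hypothetical 13-node cluster
        kernel = remote_kernel(node)
        status = "OK" if kernel == master else "MISMATCH"
        print(f"node {node}: {kernel} [{status}]")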
