
Submission: Supercomputers' growing resilience problems

angry tapir writes: "As supercomputers grow more powerful, they'll also grow more vulnerable to failure, thanks to the increased amount of built-in componentry. Today's high-performance computing (HPC) systems can have 100,000 nodes or more, with each node built from multiple components: memory, processors, buses and other circuitry. Statistically speaking, all these components will fail at some point, and they halt operations when they do, said David Fiala, a Ph.D. student at North Carolina State University, during a talk at SC12. Today's techniques for dealing with system failure may not scale very well, Fiala said."
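
To put rough numbers on the "statistically speaking" point, here is a minimal sketch of how the expected time between failures shrinks as node count grows. It assumes node failures are independent and exponentially distributed, and that any single node failure halts the whole job. The 5-year per-node MTBF and the function names are hypothetical, chosen for illustration; they are not figures from the talk.

```python
import math


def system_mtbf_hours(node_mtbf_hours: float, node_count: int) -> float:
    """Mean time between failures for the whole machine.

    Assumes independent, exponentially distributed node failures and
    that one node failure halts the job, so failure rates simply add.
    """
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    return node_mtbf_hours / node_count


def prob_failure_within(hours: float, node_mtbf_hours: float, node_count: int) -> float:
    """Probability that at least one node fails within `hours`."""
    system_rate = node_count / node_mtbf_hours  # failures per hour
    return 1.0 - math.exp(-system_rate * hours)


if __name__ == "__main__":
    node_mtbf = 5 * 365 * 24  # hypothetical: 5 years per node, in hours
    for nodes in (1_000, 10_000, 100_000):
        mtbf = system_mtbf_hours(node_mtbf, nodes)
        p_day = prob_failure_within(24, node_mtbf, nodes)
        print(f"{nodes:>7} nodes: system MTBF {mtbf:8.3f} h, "
              f"P(failure within 24 h) = {p_day:.3f}")
```

Under these assumptions the system MTBF is just the per-node MTBF divided by the node count, so a job spanning 100,000 such nodes would expect an interruption in well under an hour, even though each individual node is reliable for years.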

  • The joke in the industry is that supercomputing is a synonym for unreliable computing. Stuff like checkpoint-restart was basically invented on supercomputers because it was so easy to lose a week's worth of computations to some random bug. When you have one-off systems or even 100-off systems, you just don't get the same kind of field testing that you get with regular off-the-shelf systems that sell in the millions.

    Now that most "super-computers" are mostly just clusters of off-the-shelf systems we get a dif

  • "they may," some other guy said.
