Why Does Current Clustering Require Recoding? 75
AugstWest asks: "I've been doing some research into what the available clustering options are for pooling CPU resources, and it looks like most of the solutions I've found require that programs be re-written to take advantage of the cluster. Since there are virtualization apps like Bochs and VMWare, where the applications just make use of a virtual CPU as if it was a real CPU, why aren't there clustering solutions that do this as well?"
latency? (Score:5, Insightful)
why aren't there clustering solutions that do this as well?
Because it's a lot faster to address a local CPU than it is to send that info down the wire to a remote CPU? And because of that latency, it's a lot easier to keep 2 or more local CPUs in sync than it is to keep 2 or more remote CPUs in sync?
You need to recode to work around the severe latency of communicating over a network cable--so you design your apps to minimize messaging between CPUs. Some apps can do this well--they don't need results from other CPUs to complete their own work.
Other applications require CPUs to work in tandem, and for each CPU to have to wait while the results are served out over GigE would suck some serious ass, even if it might be technically possible.
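A back-of-the-envelope sketch of that gap (the timing constants below are illustrative assumptions, not measurements):

```python
# Toy cost model: time to run N operations when each one must wait on a
# network round trip, vs. staying on a local CPU.
# Both constants are rough illustrative figures, not benchmarks.

LOCAL_OP_NS = 1           # ~1 ns per arithmetic op on a local core
NETWORK_RTT_NS = 100_000  # ~100 us round trip over gigabit Ethernet

def run_time_ns(n_ops, per_op_overhead_ns):
    """Total time when every op pays a fixed communication overhead."""
    return n_ops * (LOCAL_OP_NS + per_op_overhead_ns)

local = run_time_ns(1_000_000, 0)
remote = run_time_ns(1_000_000, NETWORK_RTT_NS)
print(remote / local)  # each op is ~100,000x slower when it waits on the wire
```

Which is why minimizing the number of messages, not the size of them, is usually the first recoding step.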
Re:latency? (Score:3, Informative)
Example from my work: we tend to write scratch files of several hundred megs to several gigs, and then perform read/write operations on them continually during a calculation. If the disk isn't local to the process, you end up flooding the network and bringing everything to a screeching halt.
In a Mosixish/Condor type environment, you then have to deal with which processes,
No, the hard part is ... (Score:3, Interesting)
Performance (Score:3, Insightful)
Because it's virtualization, and thus hurts performance?
Qemu Virtualization (Score:2)
In the end, you'll get better performance and compatibility out of coding for a cluster, rather than having the overhead and redirection of the virtualization process.
Because it's hard. (Score:3, Insightful)
Sure, you could make a clustering application that would run arbitrary x86 code on separate machines, but it would be many orders of magnitude slower than just running the code on one big Xeon.
Hell, it's hard enough to take a single thread and spread its work across multiple execution units within one CPU for out-of-order execution, and harder still to do it across multiple CPUs in a single box. Why would it be easy across a network cable? Or have I completely misunderstood the question?
cooking lessons (Score:4, Insightful)
As it stands today, an OS cannot easily share tasks. But there exist some tasks which are more easily shareable than others. I imagine within a century we'll be able to share tasks more easily, and I think the CELL chip is meant to ease this transition, but I could be wrong.
Re:cooking lessons (Score:2)
Re:cooking lessons (Score:4, Insightful)
If it's HA, you'd get 10 cooks to each make a roast. Sure, you'd end up cooking extra meat, but that doesn't matter - the goal here is to guarantee that a roast will be cooked no matter what. (I can imagine two copies of Bochs running on separate physical machines but linked to run in absolute lock-step. Performance might be impaired, but reliability will be there.)
If it's performance, then you're right, you can't magically glue two computers together and get twice the performance.
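The lock-step idea in miniature (a sketch, not a real replication protocol): feed the same inputs to two independent copies of a deterministic state machine and cross-check every step.

```python
# Sketch of lock-step redundancy: two independent "replicas" of a
# deterministic state machine receive identical inputs and must agree
# at every step. This buys fault detection, not speed -- both replicas
# do all of the work.

def make_replica():
    state = {"total": 0}
    def step(inp):
        state["total"] += inp
        return state["total"]
    return step

replica_a, replica_b = make_replica(), make_replica()

for inp in [3, 1, 4, 1, 5]:
    out_a, out_b = replica_a(inp), replica_b(inp)
    assert out_a == out_b, "replica divergence: fail over or halt"

print(replica_a(0))  # both replicas agree on the running total: 14
```

Real lock-step systems also have to make interrupts, clocks, and I/O deterministic, which is the genuinely hard part.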
Your example... (Score:2)
Yes, all the answers above were quite sufficient to explain why you have to re-code your app if you want it to run _faster_ when you add _more_ nodes. And it is so easy to make it run slower -- I bet the original poster would benefit from trying to re-code a sequential program into a parallel one at least once, then ask himself "How the heck can I teach
Re:Your example... (Score:1)
Re:Your example... (Score:1)
I've never seen that one attributed to anyone but Fred Brooks. (But it's still funny.)
part of the issue (Score:4, Informative)
This is in addition to the handling of resources such as database connections and other shared resources across the distributed cluster. I'm not exactly sure what your specific needs are, but when you separate threads across different physical memory spaces, it creates significant problems to overcome. If you just want to virtualize the application (so one machine, many virtual machines, one physical memory), then the recoding should be trivial. And I agree, in this isolated case, no recoding should be necessary. But most of the time, clustering entails spanning multiple physical memories, and thus the application needs to be designed to handle these difficulties.
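A minimal sketch of that divide: threads in one process get shared memory for free, while separate "nodes" have to serialize and ship everything explicitly (the distributed half below only simulates the nodes with pickle, it doesn't actually run on a cluster).

```python
# Threads in one process share one physical memory: a plain variable works.
# Separate nodes do not: state must be serialized and shipped explicitly.
import pickle
import threading

def shared_memory_sum(values):
    total = [0]                      # visible to every thread for free
    lock = threading.Lock()
    def worker(v):
        with lock:
            total[0] += v
    threads = [threading.Thread(target=worker, args=(v,)) for v in values]
    for t in threads: t.start()
    for t in threads: t.join()
    return total[0]

def distributed_sum(values, n_nodes=2):
    # Simulated nodes with no shared memory: each chunk is serialized,
    # "sent" to a node, summed there, and the partial result is sent back.
    chunks = [values[i::n_nodes] for i in range(n_nodes)]
    partials = []
    for chunk in chunks:
        wire = pickle.dumps(chunk)             # explicit message out
        partial = sum(pickle.loads(wire))      # work done on the "node"
        partials.append(pickle.loads(pickle.dumps(partial)))  # message back
    return sum(partials)
```

Same answer either way, but the second version forces you to decide what crosses the wire and when -- which is exactly the design work the recoding entails.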
Infiniband (Score:2)
You can only use solutions that exist for the technology you're using. Likewise, though, you're not limited by constraints on technologies you're not using.
Mosix (Score:5, Insightful)
http://www.mosix.org/ [mosix.org]
openMosix (Score:2, Informative)
MOSIX License (Score:3, Informative)
Agreed.
Not by OSI/DFSG/FSF standards. The license [mosix.org] is still very restrictive. I think the kernel patches might be under GPL, but certainly not the user tools.
This is certainly true. Most talent jumped ship & openMosix does have a higher number of active developers (and is somewhat backed by AMD (though I think AMD can and should give more developers to
Re:Mosix (Score:1)
Not a single moderator gave me a point for this. Not to whine, but what kind of a clue bat does it take to get the CORRECT answers modded up at Slashdot?
Re:Mosix - a great answer, but not for everything. (Score:2, Insightful)
Beowulf clusters typically are designed for specific purposes and software is written to take advantage of the design. You can't
Re:Mosix - a great answer, but not for everything. (Score:2)
Maybe not for 2+2, but you could for large numbers which are not atomic to add. If it takes linear time to do a task on one processor, on the Connection Machine it could basically end up being lg n time.
Cf. here [stanford.edu]
TANSTAAFL (Score:5, Insightful)
Re:TANSTAAFL (Score:1)
Ask Slashdots like this one, with succinct replies like this one, make me wish there were an option to immediately archive a story with just one comment if enough people vote for it.
Re:TANSTAAFL (Score:1)
I'm very happy that my clusters don't require Forth.
duh (Score:3, Informative)
Basically, what's wrong with this idea is the clustering software has no way of knowing what it can chunk up and spit out to other nodes unless the programmer of the software in question tells it. Some multithreaded programs can be run on clusters without a rewrite, but there is already clustering software for that application. What the OP is suggesting is similar to rerouting highway traffic by arbitrarily plucking cars off the highway and putting them on random side streets. They all may get there eventually and, at first, it may seem like they are moving faster, but in the end it just takes everyone a lot longer to reach their destination. Now, if the drivers themselves planned alternate routes to help alleviate congestion on the highways, then there's a good chance everyone would get to their destinations faster.
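The "drivers planning their own routes" point is just the programmer partitioning the work explicitly. A toy sketch of what that looks like:

```python
# The programmer, not the clustering software, knows where the safe
# split points are. Here the split is explicit: independent chunks,
# one partial result per chunk, combined at the end.

def partition(work, n_workers):
    """Split work into n_workers independent chunks (round-robin)."""
    return [work[i::n_workers] for i in range(n_workers)]

def word_lengths(chunk):
    return [len(w) for w in chunk]

words = ["beowulf", "cluster", "of", "these", "things"]
chunks = partition(words, 2)
results = [word_lengths(c) for c in chunks]   # each chunk could run on its own node
merged = [n for chunk_result in results for n in chunk_result]
print(sorted(merged))
```

No generic layer could have found that split automatically, because only the programmer knows that the chunks don't depend on each other.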
Re:duh (Score:2)
Then again, it's even more like chopping up the highway into short, disconnected, side-by-side pieces, giving the group some number of go-karts (depending on the problem), and telling each person to drive down a different piece than they started on.
openMosix (Score:2)
Re:openMosix (Score:2)
but unless your app is heavily CPU bound it will probably stick to its home node.
If the app does much I/O, like, say, processing batches of PostScript files, it will probably stay on its home node, unless you manage to get global block devices working to convince Mosix that the I/O is as good from any node.
Mosix sounds good because you don't have to "do" anything special but most apps won't benefit from it.
Re:openMosix (Score:3, Informative)
(And, for your particular example, mosix has a number of schedulers & you can schedule manually. You can trivially send one postscript file to each node. Of course you can do this "braindead" clustering with a script, but it
Re:openMosix (Score:2)
Re:openMosix (Score:2)
Wrong problem. (Score:2)
Compilers (Score:4, Interesting)
In my shameless search for a site to cite, I found this http://www-unix.mcs.anl.gov/dbpp/ [anl.gov] which covers lots of problems that have to be solved.
I'd love to see a language (or language extension) cleanly define a way to let me define a code block attributes which could affect how and where it gets executed. The runtime library could then distribute that block as the environment best allows.
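A rough sketch of what such an annotation might look like in Python (the `distributable` decorator and its `policy` argument are invented names for illustration, not a real library):

```python
# Hypothetical sketch: the programmer marks a function as distributable,
# and a runtime decides where it actually executes. A real runtime might
# ship the function to remote nodes; this sketch just fans out over a
# local thread pool to show the shape of the interface.
import concurrent.futures

def distributable(policy="any_node"):
    def decorate(fn):
        def run_over(items):
            with concurrent.futures.ThreadPoolExecutor() as pool:
                return list(pool.map(fn, items))  # preserves input order
        fn.run_over = run_over
        return fn
    return decorate

@distributable(policy="any_node")
def square(x):
    return x * x

print(square.run_over([1, 2, 3, 4]))  # [1, 4, 9, 16]
```

The point is that the *call site* never changes -- only the attribute on the block tells the runtime it is free to move the work.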
Re:Compilers (Score:3, Informative)
Re:Compilers (Score:1)
Omni/SCASH: Cluster-enabled Omni OpenMP on a software distributed shared memory system SCASH [phase.hpcc.jp]
Re:Compilers (Score:2)
Re:Compilers (Score:3, Interesting)
The synchronize keyword was closer to where I was going. Suppose Java had a thread modifier keyword for looping operators. You could then:
each iteration of the looping block launches as a different task running i
Re:Compilers (Score:2)
They talk about OpenMP [openmp.org], (as The Boojum mentioned) and they use it in a way analogous to what you're describing there... an example: (Damnit... slashcode fuxors up the indenting...)
Listing 4: Implementation of replication sort
Listing 4: Implementation of replication sort (Score:2)
par (element=0; element<SIZE; element++) {
  seq {
    par (element2=0; element2<SIZE-1; element2++) {
      ifselect(element>element2) {
        if(uList[element] > uList[element2])
          comp[element][element2] = 1;
      } else ifselect (element<=element2) {
Re:Compilers (Score:1)
Re:Compilers (Score:4, Informative)
The venerable occam [wotug.org] programming language requires that each block of code be specifically identified as being executable either in parallel or sequentially. Since PAR and SEQ constructs can be nested it is easy to build up quite complex concurrent structures that can easily be distributed. Since the semantics of occam processes are derived from Hoare's CSP process algebra the compositional nature of occam's parallelism is theoretically sound, and avoids many of the problems associated with thread-based concurrency model that most people are familiar with.
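occam's processes talk only over channels, never through shared variables. A rough analogue of a two-process PAR using Python threads and queues (an illustration of the style, not occam semantics):

```python
# CSP-style sketch: two processes composed in parallel, connected only
# by channels (queues). No shared variables; just messages, with None
# as an end-of-stream marker.
import queue
import threading

def producer(out_ch):
    for i in range(5):
        out_ch.put(i)
    out_ch.put(None)                 # end-of-stream marker

def doubler(in_ch, out_ch):
    while True:
        v = in_ch.get()
        if v is None:
            out_ch.put(None)
            break
        out_ch.put(v * 2)

a, b = queue.Queue(), queue.Queue()
# PAR: run both processes concurrently
threads = [threading.Thread(target=producer, args=(a,)),
           threading.Thread(target=doubler, args=(a, b))]
for t in threads: t.start()
for t in threads: t.join()

results = []
while True:
    v = b.get()
    if v is None:
        break
    results.append(v)
print(results)  # [0, 2, 4, 6, 8]
```

Because the processes only touch their channels, moving one of them to another machine means swapping the queue for a socket -- the program logic doesn't change, which is the distributability occam gets for free.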
Re:Compilers (Score:2)
Take a look at OpenMP [openmp.org].
It takes a commercial compiler, but it's straightforward, an open specification, can be used "automagically", and is portable across machines and languages.
It does not work on a clustered system, though, only on one that has local processors and memory.
What type of cluster do you want? (Score:3, Informative)
But all of them require that you add something to the original program which distributes the work (load balancing/render farms). If you want your original program to run in parallel then that is a much harder problem to solve. Basically you'll have to remake it into something like the above.
The last problem would basically require the computer to extract threads out of your code. This is pretty much impossible to do automatically though.
Re:What type of cluster do you want? (Score:2)
Re:What type of cluster do you want? (Score:2)
And I imagine that if you have a complex site with dynamic content from a database you'll still need to optimise your design to get as much use as possible from the load balancing.
What you want... (Score:2)
Oh, that sounds like a tough one to me. Ok, ok, it's actually not that tough - but it DOES require combining a number of kernel patches, there's no one-stop-shop (at the moment) for this. It also requires that network connections be IPv6, as there's bugger all mobility support out there for IPv4 for Linux
Re:What you want... (Score:2)
Because bandwidth is scarce. (Score:3, Interesting)
If your problem isn't that parallelizable and yet you need a whole cluster of computers to run it, odds are you need more efficiency than distributed shared memory can give you. You can access memory on your own node with orders of magnitude more bandwidth and less latency than on other nodes, and if your application doesn't take that into consideration it can run orders of magnitude slower.
Of course, that doesn't apply to every problem, and there are people trying to create exactly the cluster-as-computer architecture you'd like to see for ease of application programming. Check out OpenMosix and MigShm for one example - I haven't used the latter DSM patch myself but I know that for non-shared-memory programs, Mosix has had working process migration code for years.
Re:Because bandwidth is scarce. (Score:1)
This is a basic systems question. (Score:5, Informative)
[Why must] programs be re-written to take advantage of the cluster.
The simple answer is that programs, in general, are written as single threaded applications with shared state (memory). A cluster is the opposite of that - multiple parallel CPUs without shared state (or at least requiring one to be explicit about shared state, as opposed to simply declaring a variable).
Usually a program's algorithm has to be completely re-designed in order to take advantage of the cluster while mitigating these problems. At minimum the program must be parallelized. If you don't change the program to successfully deal with the latency of non-local memory, the cluster ends up barely more powerful than a single fast computer running the program.
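The re-design step in miniature: a serial left-to-right reduction has a chain of dependencies, so it gets rewritten as a balanced tree of independent pairwise merges -- the shape cluster code tends to take (a sketch; a real version would run each layer's merges on different nodes):

```python
# Serial reduction: each step depends on the previous one, so nothing
# can run in parallel no matter how many CPUs you have.
def serial_reduce(values):
    acc = values[0]
    for v in values[1:]:
        acc = max(acc, v)          # step i depends on step i-1
    return acc

# Tree reduction: every pairwise merge within a layer is independent,
# so a layer of n/2 merges could be spread across n/2 nodes.
def tree_reduce(values):
    layer = list(values)
    while len(layer) > 1:
        nxt = [max(a, b) for a, b in zip(layer[::2], layer[1::2])]
        if len(layer) % 2:         # odd element carries into the next layer
            nxt.append(layer[-1])
        layer = nxt
    return layer[0]

data = [4, 9, 2, 7, 5]
print(serial_reduce(data), tree_reduce(data))  # 9 9
```

Same answer, different algorithm -- and only the second one has any parallelism for a cluster to exploit.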
The reason you are asking this question is that you don't realize that a cluster is fundamentally different than a single (or dual or quad) CPU. The architecture is completely different. You can't expect to treat it like any old computer.
-Adam
mod parent up (Score:2)
Re:mod parent up (Score:2)
Re:mod parent up (Score:2)
anyw
What about the mythical TOE and RDMA (Score:1)
Re:What about the mythical TOE and RDMA (Score:2)
Generally it looks like he's asking for a VM to be an abstraction layer over a cluster. The problem is abstractions are simplifications, and you can't just simplify away the real problems of a cluster. There are some solutions tha
VirtualIron is probably what you're looking for (Score:2)
http://www.virtualiron.com/ [virtualiron.com]
My try (Score:3, Interesting)
The virtual machines you mention all run on a single existing system. You want a virtual machine that runs on multiple systems. That goes way beyond what the existing VMs do. They just implement the hardware instructions of a single system in software running on a single system. Taking that implementation and spreading it out among multiple systems means anticipating every clustering problem the code might raise, and solving it in advance.
Nobody knows how to do that. If they did, they'd implement it as the back end of a compiler rather than waste the overhead of using a VM.
(They say that there are no stupid questions. Not true. But there are lame stupid questions, and interesting stupid questions. My vocation [picknit.com] is answering interesting stupid questions, which is why I'm grateful for this one!)
why don't you do a little experiment? (Score:2)
i can guarantee that the serial algorithm is about a day's worth of effort to implement; however, the parallel one will require at least a week. as you start working through the parallel implementation, you'll quickly discover that all the things that are true in a shared
Holy Grail (Score:4, Insightful)
Your language doesn't support it (Score:4, Informative)
The only way you'll have source code that compiles and runs unmodified on architectures of widely varying parallelism efficiently is for the language itself to know about parallelism, and make it the compiler's (and even runtime-linker and kernel's) job to parallelize your code for you. An inherently parallel language would have ways for you to specify in your source code what can and cannot be executed in parallel, and what code absolutely depends on the serial execution of some previous code. Even then, we're really only talking about the SMP case. When you start involving network latencies and bandwidth restrictions, the decisions on when and how to parallelize become more challenging for the compiler/runtime, possibly requiring either more intelligence on its part and/or more meta-information in your source code.
Until you write code in a language like that, you can never expect to write code in a single-threaded mindset and then have it just magically take advantage of a parallel environment.
completely different issues (Score:1)
Virtualization is a many-program, one-CPU situation. Various software tricks make each of several programs think they have an entire system to themselves. In reality it all runs inside a virtualizer/emulator. Speed
VMware doesn't virtualize the CPU (Score:1)
Re:VMware doesn't virtualize the CPU (Score:2)
cilk (Score:1)
They also had a couple of graduate students that made Jilk
The halting problem (Score:2)
I bring this up because it seems like, in order to partition the tasks efficiently, you'd basically need to be able to predict what the program was trying to do in advance... and if you could predict what a program was going to do before actually running it, you'd have just solved the halting problem.