Forgot your password?
typodupeerror
Databases Programming Software IT Technology

Learning High-Availability Server-Side Development? 207

Posted by kdawson
from the servers-not-breaking-a-sweat dept.
fmoidu writes "I am a developer for a mid-size company, and I work primarily on internal applications. The users of our apps are business professionals who are forced to use them, so they are are more tolerant of access times being a second or two slower than they could be. Our apps' total potential user base is about 60,000 people, although we normally experience only 60-90 concurrent users during peak usage. The type of work being done is generally straightforward reads or updates that typically hit two or three DB tables per transaction. So this isn't a complicated site and the usage is pretty low. The types of problems we address are typically related to maintainability and dealing with fickle users. From what I have read in industry papers and from conversations with friends, the apps I have worked on just don't address scaling issues. Our maximum load during typical usage is far below the maximum potential load of the system, so we never spend time considering what would happen when there is an extreme load on the system. What papers or projects are available for an engineer who wants to learn to work in a high-availability environment but isn't in one?"
This discussion has been archived. No new comments can be posted.

Learning High-Availability Server-Side Development?

Comments Filter:
  • 2 words (Score:2, Informative)

    by andr0meda (167375)
    • Re: (Score:3, Interesting)

      by teknopurge (199509)
      I just finished reading that paper and was left with the impression that I had just wasted 10 minutes. I could not find a single insightful part of their algorithm - and in fact can enumerate several 'prior art' occurrences form my CPSC 102 class during my undergrad - all were lab assignments.

      I did, however, find this sentence disturbing:

      However, given that there is only a single master, its failure is unlikely; therefore our current implementation aborts the MapReduce computation if the master fails.

      Huh? So, because there is only one master it is unlikely to fail? This job takes hours to run. This is similar to saying that if you have one web server, it is unlikely

      • Re: (Score:2, Funny)

        by teknopurge (199509)
        I can feel the grammar Nazi's stalking me even now...
      • While the condescending attitude might make you feel better about yourself, it seems that they took this "lab assignment" and honed it into a system to make themselves a few bucks.

        Oh, and they also use it all the time on one of the world's largest data warehouses.
      • Re: (Score:3, Informative)

        by PitaBred (632671)
        A single point of failure is better than multiple points of failure, though, where any one failing would stop things dead in the water (think how a RAID0 array is less reliable than a single drive by itself, statistically). I'd hope that anyone working at Google would realize that, and would mean that, rather than the meaning you took from it :)

        But who knows... you could be right. I'm just playing devil's advocate.
        • by Zwack (27039)
          But RAID0 isn't intended to be reliable... It's intended to be FAST. RAID 1 is intended to be reliable...
          RAID 0+1 is intended to be fast and reliable.

          A single master is a single point of failure. However if that server failing doesn't cause running issues then they can ignore that single point of failure as it doesn't matter to production. I would imagine that the code on the Master is well tested by now (and may be very simple anyway) which just means that they now have to worry about hardware failures
      • No fallacy.... (Score:5, Insightful)

        by encoderer (1060616) on Thursday August 23, 2007 @02:20PM (#20333479)

        Huh? So, because there is only one master it is unlikely to fail?

        Yes. If you take that sentence in context, the answer is "Yes." Compared to the likelihood that one of the thousands of worker-machines will fail during any given job, it IS unlikely that the single Master will fail. Moreover, while any given job may take hours to run, it also seems that many take just moments. Furthermore, just because a job may take hours to run doesn't mean it's CRITICAL that it be completed in hours. And, at times when a job IS critical, that scenario is addressed in the preceeding sentence: It is easy for a caller to make the master write periodic checkpoints that the caller can use to restart a job on a different cluster on the off-chance that a Master fails.

        If a job is NOT critical, the master fails, the caller determines the failure by checking for the abort-condition, and then restarts the job on a new cluster.

        It's not a logical fallacy, nor is it a bad design.

        For the benefit of anyone reading thru, here is the parapgraph in question. It follows a detailed section on how the MapReduce library copes with failures in the worker machines.

        It is easy to make the master write periodic checkpoints of the master data structures described above. If the master task dies, a new copy can be started from the last checkpointed state. However, given that there is only a single master, its failure is unlikely; therefore our current implementation aborts the MapReduce computation if the master fails. Clients can check for this condition and retry the MapReduce operation if they desire.
  • Well... (Score:4, Informative)

    by Stanistani (808333) on Thursday August 23, 2007 @11:08AM (#20330685) Homepage Journal
    You could start by reading this book [oreillynet.com] for a practical approach:

    Zawodny is pretty good...
    • Re:Well... (Score:4, Informative)

      by tanguyr (468371) <tanguyr+slashdot@gmail.com> on Thursday August 23, 2007 @11:35AM (#20331043) Homepage
      I also recommend the book Building Scalable Web Sites [oreilly.com], also from O'Reilly. Loads of good ideas on clustering, performance monitoring, even some ideas on scaling the development process itself. Scalability and high availability are not the same thing, but much of the material covered in this book is relevant to both. /t
      • by tholomyes (610627)
        I just picked "Building Scalable Web Sites" up four or five weeks ago and I'll second that recommendation; the book is really well written and actually a fairly quick read, a rarity even among O'Reilly books. It covers a lot of ground comprehensively, and is organized in a way that makes sense.
    • by PhrostyMcByte (589271) <phrosty@gmail.com> on Thursday August 23, 2007 @11:41AM (#20331119) Homepage

      Is that most of them have poor native APIs when it comes to scalability. Some of them have something like

      handle = query("SELECT...");
      /*do something*/
      result = wait(handle);

      But that is far from optimal. When will they be smart and release an async API that notifies you via callback when complete? This would be very useful for apps that need maximum scalability.

      Microsoft's .NET framework is actually a great example of doing the right thing - it has these types of async methods all over the place. But then you have to deal with cross-platform issues and problems inherent with a GC.

      It's not that much different for web frameworks either. None that I've tried (RoR, PHP, ASP.NET) have support for async responding - they all expect you to block execution should you want to query a db/file/etc. and just launch boatloads of threads to deal with concurrent users. I guess right now with hardware being cheaper it is easier to support rapid development and scale an app out to multiple servers.

      • by Khazunga (176423) on Thursday August 23, 2007 @12:19PM (#20331667)

        Most databases have async APIs. Postgresql and mysql have them in the C client libraries. Most web development languages, though, do not expose this feature in the language API, and for good reason. Async calls can, in rare cases, be useful for maximizing the throughput of the server. Unfortunately, they're more difficult to program, and much more difficult to test.

        High scale web applications have thousands of simultaneous clients, so the server will never run out of stuff to do. Async calls have zero gain in terms of server throughput (requests/s). It may reduce a single request execution time, but the gain does not compensate the added complexity.

        • The only async apis they have are like the example I gave before. These are sub-optimal!

          It's true that with a single server handling 10, 100, 200 RPS, the stupid threaded model will likely not make a big difference in _throughput_. It will make a MASSIVE difference in CPU/RAM usage though, and let you easily scale up to 10000 RPS on commodity hardware using just a single thread. Some people like to maximize their hardware usage.

          And async is certainly not much more difficult - it's a new way of thinki

          • by nuzak (959558)
            I would be very surprised to see a database that didn't offer per-row callback functions in its call level API -- even SQLite has them. I won't hazard any guesses about MySQL, googling for the subject turned up too much PHP noise to be conclusive.

            I actually find async programming with a good API to be easier, because everything's an event, and you don't have to design the flow of control of everything else around constantly returning to poll for results, or deal with the locking and race conditions if you
      • Interesting.

        Do you think that non-blocking IO really offers enough performance gains to compensate for the resulting spaghetti code? This isn't a rhetorical question, I'm really curious.
        • by Jhan (542783)

          Why the spaghetti?

          Troubling blocking IO code in C++:ish pseudo:

          result1 = doIo(foo); // Blocking IO, wait 5s
          ... // Maybe do other things
          result2 = doIo(bar); // Blocking IO, wait 5s
          ... // Maybe do other things
          // Result 1 and 2 are first used here
          print("Result was "+resullt1+"&"+result2);
          // Min 10s to get here

          So, add this object

          class Unblocker {
          __operationHandle operationInProgress;
          __Unblocker(parameter) {
          ____operationInProgress = sendIo(parameter); // Non-blocking version of doIo above, as pr

      • by bidule (173941)

        Is that most of them have poor native APIs when it comes to scalability. Some of them have something like

        handle = query("SELECT..."); /*do something*/
        result = wait(handle);

        But that is far from optimal. When will they be smart and release an async API that notifies you via callback when complete? This would be very useful for apps that need maximum scalability.

        I don't understand what is wrong with that.

        Are you unhappy about the /*do something*/ part, because you'd want the handle released early to minimize t

        • see this comment [slashdot.org] for more along the lines of what I'm talking about.

          I'm unhappy about the wait() call because it doesn't lend itself to fully async coding - if you've got nothing to do in that context, you're stuck blocking the thread when it could be doing other things. So now you have to waste CPU on context switches and waste RAM on state for a new thread.

          A good callback-based API doesn't have these deficiencies. You just call a function to dequeue completion callbacks, from however many threads yo

          • I'm unhappy about the wait() call because it doesn't lend itself to fully async coding - if you've got nothing to do in that context, you're stuck blocking the thread when it could be doing other things. So now you have to waste CPU on context switches and waste RAM on state for a new thread.

            So? Yuo can spend a little bit of cpu time and run more threads or, since the load on the DB is likely to be the bottleneck, get more boxes. Hardware is cheap. Debug time is not, and async programming is harder than

      • by Jhan (542783)

        Is that most of them have poor native APIs when it comes to scalability. Some of them have something like

        handle = query("SELECT...");
        /*do something*/
        result = wait(handle);
        </ecode
        But that is far from optimal. When will they be smart and release an async API that notifies you via callback when complete? This would be very useful for apps that need maximum scalability.
        </i></blockquote>
        <p>Since you seem to maybe be talking circumspisciously about Java:
        <ecode>
        final Foo objectT

        • by Jhan (542783)

          What I ment to say was of course:

          Since you seem to maybe be talking circumspectedly about Java:

          final Foo objectToNotify = theFoo;
          Thread.run(new Runnable() {
          __public void run() {
          ____handle = query("SELECT...");
          ____/*do something*/
          ____result = wait(handle);
          ____objectToNotify.notify(result);
          __}
          });
      • Re: (Score:3, Interesting)

        by Kenneth Stephen (1950)

        I'm afraid the parent post is an example of not seeing the forest because of all the trees.....

        Application code should never ever be aware of deployment issues. Making it aware of such things a sure way to ensure nightmares when your environment changes. For example, lets say you have to send mail. You could take the option of always talking to localhost under the assumption that your app will always be deployment on a machine with a mail server. But consider the case when the app is taken and deployed in

        • did you reply to the wrong comment? async has nothing to do with deployment or failover, it's just a different method for being notified when an operation is done.
    • Zawodny? Tay Zawodny?
  • Generally, Saas (software as a service) providers have to scale their apps. The development issues they have are more or less solved. Look it up on Google... ('saas scalability problem').
  • Here goes... (Score:3, Informative)

    by Panaflex (13191) * <convivialdingoNO@SPAMyahoo.com> on Thursday August 23, 2007 @11:15AM (#20330785)
    Ok, first up:
    1. Check all your SQL and run it through whatever profiler you have. Move things into views or functions if possible.
    2. CHECK YOUR INDEXES!! If you have SQL statements running that slow, the likely cause is not having proper indexes for the statements. Either make an index or change your SQL.
    3. Consider using caching. For whatever platform you're on there's bound to be decent caching.

    That's just the beginning... but the likely cause of most of your problems. We could go on for a month about optimizing.. but in the end if you just stuck with what you have and checked your design for bottlenecks you could get by just fine.
  • by dominux (731134) on Thursday August 23, 2007 @11:17AM (#20330823) Homepage
    start by being clear about what you want to achieve. If it is HA then you want to look at clustering, failover, network topology, DR plans etc. If it is HP then look for the bottlenecks in the process, don't waste time shaving nanoseconds off something that wasn't bothering anyone. At infrastructure level you might think about cacheing some stuff, or putting a reverse proxy in front of a cluster of responding servers. In general disk reads are expensive but easily cached, disk writes are very expensive and normally you don't want to cache them, at least not for very long. Network bandwidth may be fast or slow, latency might be an issue if you have a chatty application.
    • Exactly right. I'd like to add that if you want to write really scalable code then use an asynchronous approach as much as possible. Some programming languages and toolkits make this easy, some make it hard, but it's possible in any. If your database server is slow responding to your application server, make sure your app server can do useful work while it's waiting. The same is true of communication between parts of the server.

      I'd thoroughly recommend that you learn Erlang, if you haven't already. Th

      • http://www.couchdb.com/ [couchdb.com] is a distributed replicable non-relational database written in Erlang. It is a very clever system and I was impressed with the language choice of the developer [damienkatz.net].
      • by leuk_he (194174)
        Agreed.

        High performance=short response times. In your case you can think about caching more and tuning the system and database access. Maybe you can make the application more scalable, but once you move the database to different server than the application you first get some extra (network) overhead instead of performance, specially in low load situations. And more iron/servers means more money.

        High availably [wikipedia.org] is about a 24x7 and no singe point of failure. One method for this is clustering [wikipedia.org] (more application
      • by demi (17616) *

        Luckily, we now have a good Erlang book in print (again): Programming Erlang by Joe Armstrong. Learn it, live it, love it.

        The language is almost certainly not suited to the kind of task described (it might be, but it's unlikely)

        I disagree. Erlang is perfectly good for general programming tasks and particularly well-suited for the sorts of demands placed on public web applications (which is sort of the undercurrent of the requester's question, I think). And while it's true that messaging, lightweight para

        • The reason I wouldn't recommend Erlang for this kind of task is that it's string manipulation sucks. Native strings are ASCII (WTF) and are stored in an incredibly inefficient way. You end up having to write your own code mapping between unicode and binaries (unless this has made it into OTP while I wasn't looking). Even then, the string manipulation syntax is a pain.

          Calling C code from Erlang is not a task for the faint-hearted either. You have to write a port driver to do it safely, which is a lot

    • Re: (Score:3, Insightful)

      by fifedrum (611338)
      clustering, word up (if we're allowed to use old catch phrases like that)

      disk reads and writes are the least of our troubles when we scaled much more than a small enterprise level of data. The sheer number of moving parts in our environment (not just physical parts, but bits flowing too) killed productivity and we wound up with the complete inability to cache anything.

      There's simply too much data flowing back and forth to make caching pay for itself and too often will a hard drive fail requiring even more b
  • by Applekid (993327) on Thursday August 23, 2007 @11:20AM (#20330873)
    Our in-house applications don't get built around performance at all (personally I find it disappointing but I don't write the rules... yet). We generally scale outwards: replicated databases, load distribution systems, etc.

    Many of the code guidelines we have established are to aid in this. Use transactions, don't lock tables, use stored procedures and views for anything complicated, things like that.

    I guess my answer is that we delegate it to the server group or the dba group and let them deal with it. I guess this means the admins there are pretty good at what they're doing. :)
    • by gatesvp (957062)

      I'm in a small shop and we do this too. Truth is, we don't even have a real DBA, but a few of us know SQL Server really well. The reason we actually do it this way is cost. On small projects, Dev time is really expensive, server resources are not. If you can support 30 more clients with one $5k server, then it's simply not worth Dev time to stress over performance.

      Truth is, if performance is becoming an issue, then the project should be generating enough revenue to justify the Dev time spent on performan

  • by Nibbler999 (1101055) <<tom_atkinson> <at> <fsfe.org>> on Thursday August 23, 2007 @11:21AM (#20330881) Homepage
    There are some good presentations on the web about how youtube, digg, google etc handle their scaling issues. Here's an example: http://video.google.com/videoplay?docid=-630496435 1441328559 [google.com]
  • check these out... (Score:4, Informative)

    by BillAtHRST (848238) on Thursday August 23, 2007 @11:25AM (#20330939)
    These are both decent starting points. Please report back if you find something good -- I'd be very interested.
    http://highscalability.com/ [highscalability.com]
    http://www.allthingsdistributed.com/ [allthingsdistributed.com]
  • by funwithBSD (245349) on Thursday August 23, 2007 @11:25AM (#20330941)
    Stress testing? Use LoadRunner or some other tool to simulate users.

    If you are using Java on Tomcat, BEA, or Websphere, use a product like PerformaSure to see a call tree of where your Java program is spending it's time. Sorts out how long each SQL takes too, and shows you what you actually sent. If you have external data sources, like SiteMinder, it will show that too.

    If you mean "What happens if we lose a bit of hardware" simulate the whole thing on VMware on a single machine and kill/suspend VMs to see how it reacts.

    Most importantly, MAKE SURE YOU MODEL WHAT YOU ARE TESTING. IF you are not testing a scaled up version of what users actually do, you have a bad test.
  • You'll find that Erlang doesn't even blink at those volumes, and that Erlang's entire reason to exist is scalability/reliability. Granted, it's a little severe to pick up a new language, but the benefits are enormous, and it's one of those boons you can't really understand until you've learned it. It is, however, worth noting that transactions on an MNesia database in the multiple gigabytes are typically faster than PHP just invoking MySQL in the first place, let alone doing any work with it.

    Erlang is dif
    • Slightly off topic (Score:3, Insightful)

      by Gazzonyx (982402)
      I keep hearing about Erlang being the next greatest thing since sliced bread... unfortunately, I don't have time to look into it too much. Could someone give me an 'elevator' pitch on what makes it so great for threading? Is it encapsulation based objects, a thread base class, or what? How does it handle cache coherency on SMP?
      • by Lord Grey (463613) *
        Erlang's Wikipedia entry [wikipedia.org] is longer than an elevator pitch, but has some decent information. Erlang's primary site is here [erlang.org].
      • by stonecypher (118140) <stonecypherNO@SPAMgmail.com> on Thursday August 23, 2007 @01:02PM (#20332243) Homepage Journal
        It's not something you can cram into an elevator pitch; erlang is an entirely different approach to parallelism. If you know how mozart-oz, smalltalk or twisted python work, you've got the basics.

        Basically, processes are primitives, there's no shared memory, communication is through message passing, fault tolerance is ridiculously simple to put together, it's soft realtime, and since it was originally designed for network stuff, not only is network stuff trivially simple to write, but the syntax (once you get used to it) is basically a godsend. Throw pattern matching a la Prolog on top of that, dust with massive soft-realtime scalability which makes a joke of well-thought-of major applications (that YAWS vs Apache [www.sics.se] image comes to mind,) a soft-realtime clustered database and processes with 300 bytes of overhead and no CPU overhead when inactive (literally none,) and you have a language with such a tremendously different set of tools that any attempt to explain it without the listener actually trying the language is doomed to fall flat on its face.

        In Erlang, you can run millions of processes concurrently without problems. (Linux is proud of tens of thousands, and rightfully so.) Having extra processes that are essentially free has a radical impact on design; things like work loops are no longer nessecary, since you just spin off a new process. In many ways it's akin to the unix daemon concept, except at the efficiency level you'd expect from a single compiled application. Every client gets a process. Every application feature gets a process. Every subsystem gets a process. Suddenly, applications become trees of processes pitching data back and forth in messages. Suddenly, if one goes down, its owner just restarts it, and everything is kosher.

        It's not the greatest thing since sliced bread; there are a lot of things that Erlang isn't good for. However, what you're asking for is Erlang's original problem domain. This is what Erlang is for. I know, it's a pretty big time investiture to pick up a new language. Trust me: you will make all your time back in writing far shorter, far more obvious code than you did in learning the language. You can pick up the basics in 20 hours. It's a good gamble.

        Developing servers becomes *really* different when you can start thinking of them as swarms.
        • From what you and Raven64 have said, my interest in Erlang is piqued! I think I'll give it a shot when I have some downtime this semester (I'm always hearing "try Ruby, lisp, perl, $silverBulletLanguage", and I just don't have the time). I just have one question about what you've said:

          Suddenly, applications become trees of processes pitching data back and forth in messages.

          We aren't talking a win32 style message pump kind of message passing mechanism, are we? I truly can't stand the message pump in win32 - it always feels like such a 'hack'; I don't have a better solution, though, so I've be

          • Re:Impressive (Score:4, Informative)

            by stonecypher (118140) <stonecypherNO@SPAMgmail.com> on Thursday August 23, 2007 @07:17PM (#20337375) Homepage Journal

            We aren't talking a win32 style message pump kind of message passing mechanism, are we?
            No. Many people have also raised the question of whether MSMQ, the new OS-level messaging service, is modelled on Erlang's; again, the answer is no.

            The problem is, it's hard to explain why. The overhead of using things like that is tremendous; Erlang's message system is used for quite literally all communication between processes, and a system like Windows Events or MSMQ would reduce Erlang applications to a crawl. Erlang uses an ordered, staged mailbox model, much like Smalltalk's. If you haven't used Smalltalk, then frankly I'm not aware of another parallel.

            It's important to understand just how fundamental message passing is in Erlang. Send and receive are fundamental operators, and this is a language that doesn't have for loops, because it thinks they're too high level and inspecific (you can make them yourself; I know, that must sound crazy, but once you get it, it makes perfect sense.)

            I truly can't stand the message pump in win32 - it always feels like such a 'hack'; I don't have a better solution, though, so I've been waiting for a better form of IPC.
            You're about to see a completely different approach. I'm not saying it's the best, or the most flexible, but I really like it, and it genuinely is very different. What Erlang does can relatively straightforwardly be imitated with blocking and callbacks in C, but that involves system threads, and then you start getting locking and imperative behavior back, which is one of the things it's so awesome to get rid of (imagine - no more locks, mutexes, spin controls and so forth. Completely unnessecary, both in workload, debugging and in CPU time spent. It's a huge change.)

            Really, it's a whole different approach. You've just got to learn it to get it.

            Yet Raven64 said there is no shared memory, so I'm confused on how the message passing happens.
            No, I said that. I wrote some code to help explain it to you, though of course slashdot's retarded lameness filters wouldn't pass it, so I put it behind this link [tri-bit.com]. Sorry it's not inline.

            Hopefully that will help. Sorry about the lack of whitespace; SlashDot's amazingly lame lameness filter is triggering on clean, readable code.
      • by TheRaven64 (641858) on Thursday August 23, 2007 @03:30PM (#20334481) Journal
        Erlang does some things very well. For good parallel programming, you need a language that enforces one rule: data may be mutable or aliased, but not both. If you don't have this rule, then debugging complexity scales with the factorial of the degree of parallelism. Erlang does this by making processes the only mutable data type. There are a few things that make it a nice language beyond that:
        • Very low overhead processes; creating a process in Erlang is only slightly more expensive than making a function call in C.
        • Higher order functions.
        • Pattern matching everywhere (e.g. function arguments, message receiving, etc). If you've want two different behaviours for a function depending on the structure of the data that it is passed (e.g. handlers for two different types of packet, with different headers) you can write two version of the function with a pattern in the argument.
        • Guard clauses on functions, lets you implement design-by-contract and also lets you separate out validation of arguments from the body of a function, giving cleaner code.
        • Simple message passing syntax, with pattern matching on message receive for out-of-order retrieval.
        • Asynchronous message delivery; very scalable.
        • Lists and tuples as basic language primitives.
        • Gorgeous binary syntax. I've never seen a language as good as Erlang for manipulating binary data.
        • Automatic mapping of Erlang processes to OS threads, allowing as many to run concurrently as you have CPUs.
        • Network message delivery, allowing Erlang code with only slight modifications to send messages over the network rather than to local processes (the message sending code is the same, only how you acquire the process reference is different).
        There are also a few down sides to the language:
        • The preprocessor is even worse than C's, so metaprogramming is hard (and badly needed; patterns like synchronous message sending or futures require a lot of copy-and-pasting).
        • Implementing ADTs is ugly (but no worse than C).
        • Variables are single static assignment, which is a cheap cop-out for the compiler writer and makes code convoluted at times.
        • Message sending and function call syntax is very different for no good reason. You are meant to wrap exposed (public) messages in function, which makes things even more messy.
        • Calling code in other languages is a colossal pain.
        • The API is inconsistent (e.g. some modules manipulating ADTs take the ADT as the first argument, some take it as the last).
        Erlang is a great language for a lot of tasks, particularly servers, but it's not suited for everything.
        • Uh.

          Implementing ADTs is ugly (but no worse than C).

          Whereas the current state of things leaves a lot to desired, it's nowhere near as bad as C; that's why ETS and DETS are effectively polymorphisms of one another.

          metaprogramming is hard (and badly needed; patterns like synchronous message sending or futures require a lot of copy-and-pasting).

          I see no reason that synchronous message passing needs any cut and pasting; why not just make a module to implement it? As far as futures, frankly I think they're misgu

  • I have a question sort of along the same line. I interviewed for a position at a very large internet company, and one of their primary concerns was very high performance and scalability. I went through the phone interviews and then the in-person interviews, and I actually did quite well, and was even told that I did quite well. However, in the end, I was told that while I did well, they would have liked to see more experience with very large web applications (I've worked at smaller companies). So, how d
  • From working on both academic and enterprise software designed specifically to scale, these are four things I've noticed are incredibly important to scalability:

    Languages - I recently saw a multi-million dollar product fail because of performance problems. A large part of it was that they wanted to build performance-critical enterprise server software, but wrote it mostly in a language that emphasized abstraction over performance, and was designed for portability, not performance. The language, of course
    • by EastCoastSurfer (310758) on Thursday August 23, 2007 @11:56AM (#20331339)
      Language - Doesn't matter much if you know how to design a scalable system. Some languages like Erlang force you into a more scalable design, but even then it's still easy to mess up. Unless this multi-million dollar project you're talking about was an embedded system I would bet language used was the smallest reason for bad performance. Although it is fun to bash java whenever the chance.

      Libraries - Bingo lets throw out nice blocks of tested and working code b/c it's always better to write it yourself. You pretty much have to use libraries to get things done anymore. And are you suggesting someone should write their own DB software when building a web app? Um, yeah see that web app ever gets done.

      Abstractions - While most are leaky at some point, abstractions make it easier for you to focus on the architecture (which is what you should be focusing on anyways when building scalable systems).

      I see these types of arguments all the time and they rarely make sense. It's like arguing about C vs. Java over 1ms running time difference when if you changed your algorithm you could make seconds of difference or if you changed your architecture you would make minutes of difference...
      • Or the GP was completely wrong... or maybe he has just tighter resources than you.

        All things he said are usefull to improve performance, and can lead to errors that will decrease said performance if you are not carefull enough. Of course, if your performance hits are due to gross architectural errors, you shouldn't even think on looking into them.

    • a language that emphasized abstraction over performance... The language, of course, was Java.

      When has abstraction ever been a strength of java? It has one fucking abstraction, and there are programmers out there who say that sun didn't even get that one right.

  • by John_Booty (149925) <johnbooty@bootyprojec3.14t.org minus pi> on Thursday August 23, 2007 @11:44AM (#20331145) Homepage
    From what I have read in industry papers and from conversations with friends, the apps I have worked on just don't address scaling issues. Our maximum load during typical usage is far below the maximum potential load of the system, so we never spend time considering what would happen when there is an extreme load on the system.

    Is it just me, or is the question hopelessly confused? He's using the term "availability" but it sounds like he's talking about "scalability."

    Availability is basically percentage of uptime. You achieve that with hot spares, mirroring, redundancy, etc. Scalability is the ability to perform well as workloads increase. Some things (adding load-balanced webservers to a webserver farm) address both issues, of course, but they're largely separate issues.

    The first thing this poster needs to do is get a firm handle on exactly WHAT he's trying to accomplish, before he can even think about finding resources to help him do it.
  • by srealm (157581) <prez@Nospam.goth.net> on Thursday August 23, 2007 @11:47AM (#20331189) Homepage
    I've worked in multiple extremely super-scaled applications (including ones sustaining 70,000 connections at any one time, 10,000 new connections each minute, and 15,000 concurrent throttled file transfers at any one time - all in one application instance on one machine).

    The biggest problem I have seen is people don't know how to properly define their thread's purpose and requirements, and don't know how to decouple tasks that have in-built latency or avoid thread blocking (and locking).

    For example, often in a high-performance network app, you will have some kind of multiplexor (or more than one) for your connections, so you don't have a thread per connection. But people often make the mistake of doing too much in the multiplexor's thread. The multiplexor should ideally only exist to be able to pull data off the socket, chop it up into packets that make sense, and hand it off to some kind of thread pool to do actual processing. Anything more and your multiplexor can't get back to retrieving the next bit of data fast enough.

    Similarly, when moving data from a multiplexor to a thread pool, you should be a) moving in bulk (lock the queue once, not once per message), AND you should be using the Least Loaded pattern - where each thread in the pool has its OWN queue, and you move the entire batch of messages to the thread that is least loaded, and next time the multiplexor has another batch, it will move it to a different thread because IT is least loaded. Assuming your processing takes longer than the data takes to be split into packets (IT SHOULD!), then all your threads will still be busy, but there will be no lock contention between them, and occasional lock contention ONCE when they get a new batch of messages to process.

    Finally, decouple your I/O-bound processes. Make your I/O bound things (eg. reporting via. socket back to some kind of stats/reporting system) happen in their own thread if they are allowed to block. And make sure your worker threads aren't waiting to give the I/O bound thread data - in this case, a similar pattern to the above in reverse works well - where each thread PUSHING to the I/O bound thread has its own queue, and your I/O bound thread has its own queue, and when it is empty, it just collects the swaps from all the worker queues (or just the next one in a round-robin fashion), so the workers can put data onto those queues at its leisure again, without lock contention with each other.

    Never underestimate the value of your memory - if you are doing something like reporting to a stats/reporting server via. socket, you should implement some kind of Store and Forward system. This is both for integrity (if your app crashes, you still have the data to send), and so you don't blow your memory. This is also true if you are doing SQL inserts to an off-system database server - spool it out to local disk (local solid-state is even better!) and then just have a thread continually reading from disk and doing the inserts - in a thread not touched by anything else. And make sure your SAF uses *CYCLING FILES* that cycle on max size AND time - you don't want to keep appending to a file that can never be erased - and preferably, make that file a memory mapped file. Similarly, when sending data to your end-users, make sure you can overflow the data to disk so you don't have 3mb data sitting in memory for a single client, who happens to be too slow to take it fast enough.

    And last thing, make sure you have architected things in a way that you can simply start up a new instance on another machine, and both machines can work IN TANDEM, allowing you to just throw hardware at the problem once you reach your hardware's limit. I've personally scaled up an app from about 20 machines to over 650 by ensuring the collector could handle multiple collections - and even making sure I could run multiple collectors side-by-side for when the data is too much for one collector to crunch.

    I don't know of any papers on this, but this is my experience writing extremely high performance network apps :)
  • by smackenzie (912024) on Thursday August 23, 2007 @11:50AM (#20331235)
    I see a lot of recommendations for various technologies, software packages, etc. -- but I don't think this addresses the original question.

    What you are asking about, of course, is enterprise-grade software. This typically involves an n-tier solution with massive attention to the following:

    - Redundancy.
    - Scalability.
    - Manageability.
    - Flexilibility.
    - Securability.
    - and about ten other "...abilities."

    The classic n-tier solution, from top to bottom is:

    - Presentation Tier.
    - Business Tier.
    - Data Tier.

    All of these tiers can be made up of internal tiers. (For example, the Data Tier might have a Database and a Data Access / Caching Tier. Or the Presentation Tier can have a Presentation Logic Tier, then the Presentation GUI, etc.)

    Anyway, my point is simply that there is a LOT to learn in each tier. I'd recommend hitting up good ol' Amazon with the search term "enterprise software" and buy a handful of well-received books that look interesting to you (and it will require a handful):

    http://www.amazon.com/s/ref=nb_ss_gw/002-8545839-8 925669?initialSearch=1&url=search-alias%3Daps&fiel d-keywords=enterprise+software+ [amazon.com]

    Hope this helps.
    • Re: (Score:3, Informative)

      by lycono (173768)
      The list of books in that search that are even remotely related to what the OP was asking is very short. I count 7 total results (of 48) in that list that _might_ be useful. Of which, only 1 actually sounds like it might be what the OP wants:
      • How To Succeed In The Enterprise Software Market (Hardcover): Useless, its about the industry, not about writing the software.
      • Scaling Software Agility: Best Practices for Large Enterprises (The Agile Software Development Series): Useless, it describes how to
  • Statelessness (Score:4, Interesting)

    by tweek (18111) on Thursday August 23, 2007 @12:13PM (#20331559) Homepage Journal
    I don't know if anyone has mentioned it but the key to a web application being scalable horizontally is statelessness. It's much easier to throw another server behind the load balancer than it is to upgrade the capacity on one. I've never been a fan of sticky sessions myself. This requires a different approach to development in terms of memory space and what not. With a horizontally scalable front tier, you can't always guarantee that someone will be talking to the same server on the next request that they were on the previous request. It requires a little more overhead in terms of either replicating the contents of memory between all application servers or on the database tier because you persist everything to the database.

    At least that's my opinion.
    • by sapgau (413511)
      Agreed. I've read this somewhere else confirming the same reasons on why to eventually design your application for scalability from the start no matter how small the application.

      It is a greater headache to convert for example your webapp from file based cookies to persistent cookies in the DB. Of course trying to be careful not to serialize too much data in cookies either.

      my $0.02
    • by demi (17616) *

      Sort of, but this is a naïve understanding. State is required or you're not doing anything interesting. You're just pushing it around. If you're the guy in charge of the web servers, it might seem like having sessions on the app server and plugin based request routing is a good idea, but it just pushes the problem to the app server guy. If you're the app server guy, it might seem like a good idea to put sessions in a database but that just pushes the state there. You're not solving any fundamental prob

    • Unless you have an app that needs to be very tightly written I think the easiest way to write a scalable app is just to break the app down into components that can each be on their own server or duplicated across multiple servers. If each component isn't keeping state for itself then it doesn't matter which copy of a given component you make a request on so you can split tasks between copies with simple load balancing techniques. This also helps keep your application code clean as it makes you keep your com
  • Developers need not worry about HA too much. Your IT department should be able to set this up for you rather seamlessly. With things like LVS/Keepalived you can easily implement load balancing and auto-failover for databases, web servers, etc (you don't even need to code in multiple DB servers; VRRP works wonders for this kind of thing.) As long as the application is designed sanely to begin with, HA as it is typically discussed comes down to minimizing the impact of hardware failure by buying two of everyt
    • Re: (Score:2, Insightful)

      by dennypayne (908203)

      The first two sentences here are one of my biggest pet peeves...if application developers don't start becoming more network-aware, and vice versa, I think you're dead meat. Hint: there are very few applications these days that aren't accessed over the network. I see so many "silos" like this when I'm consulting. The network guys and the app guys have no idea what the "other side" does. Where if they actually worked together on these type issues instead of talking past each other, something actually migh

  • by ChrisA90278 (905188) on Thursday August 23, 2007 @12:25PM (#20331771)
    You are talking about two things: reliability and performance. And there are two ways to measure performance: Latency (what one end user sees) and through put (number of transactions per unit time). You have to decide what to address.

    You can address reliability and through put by invest a LOT of money in hardware and using things like round robin load balancing, clusters and mirrored DBMSes, RAID 5 and so on. Then losing a power supply or a disk drive means only degraded performance.

    Latency is hard to address. You have to profile and collect good data. You may have to write test tools to measure parts of the system in isolation. You need to account for every millisecond before you can start shaving them off

    Of course you could take a quick look for obvious stuff like poorly designed SQL data bases, lack of indexes on joined tables and cgi-bin scripts that require a process to be strarted each time they are called.

  • Lots of Options (Score:3, Interesting)

    by curmudgeon99 (1040054) <curmudgeon99NO@SPAMgmail.com> on Thursday August 23, 2007 @12:34PM (#20331897)

    First of all, excellent question.

    Second: ignore the ass above who said dump Java. Modern hotspots have made Java as fast or faster than C/C++. The guy is not up to date.

    Third: Since this is a web app, are you using an HttpSession/sendRedirect or just a page-to-page RequestDispatcher/forward? As much as its a pain in the ass--use the RequestDispatcher.

    Fourth: see what your queries are really doing by looking at the explain plan.

    Five: add indexes wherever practical.

    Six: Use AJAX wherever you can. The response time for an AJAX function is amazing and it is really not that hard to do Basic AJAX [googlepages.com].

    Seven: Use JProbe to see where your application is spending its time. You should be bound by the database. Anything else is not appropriate.

    Eight: Based on your findings using JProbe, make code changes to, perhaps, put a frequently-used object from the database into a class variable (static).

    These are several ideas that you could try. The main thing that experience teaches is this: DON'T optimize and change your code UNTIL you have PROOF of where the slow parts are.

    • Re: (Score:3, Informative)

      by nuzak (959558)
      The VM speed is not Java's problem. The decrepit servlet architecture, which was designed from the start around one-thread-per-request, is. Anything that fixes this architecture is essentially a patch on a broken system. Even if you escape, you're going to find that many JavaEE components will have you buying stock in RAM manufacturers.

      A good JMS provider is nice to have for HA though. Nothing like durable message storage to help you sleep well.
    • Re:Lots of Options (Score:4, Insightful)

      by Wdomburg (141264) on Thursday August 23, 2007 @03:22PM (#20334349)
      Six: Use AJAX wherever you can. The response time for an AJAX function is amazing and it is really not that hard to do Basic AJAX.

      AJAX can be a performance win. It can also be a nightmare if done poorly. I've seen far too many "web 2.0" applications that flood servers with tons of AJAX calls that return far too little data without a consideration for the cost (TCP connections aren't free, logging requests isn't free).

      Response time is also variable. What feels 'amazing' local to the server can be annoyingly slow over an internet connection, especially if the design is particularly interactive.

      Couple things I'd suggest:

      1) Don't do usability testing on a LAN. An EV-DO card wouldn't be a bad choice for an individual. For a larger scale development environment a secondary internet connection works well.

      2) Remember that a page can be dynamic without AJAX. Response time toggling the display property of an object is far more impressive than establishing a new network connection and fetching the data.

      3) Isolate AJAX interfaces in their own virtual host so that you can use less verbose logging for API calls. This is a good idea for images as well.
  • by Jailbrekr (73837) <jailbrekr@digitaladdiction.net> on Thursday August 23, 2007 @12:39PM (#20331963) Homepage
    you work for Skype, don't you?
  • It sounds like you need to some basic disaster planning. Think in terms of "what if this happens?"

    Like you loose your data center? How good is your backup, is it off site, do you have a tested plan for restoring the data and system on an interm basis on someone's system?

    Then you can look at some more specific things, what happens if I loose this server, this connection, this router, and specific services, DNS, Email, etc.

    The big question $$$ depends on how much you have to loose. If you can afford a day of
  • These really are 2 different things. Though they do sometimes cross over - oracle RAC is a good example of that.

    As for where to read from a developer perspective? (which alot of people replying seemed to have missed the actual question). There are TONNES.

    But split the question in two, where can i read about HA:
    start here-> http://en.wikipedia.org/wiki/High_availability [wikipedia.org] Theres also many books on the subject (i remember one of the few i happened to like is the things that came out of the sun blueprints boo
  • Look for a job where they got lots of oltp!
  • Software Engineering for Internet Applications [greenspun.com] will guide you in the right direction.
  • On modern hardware, on an internal network, "a second or two" is an eternity. Instead of worrying about what would happen if all 60,000 people used the app at once (unlikely), I'd find the bottlenecks you have now and fix those.

    Prioritize. You have statistics already about typical usage, and typical wait and service times. Fix the problem that exists, instead of the problem that doesn't, but might someday.

    • by Shados (741919)
      for many complex, distributed, multi-tier applications, especially those doing heavy real time calculations across multiple systems, and web apps doing pretty specific user session monitoring and personalisation, a second or two is pretty freagin blazing actually. A second or two is only an eternity if talking about a stand alone system doing simple stuff with only a few boundaries to cross, if any.
      • by Bluesman (104513)
        From the original question:

        "The type of work being done is generally straightforward reads or updates that typically hit two or three DB tables per transaction. So this isn't a complicated site and the usage is pretty low."

        There is no reason this should taking multiple seconds. He has a basic problem there. Now is not the time to be thinking about multiple distributed tiers, as much as he wishes he were working on a really complex, cool system. He's not.

        All of the people chiming in with their own details
        • by Shados (741919)
          Definately possible, yes. Though when you have 60 thousand users, the 2-3 tables per transactions you're hitting could very well be hundreds of gigs. And the database is only ONE part of the equation in a big system, and definately can take multiple seconds. That a transaction takes 1-2 sec doesn't take away from the SSL and authentication, querying the user specific properties. The app may be graphic intensive, who knows :) Thats why I was just replying to the main question which had to do with high availa
  • You can get massive savings in processing by using various caching techniques. Caching lets you save the results of one process for use later.

    1. Client side cache. Most developers shudder when they think of a web page being cached on the browser. However, some pages (like help pages, new articles) do not change with real time and can be stored on the client's browser for a few minutes. Learn how to use the HTTP Caching directives to reduce the number of unique pages requested by each user.

    2. HTML Output
  • Blueprints for High Availability , Evan Marcus and Hal Stern, second edition. http://www.amazon.com/Blueprints-High-Availabilit y -Evan-Marcus/dp/0471430269/ref=cm_taf_title_featur ed?ie=UTF8&tag=tellafriend-20 [amazon.com]

    Deals with the subject of high availability from the IT side rather than programming, but anyone dealing with HA systems needs to understand these issues.
  • Well ... I've read what others wrote here and I don't think you got many actual answers (welcome to slashdot :) ).

    While there were some (very) good points about both scalability and HA, they didn't tell you how to go about learning that; HA and HS are two areas where by reading the books or following case studies, you can understand the basic problems, but not see how you actually go about building a particularly scalable or HA system (because it's usually a system, not a single server).

    I've worked in maint

  • The users of our apps are business professionals who are forced to use them, so they are are more tolerant of access times being a second or two slower than they could be.


    Actually, being forced to use your app doesn't make them more tolerant of delays. It makes *you* more tolerant because your users can't go away. They still hate the delays.

  • by Shados (741919) on Thursday August 23, 2007 @06:20PM (#20336637)
    Task queuing to deal with server downtimes, and horizontal scalability.

    The first is handled by just about any messaging/queue system. J2EE has had one for ages, Microsoft has MSMQ that recently (better late than never... ::SLAPS::) integrated it directly in .NET via WCF, and there are others. In its simplest form, you really just send your jobs to a "queue", and have automated processes pick em up and handle em. If the processes go down, they'll just handle them when they get back up, so even a whole database server farm going down at the same time won't make you lose queued up requests. Nifty (it of course gets more complicated than that, but the basic scenarios can be learned by following an internet tutorial).

    Then horizontal scaling. Why horizontal? Because just taking a random new box and plugging in it the network is easier and faster (especially in case of emergency) than having to take servers down to upgrade them (vertically scaling). Also adds to redundancy, so the more servers you add to your farm, the less likely your system will go down. There are documents on it all over (Microsoft Patterns&Practices has some on their web sites, non-MS documentation is hard to miss if you google for it, and many third partys will be more than happy to spam you with their solutions), but it really just come down to: "Use an RDBMS that handles clustering and table partitioning, use distributed caching solutions, push as much stuff on the client side (stuff that doesn't need to be trusted only!), and make sure that nothing ever depends on ressources that can only be accessed from a single machine (think local flat files, in process session management, etc)".

    With that, no matter what goes down, things go on purring, and if someone ever bitch that the system is slow, you just buy a 1000$ server, stick a standard pre-made image on the disk, plug it in, have fun.

    Oh, and fast network switches are a must :)

Dennis Ritchie is twice as bright as Steve Jobs, and only half wrong. -- Jim Gettys

Working...