Learning High-Availability Server-Side Development?
Posted by kdawson on Thu Aug 23, 2007 09:57 AM
from the servers-not-breaking-a-sweat dept.
fmoidu writes "I am a developer for a mid-size company, and I work primarily on internal applications. The users of our apps are business professionals who are forced to use them, so they are more tolerant of access times being a second or two slower than they could be. Our apps' total potential user base is about 60,000 people, although we normally experience only 60-90 concurrent users during peak usage. The type of work being done is generally straightforward reads or updates that typically hit two or three DB tables per transaction. So this isn't a complicated site and the usage is pretty low. The types of problems we address are typically related to maintainability and dealing with fickle users. From what I have read in industry papers and from conversations with friends, the apps I have worked on just don't address scaling issues. Our maximum load during typical usage is far below the maximum potential load of the system, so we never spend time considering what would happen when there is an extreme load on the system. What papers or projects are available for an engineer who wants to learn to work in a high-availability environment but isn't in one?"
207 comments
2 words (Score:2, Informative)
(http://www.andr0meda.com/ | Last Journal: Monday September 10, @06:29AM)
map reduce
http://labs.google.com/papers/mapreduce.html [google.com]
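For anyone who has not read the paper, the programming model can be sketched in a few lines. This is only a single-process toy, assuming a word-count job; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and none of the distribution, scheduling, or fault handling that the paper is really about appears here.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit (word, 1) pairs for one input document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values seen for one key."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on one machine."""
    intermediate = chain.from_iterable(map_phase(d) for d in documents)
    return dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
```

In the real system each map and reduce call can run on a different worker machine, which is why the failure handling discussed in the replies matters.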
Re:2 words (Score:5, Funny)
(http://www.last.fm/user/uhlume/)
No fallacy.... (Score:5, Insightful)
Yes. If you take that sentence in context, the answer is "Yes." Compared to the likelihood that one of the thousands of worker-machines will fail during any given job, it IS unlikely that the single Master will fail. Moreover, while any given job may take hours to run, it also seems that many take just moments. Furthermore, just because a job may take hours to run doesn't mean it's CRITICAL that it be completed in hours. And, at times when a job IS critical, that scenario is addressed in the preceding sentence: It is easy for a caller to make the master write periodic checkpoints that the caller can use to restart a job on a different cluster on the off-chance that a Master fails.
If a job is NOT critical, the master fails, the caller determines the failure by checking for the abort-condition, and then restarts the job on a new cluster.
It's not a logical fallacy, nor is it a bad design.
For the benefit of anyone reading through, here is the paragraph in question. It follows a detailed section on how the MapReduce library copes with failures in the worker machines.
Re:2 words (Score:5, Informative)
(http://sc.tri-bit.com/ | Last Journal: Sunday July 08, @02:36AM)
The Macy's Door Problem is a great example of a Scalability 101 problem that map_reduce has no way to address. In the early 30s, when most department stores were making big, flashy front entrances to their stores with big glass walls and paths for 12 groups of people at a time, doormen, signage, the whole lot, Macy's elected to take a different approach. They set up a small door with a sign above it. The idea was simple: if there was just the one door, it would be a hassle to get in and out of the store; thus, it would always look like there was a crowd struggling to get in - as if the store was just so popular that they couldn't keep up with customer foot traffic. The idea worked famously well.
In server design, we use that as a metaphor for near-redline usage. There's a problem that's common in naïve server design, where the server will perform just fine right up to 99%. Then, there'll be a tiny usage spike, and it'll hit 101% very briefly. However, the act of queueing and dequeueing withheld users is more expensive than processing a user, meaning that even though the usage drops back to 99%, by the time those 2% overqueue have been processed, a new 3% overqueue has formed, and performance progressively drops through the floor on a load the application ought to be able to handle. I should point out that Apache has this problem, and that until six years ago, so did the Linux TCP stack. It's a much more common scalability flaw than most people expect.
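The feedback loop described above can be reproduced with a toy simulation. This is a sketch under made-up assumptions, not a model of Apache or any real server: capacity is 100 work units per tick, a fresh request costs 1 unit, a request that had to be queued costs `1 + overhead` units when it is finally served, and the backlog is drained before new arrivals. The function name and all the numbers are illustrative.

```python
def simulate(overhead, ticks=12, capacity=100.0, baseline=99, spike=101, spike_tick=1):
    """Return the backlog size after each tick of a toy server model."""
    backlog = 0
    history = []
    cost_queued = 1.0 + overhead  # serving a queued request costs extra
    for tick in range(ticks):
        arrivals = spike if tick == spike_tick else baseline
        budget = capacity
        # Drain as much of the backlog as this tick's budget allows.
        drained = min(backlog, int(budget // cost_queued))
        budget -= drained * cost_queued
        backlog -= drained
        # Serve fresh arrivals with what is left; the rest are queued.
        served = min(arrivals, int(budget))
        backlog += arrivals - served
        history.append(backlog)
    return history
```

With `overhead=0.0` the single over-capacity tick is absorbed on the next tick and the backlog returns to zero. With `overhead=1.5` the same one-tick spike leaves a backlog that grows every tick afterwards, even though arrivals are back under capacity. That is the argument for shedding or rejecting excess load cheaply instead of queueing it expensively.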
Now, that's just one issue in scalability; there are dozens of others. However, map_reduce has literally nothing to say to that problem. Do I need to rattle off others too, or maybe is that good enough? I mean, we have the quadratic growth of client interconnections (Metcalfe's Law, which is easily solved with a hub process); we have making sure that processing workloads is linear growth (that is, O(1) per operation as opposed to O(lg n) or worse), which means no std::map, no std::set, no std::list, only pre-sized std::vector and very careful use of hash tables; we have packet fragmentation throttling; we have making sure that you process all clients in order, to prevent response-time clustering (like when you load an apache site and it sits there for five seconds, so you hit reload and it comes up instantly); all sorts of stuff. Most scalability issues are hard to explain, but maybe that brief list will give you the idea that scalability is a whole lot bigger of an issue than some silly little google library.
Talk to someone who's tried to write an IRC server. Those things hit lots of scalability problems very early on. That community knows the basics very, very well.
Well... (Score:4, Informative)
(http://ofteninspired.com/ | Last Journal: Sunday April 01 2007, @05:49PM)
Zawodny is pretty good...
Re:Well... (Score:4, Informative)
(http://www.newmediatwins.net/)
The unfortunate thing about databases (Score:4, Insightful)
(http://www.int64.org/)
Is that most of them have poor native APIs when it comes to scalability. Some of them have something like
But that is far from optimal. When will they be smart and release an async API that notifies you via callback when complete? This would be very useful for apps that need maximum scalability.
Microsoft's .NET framework is actually a great example of doing the right thing - it has these types of async methods all over the place. But then you have to deal with cross-platform issues and problems inherent with a GC.
It's not that much different for web frameworks either. None that I've tried (RoR, PHP, ASP.NET) have support for async responding - they all expect you to block execution should you want to query a db/file/etc. and just launch boatloads of threads to deal with concurrent users. I guess right now with hardware being cheaper it is easier to support rapid development and scale an app out to multiple servers.
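To make the "notify me via callback when the query completes" idea concrete, here is a minimal sketch using Python's asyncio. The `fake_query` coroutine is a hypothetical stand-in for a non-blocking database call (it just sleeps); no real database driver API is being shown, and `handle_requests` is an invented name.

```python
import asyncio


async def fake_query(sql, delay=0.01):
    """Hypothetical stand-in for a non-blocking database call."""
    await asyncio.sleep(delay)  # simulates waiting on the network
    return f"rows for: {sql}"


async def handle_requests(queries):
    """Start all queries at once and collect results through callbacks."""
    results = []
    tasks = [asyncio.create_task(fake_query(q)) for q in queries]
    for task in tasks:
        # The callback runs on the event loop once that query has finished.
        task.add_done_callback(lambda t: results.append(t.result()))
    await asyncio.gather(*tasks)  # one thread waits for every query
    return results
```

The point of the design is that a single thread can have many queries in flight, instead of parking one blocked thread per concurrent user.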
Re:The unfortunate thing about databases (Score:5, Informative)
(http://www.sergiocarvalho.com/)
Most databases have async APIs. PostgreSQL and MySQL have them in the C client libraries. Most web development languages, though, do not expose this feature in the language API, and for good reason. Async calls can, in rare cases, be useful for maximizing the throughput of the server. Unfortunately, they're more difficult to program, and much more difficult to test.
High scale web applications have thousands of simultaneous clients, so the server will never run out of stuff to do. Async calls have zero gain in terms of server throughput (requests/s). They may reduce a single request's execution time, but the gain does not compensate for the added complexity.
look at Saas development (Score:2)
Here goes... (Score:3, Informative)
1. Check all your SQL and run it through whatever profiler you have. Move things into views or functions if possible.
2. CHECK YOUR INDEXES!! If you have SQL statements running that slow, the likely cause is not having proper indexes for the statements. Either make an index or change your SQL.
3. Consider using caching. For whatever platform you're on there's bound to be decent caching.
That's just the beginning... but the likely cause of most of your problems. We could go on for a month about optimizing.. but in the end if you just stuck with what you have and checked your design for bottlenecks you could get by just fine.
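Point 2 is easy to try at home. The sketch below uses SQLite from the Python standard library only because it needs no server; the `orders` table, its columns, and the index name are invented for the example, and other databases have their own EXPLAIN syntax and output.

```python
import sqlite3


def query_plan(conn, sql):
    """Return SQLite's plan description for a statement as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column is the detail text


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 500, i * 1.0) for i in range(5000)],
)

sql = "SELECT total FROM orders WHERE customer_id = 42"
plan_before = query_plan(conn, sql)  # full table scan
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, sql)   # search using the new index
```

Printing `plan_before` and `plan_after` shows the planner switching from scanning the whole table to searching via `idx_orders_customer`.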
High availability!=high performance (Score:5, Insightful)
(http://dis29500.org/)
One Of Many Possible Futures (Score:1)
I don't code for it directly (Score:4, Interesting)
Many of the code guidelines we have established are to aid in this. Use transactions, don't lock tables, use stored procedures and views for anything complicated, things like that.
I guess my answer is that we delegate it to the server group or the dba group and let them deal with it. I guess this means the admins there are pretty good at what they're doing.
Watch videos/presentations. (Score:3, Informative)
(http://fsfe.org/)
2 Cents (Score:1)
check these out... (Score:4, Informative)
http://highscalability.com/ [highscalability.com]
http://www.allthingsdistributed.com/ [allthingsdistributed.com]
Not sure what you want to test. (Score:4, Insightful)
If you are using Java on Tomcat, BEA, or Websphere, use a product like PerformaSure to see a call tree of where your Java program is spending its time. Sorts out how long each SQL takes too, and shows you what you actually sent. If you have external data sources, like SiteMinder, it will show that too.
If you mean "What happens if we lose a bit of hardware" simulate the whole thing on VMware on a single machine and kill/suspend VMs to see how it reacts.
Most importantly, MAKE SURE YOU MODEL WHAT YOU ARE TESTING. If you are not testing a scaled up version of what users actually do, you have a bad test.
Another option (Score:2)
(http://sc.tri-bit.com/ | Last Journal: Sunday July 08, @02:36AM)
Erlang is difficult to learn from what's on the web; consider starting with Joe's book [amazon.com].
Re:Slightly off topic (Score:4, Informative)
(http://sc.tri-bit.com/ | Last Journal: Sunday July 08, @02:36AM)
Basically, processes are primitives, there's no shared memory, communication is through message passing, fault tolerance is ridiculously simple to put together, it's soft realtime, and since it was originally designed for network stuff, not only is network stuff trivially simple to write, but the syntax (once you get used to it) is basically a godsend. Throw pattern matching a la Prolog on top of that, dust with massive soft-realtime scalability which makes a joke of well-thought-of major applications (that YAWS vs Apache [www.sics.se] image comes to mind,) a soft-realtime clustered database and processes with 300 bytes of overhead and no CPU overhead when inactive (literally none,) and you have a language with such a tremendously different set of tools that any attempt to explain it without the listener actually trying the language is doomed to fall flat on its face.
In Erlang, you can run millions of processes concurrently without problems. (Linux is proud of tens of thousands, and rightfully so.) Having extra processes that are essentially free has a radical impact on design; things like work loops are no longer necessary, since you just spin off a new process. In many ways it's akin to the unix daemon concept, except at the efficiency level you'd expect from a single compiled application. Every client gets a process. Every application feature gets a process. Every subsystem gets a process. Suddenly, applications become trees of processes pitching data back and forth in messages. Suddenly, if one goes down, its owner just restarts it, and everything is kosher.
It's not the greatest thing since sliced bread; there are a lot of things that Erlang isn't good for. However, what you're asking for is Erlang's original problem domain. This is what Erlang is for. I know, it's a pretty big time investment to pick up a new language. Trust me: you will make all your time back in writing far shorter, far more obvious code than you did in learning the language. You can pick up the basics in 20 hours. It's a good gamble.
Developing servers becomes *really* different when you can start thinking of them as swarms.
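For readers who want to see the shape of "if one goes down, its owner just restarts it" without learning Erlang first, here is a rough Python analogue. It is only a sketch: Python threads are far heavier than Erlang processes, the `worker` and `supervise` names are invented, and real Erlang code would use OTP supervisors rather than anything like this.

```python
import queue
import threading


def worker(mailbox, log):
    """Handle messages from a private mailbox until told to stop."""
    while True:
        msg = mailbox.get()
        if msg == "stop":
            return
        if msg == "boom":
            raise RuntimeError("worker crashed")  # simulated fault
        log.append(msg)


def supervise(mailbox, log, max_restarts=3):
    """Run the worker, restarting it after a crash; return the restart count."""
    restarts = 0
    while restarts <= max_restarts:
        crashed = []

        def run():
            try:
                worker(mailbox, log)
            except Exception:
                crashed.append(True)  # report the crash to the supervisor

        thread = threading.Thread(target=run)
        thread.start()
        thread.join()
        if not crashed:
            return restarts  # clean shutdown
        restarts += 1
    return restarts
```

The worker shares nothing with its owner except the mailbox, so a crash loses only the message being handled and the replacement simply carries on with the next one.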
Re:Impressive (Score:4, Informative)
(http://sc.tri-bit.com/ | Last Journal: Sunday July 08, @02:36AM)
The problem is, it's hard to explain why. The overhead of using things like that is tremendous; Erlang's message system is used for quite literally all communication between processes, and a system like Windows Events or MSMQ would reduce Erlang applications to a crawl. Erlang uses an ordered, staged mailbox model, much like Smalltalk's. If you haven't used Smalltalk, then frankly I'm not aware of another parallel.
It's important to understand just how fundamental message passing is in Erlang. Send and receive are fundamental operators, and this is a language that doesn't have for loops, because it thinks they're too high level and unspecific (you can make them yourself; I know, that must sound crazy, but once you get it, it makes perfect sense.) You're about to see a completely different approach. I'm not saying it's the best, or the most flexible, but I really like it, and it genuinely is very different. What Erlang does can relatively straightforwardly be imitated with blocking and callbacks in C, but that involves system threads, and then you start getting locking and imperative behavior back, which is one of the things it's so awesome to get rid of (imagine - no more locks, mutexes, spin controls and so forth. Completely unnecessary, both in workload, debugging and in CPU time spent. It's a huge change.)
Really, it's a whole different approach. You've just got to learn it to get it.
No, I said that. I wrote some code to help explain it to you, though of course slashdot's lameness filters wouldn't pass it, so I put it behind this link [tri-bit.com]. Sorry it's not inline.
Hopefully that will help. Sorry about the lack of whitespace; SlashDot's amazingly lame lameness filter is triggering on clean, readable code.
Re:Slightly off topic (Score:4, Informative)
(http://theravensnest.org/ | Last Journal: Sunday October 07, @07:05AM)
- Very low overhead processes; creating a process in Erlang is only slightly more expensive than making a function call in C.
- Higher order functions.
- Pattern matching everywhere (e.g. function arguments, message receiving, etc). If you want two different behaviours for a function depending on the structure of the data that it is passed (e.g. handlers for two different types of packet, with different headers) you can write two versions of the function with a pattern in the argument.
- Guard clauses on functions, lets you implement design-by-contract and also lets you separate out validation of arguments from the body of a function, giving cleaner code.
- Simple message passing syntax, with pattern matching on message receive for out-of-order retrieval.
- Asynchronous message delivery; very scalable.
- Lists and tuples as basic language primitives.
- Gorgeous binary syntax. I've never seen a language as good as Erlang for manipulating binary data.
- Automatic mapping of Erlang processes to OS threads, allowing as many to run concurrently as you have CPUs.
- Network message delivery, allowing Erlang code with only slight modifications to send messages over the network rather than to local processes (the message sending code is the same, only how you acquire the process reference is different).
There are also a few down sides to the language:
- The preprocessor is even worse than C's, so metaprogramming is hard (and badly needed; patterns like synchronous message sending or futures require a lot of copy-and-pasting).
- Implementing ADTs is ugly (but no worse than C).
- Variables are single static assignment, which is a cheap cop-out for the compiler writer and makes code convoluted at times.
- Message sending and function call syntax is very different for no good reason. You are meant to wrap exposed (public) messages in functions, which makes things even more messy.
- Calling code in other languages is a colossal pain.
- The API is inconsistent (e.g. some modules manipulating ADTs take the ADT as the first argument, some take it as the last).
Erlang is a great language for a lot of tasks, particularly servers, but it's not suited for everything.
Similar Question - Interviews (Score:2, Offtopic)
(http://overhrd.com/)
Sorry this is a bit off-topic; I've just been dying to ask the slashdot community and this seems to be the most appropriate forum for the question.
Languages, Libraries, Abstraction, Audience (Score:2, Interesting)
Languages - I recently saw a multi-million dollar product fail because of performance problems. A large part of it was that they wanted to build performance-critical enterprise server software, but wrote it mostly in a language that emphasized abstraction over performance, and was designed for portability, not performance. The language, of course, was Java. Before I get flamed about Java, the issue was not Java itself and alone, but part of it was indeed using a language not specifically designed for a key project objective: performance. The abstraction, I would argue, did the project more harm than all the other performance issues associated with bytecode, however. Relevant books on this subject are everywhere.
Libraries - Using other people's code (e.g. search software, DB apps, etc.) will always introduce scalability weaknesses and performance costs in expected and unexpected places. Haphazardly choosing what software to get in bed with can come back to bite you later. It is an occupational hazard, and each database product and framework and even hardware configuration has its own pitfalls. Many IT books on enterprise performance, or even whitepapers and academic papers, can provide more information.
Abstraction - There is no free lunch. When you make things easier to code, you typically incur some performance penalty somewhere. In C++, Java, and most other high level languages, the sheer notion of modularity and abstraction eventually adds so much hidden knowledge and code that developers either lose track of what subtle costs everything is incurring, or are suddenly put in a position where they can't go back and rewrite everything. Sometimes it is better to write a clean, low-level API and limit the abstraction eyecandy or it will come back to bite you. On the other hand, sometimes a poor low-level API is worse than a cleanly abstracted high-level API. In practice, however, few complex and performance-oriented systems are architected in very high level languages. I have seen few books on this subject, and it is pure software engineering. Design patterns might help, however.
Audience - Both clientele and developer audiences make a big difference. Give an idiot a hammer with no instructions... and you get the point. Make sure your developers know what they're doing and what the priorities are, and also design your interfaces and manuals in such a way as to keep scalability in mind. Why have a script perform a hundred macro operations when a well-designed API could provide better performance with a single call? This entails both HCI and project development experience.
Wish I could suggest more books, but there's just too many.
Re:Languages, Libraries, Abstraction, Audience (Score:5, Insightful)
Libraries - Bingo, let's throw out nice blocks of tested and working code because it's always better to write it yourself. You pretty much have to use libraries to get things done anymore. And are you suggesting someone should write their own DB software when building a web app? Um, yeah, see if that web app ever gets done.
Abstractions - While most are leaky at some point, abstractions make it easier for you to focus on the architecture (which is what you should be focusing on anyways when building scalable systems).
I see these types of arguments all the time and they rarely make sense. It's like arguing about C vs. Java over 1ms running time difference when if you changed your algorithm you could make seconds of difference or if you changed your architecture you would make minutes of difference...
HA is not load balancing (Score:1)
Draw a high level map of your application and all the server/network resources it uses. Take each one of those components and analyze them for load balancing and fault tolerance. Any single component failure should not affect the overall uptime of the application. Part of a high-availability system is having proper monitoring and notification tools in place. It takes a lot to make a high availability environment work and some of it is not engineering related, but business process related. If your servers are in a data center and a database server goes down, yet your notification system sends an email to a database developer who works 9am to 5pm (maybe on vacation) alerting him/her of the issue... You can see how this can lead to problems. Proper health checks, escalation paths, etc. are all part of making your system work.
My $0.02.
Use your strengths (Score:1)
(http://www.geocities.com/tablizer | Last Journal: Saturday March 15 2003, @01:22PM)
Great Book (Score:1)
(http://petersfamily.org/)
Great tutorial at OSCON about it too.
But the biggest piece of advice is to never guess about where things are slow. Measure them and then fix the slow parts. Don't change a thing until you've benchmarked it.
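As a small illustration of "measure, don't guess", the sketch below times two ways of building the same string with the standard `timeit` module. The two functions and the `benchmark` helper are made up for the example; the absolute numbers will differ on every machine, which is exactly why you run the measurement yourself.

```python
import timeit


def concat_loop(n):
    """Build a string with repeated += in a loop."""
    s = ""
    for i in range(n):
        s += str(i)
    return s


def concat_join(n):
    """Build the same string with a single join."""
    return "".join(str(i) for i in range(n))


def benchmark(func, n=10_000, repeat=5):
    """Best-of-N wall time in seconds for one call to func(n)."""
    return min(timeit.repeat(lambda: func(n), number=1, repeat=repeat))
```

Calling `benchmark(concat_loop)` and `benchmark(concat_join)` before and after a change gives you a number to compare instead of a hunch. For finding which part of a larger program is slow, a profiler such as the standard `cProfile` module is the usual next step.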
Availability Isn't Scalability (Score:3, Insightful)
(http://www.otakubooty.com/)
Is it just me, or is the question hopelessly confused? He's using the term "availability" but it sounds like he's talking about "scalability."
Availability is basically percentage of uptime. You achieve that with hot spares, mirroring, redundancy, etc. Scalability is the ability to perform well as workloads increase. Some things (adding load-balanced webservers to a webserver farm) address both issues, of course, but they're largely separate issues.
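Since availability is a percentage of uptime, the arithmetic is worth doing once. The helpers below are a sketch with invented names; the redundancy formula assumes the copies fail independently, which real hardware sharing a rack, a power feed, or a bug often does not.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability):
    """Expected yearly downtime for an availability given as a fraction."""
    return HOURS_PER_YEAR * (1.0 - availability)


def redundant_availability(availability, copies):
    """Availability of N redundant copies, ASSUMING independent failures."""
    return 1.0 - (1.0 - availability) ** copies
```

For example, 99.9% availability allows about 8.76 hours of downtime a year, and two independent 99% components in a hot-spare arrangement come out at 99.99% under that independence assumption.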
The first thing this poster needs to do is get a firm handle on exactly WHAT he's trying to accomplish, before he can even think about finding resources to help him do it.
Define your thread's purposes (Score:5, Informative)
(http://www.goth.net/~prez)
The biggest problem I have seen is people don't know how to properly define their thread's purpose and requirements, and don't know how to decouple tasks that have in-built latency or avoid thread blocking (and locking).
For example, often in a high-performance network app, you will have some kind of multiplexor (or more than one) for your connections, so you don't have a thread per connection. But people often make the mistake of doing too much in the multiplexor's thread. The multiplexor should ideally only exist to be able to pull data off the socket, chop it up into packets that make sense, and hand it off to some kind of thread pool to do actual processing. Anything more and your multiplexor can't get back to retrieving the next bit of data fast enough.
Similarly, when moving data from a multiplexor to a thread pool, you should be a) moving in bulk (lock the queue once, not once per message), and b) using the Least Loaded pattern - where each thread in the pool has its OWN queue, and you move the entire batch of messages to the thread that is least loaded, and next time the multiplexor has another batch, it will move it to a different thread because IT is least loaded. Assuming your processing takes longer than the data takes to be split into packets (IT SHOULD!), then all your threads will still be busy, but there will be no lock contention between them, and occasional lock contention ONCE when they get a new batch of messages to process.
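The batch hand-off described above might look roughly like this. It is a sketch only: the `Worker` class and `dispatch` function are invented names, queue length stands in for "load", and the worker threads that would call `drain()` in a loop are left out to keep it short.

```python
import threading
from collections import deque


class Worker:
    """One pool member with its OWN queue and its own lock."""

    def __init__(self, name):
        self.name = name
        self.queue = deque()
        self.lock = threading.Lock()

    def pending(self):
        """Current backlog, used here as the measure of load."""
        return len(self.queue)

    def push_batch(self, batch):
        with self.lock:  # one lock acquisition per batch, not per message
            self.queue.extend(batch)

    def drain(self):
        """Take everything queued so far in one locked step."""
        with self.lock:
            items = list(self.queue)
            self.queue.clear()
        return items


def dispatch(workers, batch):
    """Give the whole batch to the least loaded worker and return it."""
    target = min(workers, key=Worker.pending)
    target.push_batch(batch)
    return target
```

Because each worker only ever contends with the multiplexor for its own lock, the workers never block each other.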
Finally, decouple your I/O-bound processes. Make your I/O-bound things (e.g. reporting via socket back to some kind of stats/reporting system) happen in their own thread if they are allowed to block. And make sure your worker threads aren't waiting to give the I/O-bound thread data - in this case, a similar pattern to the above in reverse works well - where each thread PUSHING to the I/O-bound thread has its own queue, and your I/O-bound thread has its own queue, and when it is empty, it just swaps queues with the workers (all of them, or just the next one in a round-robin fashion), so the workers can put data onto those queues at their leisure again, without lock contention with each other.
Never underestimate the value of your memory - if you are doing something like reporting to a stats/reporting server via socket, you should implement some kind of Store and Forward system. This is both for integrity (if your app crashes, you still have the data to send), and so you don't blow your memory. This is also true if you are doing SQL inserts to an off-system database server - spool it out to local disk (local solid-state is even better!) and then just have a thread continually reading from disk and doing the inserts - in a thread not touched by anything else. And make sure your SAF uses *CYCLING FILES* that cycle on max size AND time - you don't want to keep appending to a file that can never be erased - and preferably, make that file a memory mapped file. Similarly, when sending data to your end-users, make sure you can overflow the data to disk so you don't have 3 MB of data sitting in memory for a single client, who happens to be too slow to take it fast enough.
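A store-and-forward spool with cycling files can be sketched as follows. The `Spool` class, the `forward` function, and the file naming are all invented for illustration; this version uses plain files rather than memory-mapped ones, and it does not handle restarts (picking up the sequence number from files already on disk), which a production SAF would need.

```python
import os
import time


class Spool:
    """Append-only spool that cycles to a new file on max size OR max age."""

    def __init__(self, directory, max_bytes=1_000_000, max_age=60.0):
        self.directory = directory
        self.max_bytes = max_bytes
        self.max_age = max_age
        self.seq = 0
        self._open_next()

    def _open_next(self):
        self.seq += 1
        self.path = os.path.join(self.directory, f"spool-{self.seq:08d}.log")
        self.file = open(self.path, "a", encoding="utf-8")
        self.size = 0
        self.opened_at = time.monotonic()

    def append(self, record):
        too_big = self.size >= self.max_bytes
        too_old = time.monotonic() - self.opened_at >= self.max_age
        if too_big or too_old:
            self.file.close()  # the closed file is now safe to forward
            self._open_next()
        data = record + "\n"
        self.file.write(data)
        self.file.flush()
        self.size += len(data.encode("utf-8"))

    def closed_segments(self):
        """Paths of cycled-out files, oldest first."""
        names = sorted(n for n in os.listdir(self.directory) if n.startswith("spool-"))
        paths = [os.path.join(self.directory, n) for n in names]
        return [p for p in paths if p != self.path]


def forward(spool, send):
    """Send every record in closed segments, deleting each file once sent."""
    for path in spool.closed_segments():
        with open(path, encoding="utf-8") as f:
            for line in f:
                send(line.rstrip("\n"))
        os.remove(path)
```

The forwarder thread only ever touches files the writer has finished with, so the two never fight over the same file, and every file is eventually erased.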
And last thing, make sure you have architected things in a way that you can simply start up a new instance on another machine, and both machines can work IN TANDEM, allowing you to just throw hardware at the problem once you reach your hardware's limit. I've personally scaled up an app from about 20 machines to over 650 by ensuring the collector could handle multiple collections - and even making sure I could run multiple collectors side-by-side for when the data is too much for one collector to crunch.
I don't know of any papers on this, but this is my experience writing extremely high
Has anyone actually answered the question? (Score:3, Insightful)
What you are asking about, of course, is enterprise-grade software. This typically involves an n-tier solution with massive attention to the following:
- Redundancy.
- Scalability.
- Manageability.
- Flexibility.
- Securability.
- and about ten other "...abilities."
The classic n-tier solution, from top to bottom is:
- Presentation Tier.
- Business Tier.
- Data Tier.
All of these tiers can be made up of internal tiers. (For example, the Data Tier might have a Database and a Data Access / Caching Tier. Or the Presentation Tier can have a Presentation Logic Tier, then the Presentation GUI, etc.)
Anyway, my point is simply that there is a LOT to learn in each tier. I'd recommend hitting up good ol' Amazon with the search term "enterprise software" and buy a handful of well-received books that look interesting to you (and it will require a handful):
http://www.amazon.com/s/ref=nb_ss_gw/002-8545839-
Hope this helps.
Statelessness (Score:4, Interesting)
(http://dev.lusis.org/ | Last Journal: Monday December 02 2002, @11:39PM)
At least that's my opinion.
HA is an IT thing (Score:2)
You are talking about two things (Score:3, Informative)
You can address reliability and throughput by investing a LOT of money in hardware and using things like round robin load balancing, clusters and mirrored DBMSes, RAID 5 and so on. Then losing a power supply or a disk drive means only degraded performance.
Latency is hard to address. You have to profile and collect good data. You may have to write test tools to measure parts of the system in isolation. You need to account for every millisecond before you can start shaving them off.
Of course you could take a quick look for obvious stuff like poorly designed SQL databases, lack of indexes on joined tables and cgi-bin scripts that require a process to be started each time they are called.
Lots of Options (Score:3, Interesting)
(http://freejavalectures.googlepages.com/)
First of all, excellent question.
Second: ignore the ass above who said dump Java. Modern hotspots have made Java as fast or faster than C/C++. The guy is not up to date.
Third: Since this is a web app, are you using an HttpSession/sendRedirect or just a page-to-page RequestDispatcher/forward? As much as it's a pain in the ass--use the RequestDispatcher.
Fourth: see what your queries are really doing by looking at the explain plan.
Five: add indexes wherever practical.
Six: Use AJAX wherever you can. The response time for an AJAX function is amazing and it is really not that hard to do Basic AJAX [googlepages.com].
Seven: Use JProbe to see where your application is spending its time. You should be bound by the database. Anything else is not appropriate.
Eight: Based on your findings using JProbe, make code changes to, perhaps, put a frequently-used object from the database into a class variable (static).
These are several ideas that you could try. The main thing that experience teaches is this: DON'T optimize and change your code UNTIL you have PROOF of where the slow parts are.
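Point eight, keeping a frequently used object in memory instead of re-reading it from the database, can be sketched like this. The example is in Python rather than Java for brevity; `load_from_db` is a hypothetical stand-in for a slow query, and the `calls` counter exists only to show how often it runs.

```python
from functools import lru_cache

calls = {"db": 0}  # counts how often the "database" is actually hit


def load_from_db(key):
    """Hypothetical slow lookup; a real one would run a query."""
    calls["db"] += 1
    return (key, key.upper())


@lru_cache(maxsize=128)
def get_reference_data(key):
    """Return cached data, going to the database only on a miss."""
    return load_from_db(key)
```

Repeated calls to `get_reference_data("country_codes")` hit the database once. The usual caveat applies: a cache needs an invalidation story (`get_reference_data.cache_clear()` here) if the underlying rows can change, and as the comment says, only add one after profiling shows the lookup is actually a hot spot.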
Re:Lots of Options (Score:4, Insightful)
AJAX can be a performance win. It can also be a nightmare if done poorly. I've seen far too many "web 2.0" applications that flood servers with tons of AJAX calls that return far too little data without a consideration for the cost (TCP connections aren't free, logging requests isn't free).
Response time is also variable. What feels 'amazing' local to the server can be annoyingly slow over an internet connection, especially if the design is particularly interactive.
Couple things I'd suggest:
1) Don't do usability testing on a LAN. An EV-DO card wouldn't be a bad choice for an individual. For a larger scale development environment a secondary internet connection works well.
2) Remember that a page can be dynamic without AJAX. Response time toggling the display property of an object is far more impressive than establishing a new network connection and fetching the data.
3) Isolate AJAX interfaces in their own virtual host so that you can use less verbose logging for API calls. This is a good idea for images as well.
Admit it.... (Score:5, Funny)
(http://www.digitaladdiction.net)
Disaster planning (Score:2)
(http://slashdot.org/)
Like, what if you lose your data center? How good is your backup, is it off site, and do you have a tested plan for restoring the data and system on an interim basis on someone else's system?
Then you can look at some more specific things: what happens if I lose this server, this connection, this router, and specific services such as DNS, email, etc.
The big question ($$$) depends on how much you have to lose. If you can afford a day of downtime, you don't have to spend as much effort on HA as, say, the NYSE or an airline.
Service Availability Forum (Score:1)
(Last Journal: Tuesday November 21 2006, @11:04AM)
High availability and scalability... (Score:2)
As for where to read from a developer perspective (the actual question, which a lot of people replying seem to have missed): there are TONNES.
But split the question in two. Where can I read about HA:
Start here -> http://en.wikipedia.org/wiki/High_availability [wikipedia.org]. There are also many books on the subject (one of the few I happened to like came out of the Sun BluePrints books). The problem with HA is that it is very subjective. You can talk about HA for, say, web applications and just mean session sharing and an intelligent load balancer (ironically, the same thing gives you scalability until you get to the DB), or you can go all the way down to fault-tolerant hardware. Also take a look at the whitepapers that came out of such projects as MOSIX, VAX clusters, Oracle HA (both RAC and Data Guard), IBM WebSphere (there is a lot on the various IBM sites about HA for all their products, and one is bound to be similar to yours in nature), and Sun J2EE. A lot of these go into development aspects as well and give some fantastic concepts and paradigms to follow. But you really need to define the requirements for HA: zero downtime versus 30 minutes of downtime is a HUGE difference! (And that really is just one scenario of many; as a developer you're usually faced with multiple requirements.)
Scalability is a different issue, and usually very application and environment dependent. Again, http://en.wikipedia.org/wiki/Scalability [wikipedia.org] is a good start, but finding general literature is often very hard because it's so dependent on the situation and the application.
Personally, I've found I learn best from example. Look at things such as J2EE application servers (WebSphere, Sun JES, etc.), load balancing, Oracle RAC vs. Data Guard, MySQL NDB vs. replication vs. read-only replica database methods, Apache, PHP, Samba, Windows (most things): pick just about any mainstream application and its documentation will almost certainly cover both HA and scalability at a level helpful to a developer.
If you want to get even more complex, take a look at even cooler forms of scalability and HA that involve things like utility computing (VMware DRS or Egenera, for example). Have a look at their design documents, because they offer even more diverse examples of both subjects at a more abstract layer (i.e. below the OS and entirely on the hardware).
In both cases, it's hard to go from "we weren't thinking about HA or scalability when we built it" to "it's HA and scalable". HA tends to be a little easier because clusters can wrap themselves around almost any situation, but scaling on such systems usually means "I need bigger, faster and more CPUs, more memory and better disk until I can figure out how to code scaling into it".
Always keep in mind though, the law of diminishing returns almost always applies.
How about your local paper? (Score:2)
Hard stuff (Score:1)
(in fact I've considered writing one on and off for years but sadly
the $$ I can make as a consultant on performance, scalability and
availability far exceeds the likely rewards from publishing a book).
Best advice is to look at some open source projects that are used
in highly scalable applications. The other thing I'd say is that
there isn't one true technique -- at this point everyone makes up
their own solution as they go. Often the applications' characteristics
drive the scaling architecture so each application is different.
SEIA (Score:2)
A second or two? (Score:2)
(http://drblast.blogspot.com/)
Prioritize. You have statistics already about typical usage, and typical wait and service times. Fix the problem that exists, instead of the problem that doesn't, but might someday.
Cache what you can (Score:2)
(http://www.noprizes.net/)
1. Client side cache. Most developers shudder when they think of a web page being cached on the browser. However, some pages (like help pages, new articles) do not change in real time and can be stored in the client's browser for a few minutes. Learn how to use the HTTP caching directives to reduce the number of unique pages requested by each user.
2. HTML Output caching. While some parts of a page may change, some elements (such as navigation elements, footers, etc) may not need to be recalculated with each page load. Many app servers let you save sections of the page so that once one user has generated them, you can reuse them again.
3. Database Caching. Frameworks like Hibernate allow you to cache the results of SQL calls so that if the same SQL is reissued (even between different users) the cache reads the result, not the database. Usually you can pick which calls are cached versus which ones have to be live.
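As a rough illustration of point 3, the sketch below caches query results by SQL text and parameters for a fixed time. It is not how Hibernate implements its cache; `QueryCache`, its `ttl_seconds` parameter, and the `help_pages` table are all invented for the example, and SQLite stands in for the real database.

```python
import sqlite3
import time


class QueryCache:
    """Caches SELECT results by (sql, params) for ttl_seconds."""

    def __init__(self, conn, ttl_seconds=60.0, clock=time.monotonic):
        self.conn = conn
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # (sql, params) -> (expires_at, rows)
        self.db_hits = 0    # how many times we actually went to the database

    def query(self, sql, params=()):
        key = (sql, tuple(params))
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: serve from cache
        rows = self.conn.execute(sql, params).fetchall()
        self.db_hits += 1
        self._entries[key] = (now + self.ttl, rows)
        return rows

    def invalidate(self):
        """Call after writes so readers do not see stale rows."""
        self._entries.clear()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE help_pages (id INTEGER PRIMARY KEY, title TEXT)")
    conn.execute("INSERT INTO help_pages (title) VALUES ('Getting started')")
    cache = QueryCache(conn, ttl_seconds=300)
    for _ in range(3):
        cache.query("SELECT title FROM help_pages WHERE id = ?", (1,))
    print(cache.db_hits)  # only the first call reached the database
```

The same "how long may this be stale?" decision drives point 1, where the knob is an HTTP response header such as `Cache-Control: max-age=300` rather than a TTL in application code.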
Good book to read (Score:2)
Deals with the subject of high availability from the IT side rather than programming, but anyone dealing with HA systems needs to understand these issues.
Actual answers (Score:2)
(http://utnapistim.blogspot.com/)
While there were some (very) good points about both scalability and HA, they didn't tell you how to go about learning that; HA and HS are two areas where by reading the books or following case studies, you can understand the basic problems, but not see how you actually go about building a particularly scalable or HA system (because it's usually a system, not a single server).
I've worked in maintenance for a C++ server, where we gave the users guarantees of both low response time under stress and minimal down time.
While working on that, we had to attack the problems that appeared from different angles, and usually the solutions we used were particular to the problems and couldn't be generalized very well.
For example, when different threads used the same input data, it was better in some situations to duplicate the data for each thread instead of using locks on it, so you didn't incur the delays involved in locking. In other places, we used locking as even with locking those places wouldn't create bottlenecks.
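The two options can be sketched side by side as below. This is only an illustration of the structure, written in Python rather than the C++ of the server described; all function names are invented, and in CPython the global interpreter lock hides most of the performance difference that mattered in the original setting.

```python
import copy
import threading


def total_with_lock(shared, lock):
    # Every reader takes the lock, so readers can block each other.
    with lock:
        return sum(shared["values"])


def total_with_private_copy(private):
    # The thread owns this copy outright; no lock is needed.
    return sum(private["values"])


def run(workers=4):
    data = {"values": list(range(1000))}
    lock = threading.Lock()
    results = {}

    def locked_worker(i):
        results[("locked", i)] = total_with_lock(data, lock)

    def copying_worker(i, private):
        results[("copied", i)] = total_with_private_copy(private)

    threads = []
    for i in range(workers):
        threads.append(threading.Thread(target=locked_worker, args=(i,)))
        # The copy is made once, before the thread starts.
        threads.append(threading.Thread(target=copying_worker,
                                        args=(i, copy.deepcopy(data))))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The trade is memory and copy time up front against contention later, which is why it paid off only in some places.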
In some value objects (large data collections), we used the copy-on-write pattern, sharing object values within the same thread (to minimize allocations and memory fragmentation) and making deep copies when the data needed to be sent to other threads.
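A minimal sketch of that copy-on-write idea follows, assuming a list-like value object. `CowList` and its methods are hypothetical names, and a production version would track owners with a reference count instead of the single flag used here.

```python
import copy


class CowList:
    """List wrapper that shares its storage until someone writes to it."""

    def __init__(self, items=None):
        self._items = list(items) if items is not None else []
        self._shared = False  # True while another CowList may alias _items

    def share(self):
        """Cheap same-thread copy: both objects alias one underlying list."""
        other = CowList.__new__(CowList)
        other._items = self._items
        other._shared = True
        self._shared = True
        return other

    def detached_copy(self):
        """Deep copy, intended for handing data to another thread."""
        return CowList(copy.deepcopy(self._items))

    def append(self, value):
        if self._shared:
            # First write after sharing: take a private (shallow) copy.
            # The flag is conservative, so the last remaining owner may
            # copy once more than strictly necessary.
            self._items = list(self._items)
            self._shared = False
        self._items.append(value)

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)
```

Readers that never write pay nothing beyond the shared reference, which is where the savings in allocations come from.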
We split the server into multiple servers handling the different parts of processing (so we could throw hardware at the computation-heavy parts of the application logic) and used load balancing.
We also used in-memory databases in one case, for storing some of the information.
Some IO operations (like logging) ran on different threads with messages passed to them (having multiple threads for logging for example).
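The message-passing arrangement for logging might look roughly like this. It is a sketch with a single writer thread (the server described used several), and `AsyncLogger` and `sink` are names invented for the example.

```python
import queue
import threading


class AsyncLogger:
    """Worker threads enqueue messages; one background thread does the I/O."""

    _STOP = object()  # sentinel telling the writer thread to exit

    def __init__(self, sink):
        self._queue = queue.Queue()
        self._sink = sink  # any callable taking one string, e.g. a file's write
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def log(self, message):
        # Returns immediately; the caller never waits on the slow sink.
        self._queue.put(message)

    def _drain(self):
        while True:
            item = self._queue.get()
            if item is self._STOP:
                break
            self._sink(item)

    def close(self):
        """Flush everything queued so far, then stop the writer thread."""
        self._queue.put(self._STOP)
        self._thread.join()
```

In Python specifically, the standard library's `logging.handlers.QueueHandler` and `QueueListener` provide the same pattern ready-made.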
For HA, we had a complete monitoring system, with processes listening for heartbeats and gathering load statistics for the different modules, and every computation part had a backup server ready to take over at certain loads.
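The heartbeat half of such a monitor can be sketched as follows. This covers only liveness, not the load statistics; `HeartbeatMonitor`, the timeout value, and the `on_failover` callback are assumptions made for illustration, not details of the system described.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per module and reports the silent ones."""

    def __init__(self, timeout_seconds=5.0, clock=time.monotonic):
        self.timeout = timeout_seconds
        self.clock = clock
        self._last_seen = {}  # module name -> time of last heartbeat

    def beat(self, module):
        self._last_seen[module] = self.clock()

    def dead_modules(self):
        now = self.clock()
        return sorted(name for name, seen in self._last_seen.items()
                      if now - seen > self.timeout)

    def check(self, on_failover):
        """Invoke on_failover(name) for each module that missed its deadline."""
        dead = self.dead_modules()
        for name in dead:
            on_failover(name)
        return dead
```

Each module would call `beat()` periodically, and a supervisor would call `check()` on a timer with a callback that promotes the backup server.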
My point with these examples is that none of these solutions can be applied ad hoc; each possible bottleneck has to be studied separately, and the solutions to avoid it can vary depending on a lot of factors (while you can get ideas, you won't learn scalability or HA from a book).
The best solution I can think of for learning about HS and HA is working on a product that needs them and getting direct experience in improving the scalability and availability.
An article I wrote last year may help you (Score:1)
(http://1wt.eu/)
An article I wrote last year about application scaling using load balancing may help you. It will not solve your problems but will certainly help you with the concepts, best practises and traps to avoid, which is a good starting point.
You can read it online here: http://1wt.eu/articles/2006_lb/ [1wt.eu] or you can download it as a PDF here:
http://www.exceliance.fr/en/ART-2006-making%20app
Also, what you need is to run benchmarks frequently throughout the development cycle of your applications. Using
traffic generators, you will simulate a lot of users and see how your application/database behaves. And believe me, it
never breaks where you expected it to! On the first run, it's almost always caused by too much memory usage.
Then you optimize it (decrappify it in fact), then you break on concurrency (threads, processes, file descriptors,
sockets),
and when you finally saturate the frontend servers with 1% of your target load, you realize that you have to rewrite
everything using a faster language. But at least, you will be able to save time by starting on a few cheap servers, for
the time needed to translate the code for version 2.
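A bare-bones traffic generator in that spirit could look like this. It is a sketch, not a replacement for a real load-testing tool: `run_load` and its parameters are invented names, `target` is any zero-argument callable standing in for one request, and both counts are assumed to be at least 1.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load(target, concurrent_users=50, requests_per_user=20):
    """Call target() from many threads and collect per-call latencies."""

    def one_user(user_id):
        latencies = []
        for _ in range(requests_per_user):
            start = time.perf_counter()
            target()
            latencies.append(time.perf_counter() - start)
        return latencies

    with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        per_user = list(pool.map(one_user, range(concurrent_users)))

    # Rough nearest-rank statistics; good enough to spot a cliff.
    all_latencies = sorted(x for user in per_user for x in user)
    p95_index = max(0, int(len(all_latencies) * 0.95) - 1)
    return {
        "requests": len(all_latencies),
        "median": all_latencies[len(all_latencies) // 2],
        "p95": all_latencies[p95_index],
        "max": all_latencies[-1],
    }
```

For an actual test, `target` might wrap something like `urllib.request.urlopen` against a staging URL; rerunning with increasing `concurrent_users` shows where latency starts to climb.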
You talked about 60000 users. If it's a population of 60000 users, it's not much. If it's 60000 concurrent users, it's
a huge load and you will have to educate yourself in network and operating system tuning, because tuning the app alone
will not be enough.
Good luck!
Willy
Tolerance of delays? (Score:2)
Actually, being forced to use your app doesn't make them more tolerant of delays. It makes *you* more tolerant because your users can't go away. They still hate the delays.
Really only 2 things to think about at the base (Score:3, Informative)
The first is handled by just about any messaging/queue system. J2EE has had one for ages, Microsoft has MSMQ that recently (better late than never...
Then horizontal scaling. Why horizontal? Because just taking a random new box and plugging it into the network is easier and faster (especially in case of emergency) than having to take servers down to upgrade them (vertical scaling). It also adds to redundancy, so the more servers you add to your farm, the less likely your system will go down. There are documents on it all over (Microsoft Patterns & Practices has some on their web sites, non-MS documentation is hard to miss if you google for it, and many third parties will be more than happy to spam you with their solutions), but it really just comes down to: "Use an RDBMS that handles clustering and table partitioning, use distributed caching solutions, push as much stuff as possible to the client side (only stuff that doesn't need to be trusted!), and make sure that nothing ever depends on resources that can only be accessed from a single machine (think local flat files, in-process session management, etc.)".
With that, no matter what goes down, things go on purring, and if someone ever bitches that the system is slow, you just buy a $1000 server, stick a standard pre-made image on the disk, plug it in, and have fun.
Oh, and fast network switches are a must
Scale Up ... & Down (Score:2)
(http://alpha.edtinger.at/)
The users (Score:2)
I realize that this is just background to the question, but it strikes me as an odd thing to say:
I have been the victim of numerous really crappy internal applications. It makes me mad, because it shows a lack of respect for my work. What tends to happen in practice, if the users are technically minded, is that they don't use the applications as intended, and invent some primitive system on the side. Excel sheets, and so on.
And the people who support the real system are usually happily unaware of this. They live in a fantasy land, where things work fairly well and the users are pleased.
Re:Give it up! (Score:1)
Re:scalability (Score:1)