
Learning High-Availability Server-Side Development?

fmoidu writes "I am a developer for a mid-size company, and I work primarily on internal applications. The users of our apps are business professionals who are forced to use them, so they are more tolerant of access times being a second or two slower than they could be. Our apps' total potential user base is about 60,000 people, although we normally experience only 60-90 concurrent users during peak usage. The type of work being done is generally straightforward reads or updates that typically hit two or three DB tables per transaction. So this isn't a complicated site, and the usage is pretty low. The types of problems we address are typically related to maintainability and dealing with fickle users. From what I have read in industry papers and from conversations with friends, the apps I have worked on just don't address scaling issues. Our maximum load during typical usage is far below the maximum potential load of the system, so we never spend time considering what would happen under extreme load. What papers or projects are available for an engineer who wants to learn to work in a high-availability environment but isn't in one?"
  • by dominux ( 731134 ) on Thursday August 23, 2007 @11:17AM (#20330823) Homepage
    Start by being clear about what you want to achieve. If it is HA (high availability), then you want to look at clustering, failover, network topology, DR plans, etc. If it is HP (high performance), then look for the bottlenecks in the process; don't waste time shaving nanoseconds off something that wasn't bothering anyone. At the infrastructure level you might think about caching some things, or putting a reverse proxy in front of a cluster of responding servers. In general, disk reads are expensive but easily cached; disk writes are very expensive, and normally you don't want to cache them, at least not for very long. Network bandwidth may be fast or slow, and latency might be an issue if you have a chatty application.
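    As a rough sketch of that read-caching idea in Java (loadFromDisk and writeToDisk are made-up stand-ins, not any real API):

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    // Read-through cache: expensive disk reads are served from memory after
    // the first access; writes go straight through and invalidate the entry,
    // so cached data never lingers past the next update.
    public class ReadThroughCache {
        private final ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();

        String read(String key) {
            // computeIfAbsent only calls loadFromDisk on a cache miss
            return cache.computeIfAbsent(key, k -> loadFromDisk(k));
        }

        void write(String key, String value) {
            writeToDisk(key, value); // don't cache the write...
            cache.remove(key);       // ...just invalidate so the next read reloads

        }

        private String loadFromDisk(String key) { return "value-for-" + key; } // stand-in
        private void writeToDisk(String key, String value) { /* stand-in */ }

        public static void main(String[] args) {
            ReadThroughCache c = new ReadThroughCache();
            System.out.println(c.read("a")); // loads from "disk"
            System.out.println(c.read("a")); // served from cache
        }
    }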
  • by funwithBSD ( 245349 ) on Thursday August 23, 2007 @11:25AM (#20330941)
    Stress testing? Use LoadRunner or some other tool to simulate users.

    If you are using Java on Tomcat, BEA, or WebSphere, use a product like PerformaSure to see a call tree of where your Java program is spending its time. It also breaks down how long each SQL statement takes and shows you what you actually sent. If you have external data sources, like SiteMinder, it will show those too.

    If you mean "What happens if we lose a bit of hardware" simulate the whole thing on VMware on a single machine and kill/suspend VMs to see how it reacts.

    Most importantly, MAKE SURE YOU MODEL WHAT YOU ARE TESTING. If you are not testing a scaled-up version of what users actually do, you have a bad test.
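    If you don't have a LoadRunner license, even a crude harness will show the shape of the curve. A bare-bones sketch in Java (the URL and user counts are placeholders, and a real tool does far more):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.LongAdder;

    // Simulate N concurrent users each issuing R requests and report the
    // average latency, to see how response time degrades as load grows.
    public class MiniLoadTest {
        public static void main(String[] args) throws Exception {
            int users = 50, requestsPerUser = 20;
            String url = "http://localhost:8080/app"; // placeholder target
            HttpClient client = HttpClient.newHttpClient();
            LongAdder totalNanos = new LongAdder();
            LongAdder completed = new LongAdder();

            ExecutorService pool = Executors.newFixedThreadPool(users);
            for (int u = 0; u < users; u++) {
                pool.submit(() -> {
                    for (int r = 0; r < requestsPerUser; r++) {
                        try {
                            HttpRequest req = HttpRequest.newBuilder(URI.create(url)).build();
                            long start = System.nanoTime();
                            client.send(req, HttpResponse.BodyHandlers.ofString());
                            totalNanos.add(System.nanoTime() - start);
                            completed.increment();
                        } catch (Exception e) {
                            // a real test would count and classify errors
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(5, TimeUnit.MINUTES);
            System.out.printf("avg latency: %.1f ms over %d requests%n",
                    totalNanos.sum() / 1e6 / Math.max(1, completed.sum()), completed.sum());
        }
    }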
  • Re:2 words (Score:2, Insightful)

    by andr0meda ( 167375 ) on Thursday August 23, 2007 @11:35AM (#20331037) Journal

    Well, it's easy to say something isn't 'A', and then not spend a word on what IS 'A'.

    If I'm so wrong on scalability maybe you can explain it here to me. Thanks.
  • Slightly off topic (Score:3, Insightful)

    by Gazzonyx ( 982402 ) <scott.lovenberg@gm a i l.com> on Thursday August 23, 2007 @11:41AM (#20331117)
    I keep hearing about Erlang being the greatest thing since sliced bread... unfortunately, I don't have time to look into it too much. Could someone give me an 'elevator' pitch on what makes it so great for threading? Is it encapsulation-based objects, a thread base class, or what? How does it handle cache coherency on SMP?
  • by PhrostyMcByte ( 589271 ) <phrosty@gmail.com> on Thursday August 23, 2007 @11:41AM (#20331119) Homepage

    The problem with most databases is that they have poor native APIs when it comes to scalability. Some of them have something like

    handle = query("SELECT...");  /* start the query without blocking */
    /* do something else while the query runs */
    result = wait(handle);        /* block here until the result arrives */

    But that is far from optimal. When will they get smart and release an async API that notifies you via a callback when the query completes? This would be very useful for apps that need maximum scalability.

    Microsoft's .NET framework is actually a great example of doing the right thing - it has these types of async methods all over the place. But then you have to deal with cross-platform issues and problems inherent with a GC.

    It's not much different for web frameworks, either. None that I've tried (RoR, PHP, ASP.NET) support async responses - they all expect you to block execution if you want to query a db/file/etc., and just launch boatloads of threads to deal with concurrent users. I guess right now, with hardware being cheap, it is easier to support rapid development and scale an app out to multiple servers.
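    To make the callback style concrete, here is roughly what such an API could look like, sketched on Java's CompletableFuture; queryDatabase is an invented stand-in, not any real driver call:

    import java.util.concurrent.CompletableFuture;

    // Callback-style async query: the caller registers a completion callback
    // instead of blocking on wait(handle), so the thread is free meanwhile.
    public class AsyncQueryDemo {
        static CompletableFuture<String> queryAsync(String sql) {
            return CompletableFuture.supplyAsync(() -> queryDatabase(sql));
        }

        public static void main(String[] args) {
            CompletableFuture<String> f = queryAsync("SELECT 1")
                    .whenComplete((rows, err) -> {      // notified via callback when done
                        if (err != null) System.err.println("query failed: " + err);
                        else System.out.println("got: " + rows);
                    });
            System.out.println("free to do other work here...");
            f.join(); // only so the demo doesn't exit before the callback fires
        }

        private static String queryDatabase(String sql) { return "one row"; } // fake result
    }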

  • by John_Booty ( 149925 ) <johnbooty@booty p r o j e c t . o rg> on Thursday August 23, 2007 @11:44AM (#20331145) Homepage
    From what I have read in industry papers and from conversations with friends, the apps I have worked on just don't address scaling issues. Our maximum load during typical usage is far below the maximum potential load of the system, so we never spend time considering what would happen when there is an extreme load on the system.

    Is it just me, or is the question hopelessly confused? He's using the term "availability" but it sounds like he's talking about "scalability."

    Availability is basically percentage of uptime. You achieve that with hot spares, mirroring, redundancy, etc. Scalability is the ability to perform well as workloads increase. Some things (adding load-balanced webservers to a webserver farm) address both issues, of course, but they're largely separate issues.
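    To put numbers on the availability side (my back-of-the-envelope, not the parent's): availability targets are usually quoted in "nines," and each extra nine shrinks the annual downtime budget by a factor of ten.

    // Downtime budget implied by an availability target.
    public class Nines {
        public static void main(String[] args) {
            double minutesPerYear = 365.25 * 24 * 60;
            for (double a : new double[] {0.99, 0.999, 0.9999}) {
                System.out.printf("%.2f%% available -> %.0f minutes of downtime/year%n",
                        a * 100, (1 - a) * minutesPerYear);
            }
        }
    }

    That prints roughly 5260, 526, and 53 minutes per year for two, three, and four nines respectively.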

    The first thing this poster needs to do is get a firm handle on exactly WHAT he's trying to accomplish, before he can even think about finding resources to help him do it.
  • by smackenzie ( 912024 ) on Thursday August 23, 2007 @11:50AM (#20331235)
    I see a lot of recommendations for various technologies, software packages, etc. -- but I don't think this addresses the original question.

    What you are asking about, of course, is enterprise-grade software. This typically involves an n-tier solution with massive attention to the following:

    - Redundancy.
    - Scalability.
    - Manageability.
    - Flexibility.
    - Securability.
    - and about ten other "...abilities."

    The classic n-tier solution, from top to bottom, is:

    - Presentation Tier.
    - Business Tier.
    - Data Tier.

    All of these tiers can be made up of internal tiers. (For example, the Data Tier might have a Database and a Data Access / Caching Tier. Or the Presentation Tier can have a Presentation Logic Tier, then the Presentation GUI, etc.)
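    A toy sketch of the tier boundaries in Java (all names invented for illustration); the point is only that each tier talks to the one below it through an interface, so any tier can be scaled out or swapped without touching the others:

    interface DataTier {
        String fetchRecord(int id);
    }

    interface BusinessTier {
        String describeRecord(int id);
    }

    class SimpleDataTier implements DataTier {
        public String fetchRecord(int id) { return "record-" + id; } // would hit the DB
    }

    class SimpleBusinessTier implements BusinessTier {
        private final DataTier data;
        SimpleBusinessTier(DataTier data) { this.data = data; }
        public String describeRecord(int id) {
            return "Report for " + data.fetchRecord(id); // business rules live here
        }
    }

    // Presentation tier: renders whatever the business tier hands back.
    public class PresentationDemo {
        public static void main(String[] args) {
            BusinessTier business = new SimpleBusinessTier(new SimpleDataTier());
            System.out.println(business.describeRecord(42));
        }
    }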

    Anyway, my point is simply that there is a LOT to learn in each tier. I'd recommend hitting up good ol' Amazon with the search term "enterprise software" and buying a handful of well-received books that look interesting to you (and it will require a handful):

    http://www.amazon.com/s/ref=nb_ss_gw/002-8545839-8925669?initialSearch=1&url=search-alias%3Daps&field-keywords=enterprise+software+ [amazon.com]

    Hope this helps.
  • by EastCoastSurfer ( 310758 ) on Thursday August 23, 2007 @11:56AM (#20331339)
    Language - Doesn't matter much if you know how to design a scalable system. Some languages, like Erlang, force you into a more scalable design, but even then it's still easy to mess up. Unless this multi-million dollar project you're talking about was an embedded system, I would bet the language used was the smallest reason for bad performance. Although it is fun to bash Java whenever the chance arises.

    Libraries - Bingo, let's throw out nice blocks of tested and working code because it's always better to write it yourself. You pretty much have to use libraries to get things done anymore. And are you suggesting someone should write their own DB software when building a web app? Um, yeah, see how far that web app ever gets.

    Abstractions - While most are leaky at some point, abstractions make it easier for you to focus on the architecture (which is what you should be focusing on anyway when building scalable systems).

    I see these types of arguments all the time and they rarely make sense. It's like arguing about C vs. Java over a 1 ms difference in running time when changing your algorithm could make seconds of difference, and changing your architecture could make minutes of difference...
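    A contrived Java illustration of the algorithm point (sizes are arbitrary; run it yourself for real numbers): the same duplicate-counting task done pairwise versus with a hash set.

    import java.util.HashSet;
    import java.util.Set;

    // Same task, two algorithms: counting duplicates by pairwise scan (O(n^2))
    // versus a hash set (O(n)). The algorithmic change dwarfs any
    // language-level micro-optimization.
    public class AlgoVsMicro {
        public static void main(String[] args) {
            int n = 30_000;
            int[] xs = new int[n];
            for (int i = 0; i < n; i++) xs[i] = i % (n / 2); // every value appears twice

            long t0 = System.nanoTime();
            int slow = 0;
            for (int i = 0; i < n; i++)
                for (int j = i + 1; j < n; j++)
                    if (xs[i] == xs[j]) slow++;
            long t1 = System.nanoTime();

            Set<Integer> seen = new HashSet<>();
            int fast = 0;
            for (int x : xs) if (!seen.add(x)) fast++;
            long t2 = System.nanoTime();

            System.out.printf("O(n^2): %d dups in %d ms; O(n): %d dups in %d ms%n",
                    slow, (t1 - t0) / 1_000_000, fast, (t2 - t1) / 1_000_000);
        }
    }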
  • by dennypayne ( 908203 ) on Thursday August 23, 2007 @01:31PM (#20332659) Homepage

    The first two sentences here are one of my biggest pet peeves... if application developers don't start becoming more network-aware, and network engineers more application-aware, everyone is dead meat. Hint: there are very few applications these days that aren't accessed over the network. I see so many "silos" like this when I'm consulting. The network guys and the app guys have no idea what the "other side" does. If they actually worked together on these types of issues instead of talking past each other, something might actually get done.


    So yes, developers absolutely need to worry about HA. It makes a difference whether your app is stateless or not, how the app is health-checked from the load balancer, how chatty the app is on the network. Etc., etc...
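    On the health-check point, a minimal sketch using the JDK's built-in HTTP server (databaseReachable is a placeholder for whatever dependency check makes sense): the balancer probes the URL and pulls the node out of rotation on a 503.

    import com.sun.net.httpserver.HttpServer;
    import java.io.OutputStream;
    import java.net.InetSocketAddress;

    // Health-check endpoint for a load balancer to probe: 200 when the app's
    // dependencies look good, 503 otherwise.
    public class HealthCheck {
        public static void main(String[] args) throws Exception {
            HttpServer server = HttpServer.create(new InetSocketAddress(8081), 0);
            server.createContext("/health", exchange -> {
                boolean healthy = databaseReachable();
                byte[] body = (healthy ? "OK" : "FAIL").getBytes();
                exchange.sendResponseHeaders(healthy ? 200 : 503, body.length);
                try (OutputStream os = exchange.getResponseBody()) { os.write(body); }
            });
            server.start();
        }

        private static boolean databaseReachable() { return true; } // stand-in check
    }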


    Denny
  • No fallacy.... (Score:5, Insightful)

    by encoderer ( 1060616 ) on Thursday August 23, 2007 @02:20PM (#20333479)

    Huh? So, because there is only one master it is unlikely to fail?

    Yes. If you take that sentence in context, the answer is "Yes." Compared to the likelihood that one of the thousands of worker machines will fail during any given job, it IS unlikely that the single master will fail. Moreover, while any given job may take hours to run, it also seems that many take just moments. Furthermore, just because a job may take hours to run doesn't mean it's CRITICAL that it be completed in hours. And at times when a job IS critical, that scenario is addressed in the preceding sentence: it is easy for a caller to make the master write periodic checkpoints that the caller can use to restart a job on a different cluster on the off chance that the master fails.

    If a job is NOT critical, the master fails, the caller determines the failure by checking for the abort-condition, and then restarts the job on a new cluster.

    It's not a logical fallacy, nor is it a bad design.

    For the benefit of anyone reading through, here is the paragraph in question. It follows a detailed section on how the MapReduce library copes with failures in the worker machines.

    It is easy to make the master write periodic checkpoints of the master data structures described above. If the master task dies, a new copy can be started from the last checkpointed state. However, given that there is only a single master, its failure is unlikely; therefore our current implementation aborts the MapReduce computation if the master fails. Clients can check for this condition and retry the MapReduce operation if they desire.
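    A sketch of the client-side retry the paper describes, with invented names (runMapReduce, Checkpoint, latestCheckpoint); the real library's API is different, this is just the shape of it:

    // If the computation aborts because the master died, restart it from the
    // last checkpoint instead of from scratch.
    public class RetryingClient {
        static class MasterFailedException extends Exception {}
        static class Checkpoint {
            final int tasksDone;
            Checkpoint(int tasksDone) { this.tasksDone = tasksDone; }
        }

        public static void main(String[] args) {
            Checkpoint checkpoint = new Checkpoint(0);
            for (int attempt = 1; ; attempt++) {
                try {
                    runMapReduce(checkpoint); // the master checkpoints its state as it goes
                    System.out.println("job finished on attempt " + attempt);
                    break;
                } catch (MasterFailedException e) {
                    checkpoint = latestCheckpoint(); // resume, don't start over
                    System.out.println("master died; restarting from checkpoint");
                }
            }
        }

        static void runMapReduce(Checkpoint c) throws MasterFailedException {} // stand-in
        static Checkpoint latestCheckpoint() { return new Checkpoint(0); }     // stand-in
    }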
  • Re:2 words (Score:3, Insightful)

    by WhiplashII ( 542766 ) on Thursday August 23, 2007 @02:43PM (#20333773) Homepage Journal
    I'm afraid you are wrong. Here is how to describe the situation in statistical terms:

    Probability of failure with a single drive: 1 in 1000

    Probability of failure with ten drives: equals the probability of drive 1 failing, or drive 2 failing, or drive 3 failing, ... or drive 10 failing.

    The easier way to solve that equation is to reverse it - it equals one minus the probability of drive 1 not failing, and drive 2 not failing, and drive 3 not failing, ... and drive 10 not failing. (Think about it: in order for the RAID 0 to stay up, all the drives have to be good - calculate that probability.)

    Mathematically, the probability that all ten drives survive is 0.999^10, or about 0.990, so the probability of at least one failure is about 1/100. The failure rate rose from 1/1000 to about 1/100, just as you would expect.
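    The same arithmetic in code, generalized to n drives (numbers match the example above):

    // P(at least one of n drives fails) = 1 - (1 - p)^n,
    // computed via the survival probability as described above.
    public class DriveFailure {
        public static void main(String[] args) {
            double p = 1.0 / 1000; // single-drive failure probability
            for (int n : new int[] {1, 10}) {
                double survivesAll = Math.pow(1 - p, n);
                System.out.printf("n=%2d: P(failure) = %.5f%n", n, 1 - survivesAll);
            }
        }
    }

    That prints 0.00100 for one drive and about 0.00996 for ten - roughly 1/100.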
  • Re:Lots of Options (Score:4, Insightful)

    by Wdomburg ( 141264 ) on Thursday August 23, 2007 @03:22PM (#20334349)
    Six: Use AJAX wherever you can. The response time for an AJAX function is amazing, and it is really not that hard to do basic AJAX.

    AJAX can be a performance win. It can also be a nightmare if done poorly. I've seen far too many "web 2.0" applications that flood servers with tons of AJAX calls that return far too little data, without consideration for the cost (TCP connections aren't free; logging requests isn't free).

    Response time is also variable. What feels 'amazing' local to the server can be annoyingly slow over an internet connection, especially if the design is particularly interactive.

    A couple of things I'd suggest:

    1) Don't do usability testing on a LAN. An EV-DO card wouldn't be a bad choice for an individual; for a larger-scale development environment, a secondary internet connection works well.

    2) Remember that a page can be dynamic without AJAX. The response time for toggling the display property of an element is far better than that of establishing a new network connection and fetching the data.

    3) Isolate AJAX interfaces in their own virtual host so that you can use less verbose logging for API calls. This is a good idea for images as well.
  • by fifedrum ( 611338 ) on Thursday August 23, 2007 @04:52PM (#20335501) Journal
    clustering, word up (if we're allowed to use old catch phrases like that)

    Disk reads and writes were the least of our troubles once we scaled much beyond a small-enterprise volume of data. The sheer number of moving parts in our environment (not just physical parts, but bits flowing, too) killed productivity, and we wound up completely unable to cache anything.

    There's simply too much data flowing back and forth for caching to pay for itself, and too often a hard drive fails, adding even more background noise.

    Then God forbid you introduce a load balancer into the equation, because even if they are redundant, they aren't infallible, and they're still a network choke point - one you will saturate, guaranteed, before too long.

    With horizontally scaled architectures, pretty soon you wind up with hundreds or thousands of disk drives (failures on a daily basis) and hundreds of power supplies, CPU coolers, sticks of RAM, etc. Successfully scaling horizontally means overbuilding the living shit out of everything involved, to the point where you can handle a significant percentage of your environment going offline and THEN be able to rebuild the missing parts with hot spares without interrupting your processing. On top of that, your servers have to be stateless, with users tied to nothing but the storage backend, which is so insanely overbuilt as to never lose a bit of data.

    Or you pay through the nose for a real HA/HPC environment like a big iron box with a PB of DASD and 10 engineers who sleep with the box every night.

    Anything less than that is a DR waiting to happen or, as we like to call it, compromise.

    The problem with most architects building "scalable" environments is that they don't build scalable environments; they build affordable environments.
  • Re:2 words (Score:3, Insightful)

    by Hemogoblin ( 982564 ) on Thursday August 23, 2007 @09:20PM (#20338635)
    1/100 is larger than 1/1000. Therefore, it rose from 1/1000 to 1/100. That is, 10 single points of failure is more risky than 1 point. I believe that was what you were arguing anyway, so no worries.
  • by EastCoastSurfer ( 310758 ) on Thursday August 23, 2007 @11:40PM (#20339731)
    Haha... actually, I do dislike Java and try to avoid it when I can. What I was really ripping on originally is how poor program performance is always treated as a language issue first. Picking the right language for the job is of course important, but program design, algorithms used, and overall architecture will always be much bigger factors in a program's eventual performance.
  • Re:Lots of Options (Score:2, Insightful)

    by smellotron ( 1039250 ) on Friday August 24, 2007 @01:00AM (#20340189)

    Then why are all these companies standardizing on Java? You think you're brighter than all these corporations? I doubt it. C++ is dying and I say good riddance.

    Most companies standardize on Java not because it gives them a higher "maximum power", but because it gives them a higher "minimum power". The language as a whole is designed (whether intentionally or not) to reduce the damage done by less-than-stellar programmers, which is more important to a business than increasing the power of the superstar programmers.

    Plus, you don't think Sun puts out all of the Java marketing buzz for nothing, do you? Java is sponsored by a corporation, and connections like that make a big difference to managers. While it may be a good thing for many companies to standardize on Java, they're not doing it because Java has an ounce of technical superiority.

    I haven't done any personal testing, but I'm under the impression that, speed-wise, C/C++ stack allocation beats Java allocation (everything is on the heap there), which in turn beats C/C++ heap allocation. Since realtime Java code seems to encourage disabling garbage collection, that seems consistent. So yeah, you might be able to squeeze out more speed by avoiding the heap in some cases, but I'm not sure. Maybe someone in the hard-real-time world can enlighten us on that.
