High Availability Solutions for Databases?
An anonymous reader asks: "What would be the best high availability solution for databases? I don't have enough money to afford Oracle RAC or any architecture that requires an expensive SAN. What about open source solutions? MySQL cluster seems to be more master/slave and you can lose data when the master dies. What about this Sequoia project that seems good for PostgreSQL and other databases? Has anyone tried it? What HA solution do you use for your database?"
Whatever else they tell you (Score:2, Insightful)
What are you doing? It's important. (Score:5, Insightful)
For example, we run a site with data from a thousand-odd different data sources, with each source getting updated every hour or so. We do it by parsing the data into static pages. When we receive a datum, we rebuild the pages that depend on it.
We have another site that runs off an Oracle DB. The static page site runs about 90x faster and is basically in memory (disk access is nil). Now take into account that we can (and do) replicate the static page solution with zero load, and we get to a solution that is literally 900x faster.
Now folks are thinking 'oh, the horror!' Well... tough! There is no substitute for thinking about your data and how it flows. A DB is not a given, but a (potentially wrong) answer to a question after you have done some analysis.
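To make the rebuild-on-update idea above concrete, here is a minimal sketch in Python. It is not the poster's actual code: the `StaticSite` class, its method names, and the page and source names are all hypothetical, and pages are kept in a dict rather than written to disk. The point is only that each incoming datum triggers a rebuild of the pages that depend on it and nothing else.

```python
from collections import defaultdict


class StaticSite:
    """Hypothetical in-memory static-page builder (illustration only)."""

    def __init__(self):
        self.data = {}                      # source id -> latest datum
        self.dependents = defaultdict(set)  # source id -> names of pages using it
        self.templates = {}                 # page name -> (render function, source ids)
        self.pages = {}                     # page name -> rendered text

    def add_page(self, name, sources, render):
        """Register a page and record which data sources it depends on."""
        self.templates[name] = (render, tuple(sources))
        for source in sources:
            self.dependents[source].add(name)

    def receive(self, source, datum):
        """Store a new datum and rebuild only the pages that depend on it."""
        self.data[source] = datum
        rebuilt = []
        for name in sorted(self.dependents[source]):
            render, sources = self.templates[name]
            # Sources that have not reported yet render as None.
            values = {s: self.data.get(s) for s in sources}
            self.pages[name] = render(values)
            rebuilt.append(name)
        return rebuilt


site = StaticSite()
site.add_page("temps.html", ["station_a", "station_b"],
              lambda v: f"A={v['station_a']} B={v['station_b']}")
site.add_page("a_only.html", ["station_a"],
              lambda v: f"A={v['station_a']}")

print(site.receive("station_b", 21))  # ['temps.html']
```

Reads never touch a database in this setup; they just serve whatever is in `pages` (or the equivalent files), which is also why copying the output to more web servers adds no load on the data side.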
speaking from experience (Score:2, Insightful)
*sob*
Re:speaking from experience (Score:3, Insightful)
case in point:
We started off with HA and figured out how to go to a cloned configuration: two servers, two RAIDs, no SPoF, right? We had some LAN issues that caused traffic storms, and there was a bug in the controller logic, so both RAIDs crashed simultaneously. We fixed it by using another brand of RAID for one of the units. Those servers have not crashed since...
If you do the accounting, the biggest cause of lack of availability with HA sites is number 18: 18 inches in front of the keyboard. That's not because people are less skilled than before; it is because we have eliminated all the hardware issues. The stuff you don't automate is the stuff that is too complicated to automate, so only human error in making complicated changes remains. So with every downtime, there is usually an analyst looking sheepish, but it is usually not his/her fault. The process had some failing in it, and you have to fix the process. It's a lot like what I hear airliner crash investigations are like: find out what happened and fix the process so that it doesn't happen again.
You hone it over years, and every failure or even glitch is precious. Study it.