Distributed Versus Centralized DB?

OSXCPA asks: "I am developing a framework for processing accounting information. What advice would you have for setting up an environment where several (eventually, many) remote data aggregation and processing centers update a central server 'after hours' to avoid processing delays? For those with ERP experience: I am trying to process transactions locally, then update and validate the results against the central server, instead of generating a batch on the central server, so my remote users can process at full speed all day and the central DB can catch up later. Remote locations are all self-contained, in that no one will be posting transactions to another remote location. No big iron - this is all networked PC/Mac hardware and a central server (eventually a server farm, as needed). Feedback or constructive criticism would be appreciated. Technical advice is great, but I'm also looking for stuff like 'Gee, if I had only known XXX when I set this up, I would have done YYY...'"
  • by stefanlasiewski ( 63134 ) * on Saturday July 16, 2005 @03:41PM (#13082778) Homepage Journal
    Whatever solution you choose, be sure to give each DB entry an identifier that is unique across systems, and make sure the identifier does not change (e.g., don't use the customer's name, phone number or any other changeable information).

    One such solution: each customer transaction has an internal ID number, like "Order #1000". Each system prepends its own identifier to the order number, so order #1000 made on system 03 will have a UID of 03.1000. 03.1000 won't conflict with order #1000 made on system 02, because the latter UID is 02.1000.

    Using a unique internal UID also lets you keep all order history around. If the customer wants to cancel their order and retry, you simply mark 03.1000 as cancelled and start a new order with 03.1001.

    This may seem obvious to you, but I can't tell you how many times I've seen a product fail because the development team decided to use the customer's name/phone number/address as the UID, which meant the customer couldn't change their name/phone number/address without breaking the index or other features in the DB. Not sure why, but this seems to be a common problem in databases.
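
    A minimal sketch of that prefix-based UID scheme (the system identifier and the in-memory counter are assumptions; in practice the counter would live in the local database as a sequence):

        import itertools

        SYSTEM_ID = "03"  # hypothetical identifier assigned to this remote system

        # In-memory counter only to keep the sketch self-contained;
        # a real system would use a database sequence.
        _order_counter = itertools.count(1000)

        def new_order_uid():
            """Return a UID such as '03.1000' that cannot collide with other systems."""
            return f"{SYSTEM_ID}.{next(_order_counter)}"

        def cancel_and_retry(orders, old_uid):
            """Mark an order cancelled and open a fresh one, keeping full history."""
            orders[old_uid]["status"] = "cancelled"
            new_uid = new_order_uid()
            orders[new_uid] = {"status": "open"}
            return new_uid

        if __name__ == "__main__":
            orders = {}
            first = new_order_uid()                   # e.g. '03.1000'
            orders[first] = {"status": "open"}
            second = cancel_and_retry(orders, first)  # '03.1001'; '03.1000' kept as history
            print(first, orders[first], second, orders[second])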
  • by itwerx ( 165526 ) on Saturday July 16, 2005 @03:55PM (#13082862) Homepage
    I did that a while back for a data warehouse and reporting system with remote data collection points (half a TB of data, 50 million queries a day, and a couple of gigabytes to integrate every night). I did all the data validation and scrubbing and some preprocessing remotely to utilize the processing power of the collection machines as much as possible, and kept the back-end stored procedures strictly limited to import and reporting functions. The front-end reporting also had some functions built into the client so I could harness the power of those machines as well. YMMV in that regard, though, depending on the architecture and intended purpose.

    Tip 1 - don't use any clustering software on the back end. Every clustering model out there tries to be generic enough to handle a variety of apps, which adds unnecessary overhead and complexity. If you design the app for a multiplicity of servers in the first place, your performance and scalability will be much better! Not to mention everything will be a lot simpler, which makes troubleshooting much easier! In my app I just had a table with a list of server names. To add a server you just loaded a blank database on it, plugged the machine in and put the name in that table. A few key tables (such as that one) were automatically replicated every few minutes, and if the schema was missing on a server it got built and populated on the fly using self-generating scripts. (Very handy if you're making changes to the structure! :)
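
    A rough sketch of that server-registry idea, using SQLite as a stand-in for the real back end (table and column names are hypothetical):

        import sqlite3

        SCHEMA = """
        CREATE TABLE IF NOT EXISTS servers (name TEXT PRIMARY KEY);
        CREATE TABLE IF NOT EXISTS orders  (uid TEXT PRIMARY KEY, amount REAL);
        """

        def register_server(registry_db, name):
            """Add a new back-end server by recording its name in the registry table."""
            with sqlite3.connect(registry_db) as conn:
                conn.executescript(SCHEMA)   # build the schema on the fly if it is missing
                conn.execute("INSERT OR IGNORE INTO servers (name) VALUES (?)", (name,))

        def list_servers(registry_db):
            with sqlite3.connect(registry_db) as conn:
                return [row[0] for row in conn.execute("SELECT name FROM servers")]

        if __name__ == "__main__":
            register_server("registry.db", "db-node-04")   # plug in a blank box, add one row
            print(list_servers("registry.db"))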

    Tip 2 - when importing, drop your indexes! It's vastly faster to bulk-import into non-indexed tables and rebuild the indexes once afterwards, when you're ready for post-processing, than it is to try to maintain the indexes for every little insert (roughly a thousand-to-one speed ratio if the data is any size at all).
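
    A small illustration of the drop-then-rebuild approach, again with SQLite as a stand-in (the index and table names are made up):

        import sqlite3

        def bulk_import(db_path, rows):
            """Drop indexes, bulk-insert, then rebuild the indexes once at the end."""
            with sqlite3.connect(db_path) as conn:
                conn.execute("CREATE TABLE IF NOT EXISTS orders (uid TEXT, amount REAL)")
                conn.execute("DROP INDEX IF EXISTS idx_orders_uid")  # no per-row index maintenance
                conn.executemany("INSERT INTO orders (uid, amount) VALUES (?, ?)", rows)
                conn.execute("CREATE INDEX idx_orders_uid ON orders (uid)")  # rebuild once, afterwards

        if __name__ == "__main__":
            bulk_import("warehouse.db", [("03.1000", 19.99), ("02.1000", 5.00)])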

    Tip 3 - like tip 2, this is more general database design advice: base your normalization on the nature of the data and the queries, not necessarily on the technical granularity. In other words, a little bit of de-normalization is okay if it reduces the complexity of your queries enough. In our case we had an insane number of cases where many tables could be cross-related in different ways. Based on the reporting we needed to do, we purposely duplicated a few fields (mostly indexes) here and there between tables. This cut the memory utilization of some of the more complex queries by up to 50%. When some of the joins can result in gigabyte matrices, that's a significant savings!
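
    For example, a deliberately duplicated column can remove a join from the heavy reporting path entirely (a hypothetical schema, purely to show the idea):

        # Normalized: region lives only on customers, so reports must join.
        NORMALIZED_REPORT = """
        SELECT c.region, SUM(o.amount)
        FROM orders o JOIN customers c ON c.id = o.customer_id
        GROUP BY c.region
        """

        # Denormalized: region is deliberately duplicated onto orders,
        # so the heavy reporting query needs no join at all.
        DENORMALIZED_REPORT = """
        SELECT region, SUM(amount)
        FROM orders
        GROUP BY region
        """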

    Tip 4 - try to spread the load. In the complex queries above we would of course try to break them down into smaller queries that could be run in sequence but that often resulted in the final report taking too long. The answer was to move the aggregation to the client. I.e. have the client pass the different parts of the report off to multiple servers to run in parallel and then pull the results together at the client side.
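
    A bare-bones sketch of that client-side fan-out, assuming a query_server() helper that runs one partial query against one back-end server (all names here are hypothetical):

        from concurrent.futures import ThreadPoolExecutor

        def query_server(server, sql):
            """Placeholder: run `sql` against `server` and return a list of (key, value) rows."""
            raise NotImplementedError("replace with your real database client call")

        def run_report(servers, partial_sql):
            """Fan the same partial query out to every server, then aggregate at the client."""
            totals = {}
            with ThreadPoolExecutor(max_workers=len(servers)) as pool:
                for rows in pool.map(lambda s: query_server(s, partial_sql), servers):
                    for key, value in rows:
                        totals[key] = totals.get(key, 0) + value   # client-side aggregation
            return totals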

    That's all I can think of off the top of my head. If you want me to consult further just email me, I'm not terribly expensive. :)
  • by richg74 ( 650636 ) on Saturday July 16, 2005 @04:06PM (#13082933) Homepage
    I've implemented a couple of database applications across multiple geographical locations (in "Wall Street" contexts), and I think the one thing you must be sure of is that you really understand all the ways the data is used by, and impacts, the real-world business.

    You mention that this is for an accounting application. You suggest a scheme with local databases that periodically send updates to the central facility. If the books and records are in a strict hierarchy (each node is contained in exactly one "higher" node), this can work fine for some purposes, like financial reporting.

    In the situations I worked on, people initially agreed that this sort of setup would be fine. However, as we dug a little deeper, we realized it wouldn't work, because financial reporting wasn't the only real need. In the context of an investment bank, the capacity to take risk is a (finite) asset, and the amount of capacity that's available at any time is a function of all the trades that have been done anywhere -- since everything is ultimately reflected back to the parent's accounts. Had we not realized that, we would have built a fine system that we would have started rebuilding as soon as we'd finished it (cf. There's never time to do it right, but always time to do it over).

    I won't belabor things like redundancy, backups, and so on; I assume you have taken careful thought for those.

    Finally, IMO, the trickiest thing to get right in a distributed DB system is maintaining the referential and logical integrity of the database. I personally know of one well-known DB vendor's initial replication solution (names omitted to protect the guilty) that had a subtle but potentially very serious logical flaw. Moral: do your own homework.

  • Not economical (Score:3, Informative)

    by duffbeer703 ( 177751 ) on Saturday July 16, 2005 @07:19PM (#13083913)
    We looked into doing something like this a year ago, and determined that with the cost of networking and clusters of cheap computers being the way they are, a centralized solution was far cheaper.

    My scenario was a distributed environment with approximately 3,000 remote sites. If you have fewer remote sites, distributed processing may be cost-effective, but you need to use extreme caution, as distributed solutions can escalate in cost quickly.
  • AfterHours == Bad (Score:4, Informative)

    by iwadasn ( 742362 ) on Saturday July 16, 2005 @10:07PM (#13084595)

    Be very careful about trying to "fix everything" after hours. That's how we do some things (and did the same thing at my previous place), and it's a nightmare. If anything is wrong in the morning, then it's hard to go back and re-run the previous night's batch processing. Often it takes too long, often you would just corrupt data further, etc... Very often this results in us having to just have incorrect data for a day, and then get it right the next day. In addition, due to global expansion, we only have 2 hours each day in which to run this stuff. If a market were to stay open an extra little while, or we try to expand somewhere in the forbidden slot, I don't know what would become of us.

    You are far better off trying to get everything to be realtime, as much as possible, and build in mechanisms for the system to recover from corruption and incrementally do batching throughout the day. Do NOT rely on the fact that the system goes offline for 8 hours each night, that's an easy way to produce a system that costs you your job.
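
    One rough sketch of moving in that direction: queue each transaction as it happens and drain the queue continuously, rather than betting on an overnight window (the post_to_central() call is a stand-in for the real upload):

        import queue, threading, time

        tx_queue = queue.Queue()

        def post_to_central(tx):
            """Placeholder for the real upload; raise on failure so the item is retried."""
            print("posted", tx)

        def worker():
            while True:
                tx = tx_queue.get()
                try:
                    post_to_central(tx)      # incremental batching all day long
                except Exception:
                    tx_queue.put(tx)         # keep the transaction and retry later
                    time.sleep(5)
                finally:
                    tx_queue.task_done()

        threading.Thread(target=worker, daemon=True).start()
        tx_queue.put({"uid": "03.1000", "amount": 19.99})
        tx_queue.join()                      # a long-running service would not block like this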

  • by Kefaa ( 76147 ) on Monday July 18, 2005 @11:47AM (#13094231)
    Here is my short list, based on real-life experience:

    * Read "Been There, Done That" above.
    * If your system is geographically dispersed there is no night time. If you can handle the transactions at night, then handle them "near realtime" by sending the transactions into a queue to be processed when bandwidth, CPU, etc. are available.
    * Assume the Central database will not respond for three days (a tornado, etc. at the site) and see what your model will do versus what the business wants.
    * Unplug the network during a transfer to the Central site. What do you expect it to do when an admin pulls the cable for a second to clean behind the server?
    * Identify units of work: those transactions the business believes must occur together. These are often different from DB units of work.
    * Expect bad data. People will tell you that it does not happen, but if you send transactions via a secondary process, it will. A record that was good on the remote box will be unusable on the Central, or vice versa. Now what should happen?
    * Compress all transactions being sent to/fro. While some will be tiny and hardly seem worth it, a consistent approach makes life at 3:00 a.m. a lot easier (see the sketch after this list).
    * XML is your best friend and worst enemy. If you choose to use it, know why it is good for your application.
    * Avoid any product, template, approach or design built for the future. If you are not going to use it in the initial production roll-out, stay away. If you are that good at reading the future, buy lottery tickets. Unless this is a very small/fast project, your options twelve months from now will be very different and you will have spent the time/money/energy building a solution without a use.
    * Consider what areas are most likely to change, based on past business experience, and make them easy to change. (No, this does not conflict with the previous bullet.)
    * New tools and additional bodies will not help when you are late. Add more people early and your life will be much easier. Rapid Development by S. McConnell is still the bible for PM.
    * Create a lab where you connect the Central DB to a set of remote servers (also in the lab). They should be blazing fast as they are connected locally, and they give you the ability to create "what if" cases in-house.
    * Create a step-by-step process for rebuilding a server from scratch, from the time the hardware is turned on to the time the server can be released. This is a painful exercise and will take a week or more of full-time work, but it will make every additional site easy to implement. Have a standard for what is dynamic based on the server (e.g. Step 26: Change ##siteName## to the local host name), so people type "www.abcCompany.com" instead of siteName. Then have someone not involved in the documentation try it.
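
    A tiny sketch of the compress-everything rule mentioned in the list above, using gzip from the standard library (the field names are made up):

        import gzip, json

        def pack(transactions):
            """Serialise and compress a batch of transactions before sending."""
            return gzip.compress(json.dumps(transactions).encode("utf-8"))

        def unpack(payload):
            """Decompress and parse a batch received from a remote site."""
            return json.loads(gzip.decompress(payload).decode("utf-8"))

        if __name__ == "__main__":
            batch = [{"uid": "03.1000", "amount": 19.99}]
            wire = pack(batch)                 # same code path for tiny and huge batches
            assert unpack(wire) == batch
            print(len(wire), "bytes on the wire")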

    Hope this helps.
    Cheers.
  • by ecklesweb ( 713901 ) on Monday July 18, 2005 @02:28PM (#13096136)
    I'm in a situation where we have a few thousand retail locations that send data up at night. A few points of pain:

    1. Use an extensible format for your batch uploads.
    What you think you want to upload today will be different from (and probably less than) what you want to upload tomorrow. Use an extensible format for your batch uploads (XML is awfully nice in this regard; it just eats up bandwidth). That way, when you make upgrades to part of the chain, your server software can gracefully handle differences in the data being sent up from different locations. If you try to "bleed the turnip" in terms of bandwidth and go with something like a fixed-width or comma-separated file, you're going to pay for it in maintenance down the road.
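
    As a rough illustration of an extensible upload format (the element names here are invented), older server code can simply ignore elements it does not recognise:

        import xml.etree.ElementTree as ET

        def build_batch(store_id, sales):
            """Build an upload document; new child elements can be added later without
            breaking servers that only read the fields they know about."""
            batch = ET.Element("batch", attrib={"store": store_id, "version": "1"})
            for sale in sales:
                item = ET.SubElement(batch, "sale", attrib={"uid": sale["uid"]})
                ET.SubElement(item, "amount").text = str(sale["amount"])
            return ET.tostring(batch, encoding="unicode")

        print(build_batch("0042", [{"uid": "03.1000", "amount": 19.99}]))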

    2. Keep a backup of the data at the remote location until the remote location receives a well-formed acknowledgement from the central location. Just because the remote location sends up the data doesn't mean it's captured; a lot of things can go wrong, and it makes life much easier if you can recover the data at the remote location later. Furthermore, PLAN to miss some uploads; it's just going to happen sometimes, and if your system is designed to handle that fact, you'll be much better off.
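
    A sketch of that retain-until-acknowledged idea (the send_batch() call, directory layout, and ack handling are assumptions):

        import os, shutil

        def send_batch(path):
            """Placeholder for the real upload; return True only on a well-formed ack."""
            return False   # pretend the central site never acknowledged tonight

        def upload_with_retention(pending_dir, sent_dir):
            """Move a batch file out of 'pending' only once the central site has acknowledged
            it; anything still in 'pending' is retried on the next run."""
            os.makedirs(sent_dir, exist_ok=True)
            for name in sorted(os.listdir(pending_dir)):
                path = os.path.join(pending_dir, name)
                if send_batch(path):
                    shutil.move(path, os.path.join(sent_dir, name))  # safe to let go of the copy
                # else: leave the file in place; a missed upload is recovered later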

    3. Document your data formats - the format at the remote location and the central location (presumably those are relational databases and can be documented with detailed ER diagrams) and the format used for transmission. Troubleshooting is a pain in the ass if you are unsure about how to map a transmission field to a remote or central DB field. Seems like a no-brainer, but if you don't do it from the beginning you're not ever going to do it.

    4. You might want to give your remote locations random upload times within the overall upload window so as not to "slashdot" yourself. If you want the remote locations to upload between 1 AM and 4 AM, write a script, scheduled in cron, that triggers the upload at some random time during that period. If all your remote locations start uploading at 01:00:00.00, you're going to have trouble scaling.
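
    For instance, a cron entry at the start of the window plus a random delay spreads the load (the three-hour window and the upload script path are assumptions):

        # Launched by cron at 01:00 at every remote site, e.g.:
        #   0 1 * * *  /usr/bin/python3 /opt/store/upload.py
        import random, subprocess, time

        WINDOW_SECONDS = 3 * 60 * 60            # uploads land anywhere between 01:00 and 04:00

        time.sleep(random.uniform(0, WINDOW_SECONDS))            # each site picks its own slot
        subprocess.run(["/opt/store/send_batch.sh"], check=True)  # hypothetical upload script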

    This kind of thing isn't rocket science, but once you start down the wrong path, it's easy to get entrenched to the point where it's easier to live with your mistakes than to take corrective action.
