Distributed Versus Centralized DB?

OSXCPA asks: "I am developing a framework for processing accounting information. What advice would you have for setting up an environment where several (eventually, many) remote data aggregation and processing centers update a central server 'after hours' to avoid processing delays? For those with ERP experience: instead of generating a batch on the central server, I am trying to process transactions locally and then update and validate the results against the central server later, so my remote users can process at full speed all day and the central DB can get caught up later. Remote locations are all self-contained, in that no one will be posting transactions to another remote location. No big iron - this is all networked PC/Mac hardware and a central server (eventually a server farm, as needed). Feedback or constructive criticism would be appreciated. Technical advice is great, but I'm also looking for stuff like 'Gee, if I had only known XXX when I set this up, I would have done YYY...'"
  • by stefanlasiewski ( 63134 ) * <slashdotNO@SPAMstefanco.com> on Saturday July 16, 2005 @03:41PM (#13082778) Homepage Journal
    Whatever solution you choose, be sure to give each DB entry an identifier that is unique across systems, and make sure the identifier never changes (e.g. don't use the customer's name, phone number, or any other changeable information).

    One such solution: each customer transaction has an internal ID number, like "Order #1000". Each system prepends its own identifier to the order number, so order #1000 made on System 03 has a UID of 03.1000. 03.1000 won't conflict with order #1000 made on System 02, because the latter's UID is 02.1000.

    Using a unique internal UID also lets you keep all order history around. If the customer wants to cancel their order and retry, you simply mark 03.1000 as cancelled and start a new order with 03.1001.

    This may seem obvious to you, but I can't tell you how many times I've seen a product fail because the development team decided to use the customer's name/phone number/address as the UID, which meant the customer couldn't change their name/phone number/address without breaking the index or other features in the DB. Not sure why, but this seems to be a common problem in databases.
    • I currently work with a data warehouse, and while your purposes may differ, the configuration of these is similar. We have many transaction processing systems that feed the data warehouse with information. Ideally these systems would update the warehouse as transactions happen, instead of via daily batch jobs, obviously with a low-priority queueing mechanism so that increased load in the transaction system, network failures, etc., would not cause the transaction system to fail. I wish I worked with a
    • If I were going to do what you are describing, I would place the SystemID and OrderNumber in separate fields and combine them in application logic if that is what is needed. It will make querying the data easier on the database engine.
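
      A minimal sketch of that two-column approach, using sqlite3 from the Python standard library (the table and column names here are invented for illustration): the system identifier and the per-system order number live in separate fields with a composite primary key, and the "03.1000"-style UID is only assembled in application logic for display.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        system_id TEXT NOT NULL,     -- which remote site created the order
        order_seq INTEGER NOT NULL,  -- per-site order counter
        customer  TEXT,
        status    TEXT DEFAULT 'open',
        PRIMARY KEY (system_id, order_seq)  -- unique across all sites
    )
""")

def display_uid(system_id, order_seq):
    """Assemble the human-readable UID ('03.1000') in application logic."""
    return f"{system_id}.{order_seq}"

# Order #1000 on system 03 and order #1000 on system 02 coexist happily.
conn.execute("INSERT INTO orders (system_id, order_seq, customer) VALUES ('03', 1000, 'Alice')")
conn.execute("INSERT INTO orders (system_id, order_seq, customer) VALUES ('02', 1000, 'Bob')")

for system_id, order_seq in conn.execute("SELECT system_id, order_seq FROM orders"):
    print(display_uid(system_id, order_seq))
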
  • by Shag ( 3737 ) on Saturday July 16, 2005 @03:44PM (#13082802) Journal
    I am not a professional DBA, but I tend to wind up working with them. The one thing I remember from a big project where replication and whatnot came into play was that none of that stuff ever worked as well as the vendors said it would, and it all required a lot more work on the DBA's part (and typically lots more $$$ in general) than it was supposed to.
    • "...none of that stuff ever worked as well as the vendors said it would, and it all required a lot more work on the DBA's part (and typically lots more $$$ in general) than it was supposed to."

      So in other words it's a lot like any other kind of software you buy?

      • Well, hmm. Our DBA was easily top-5 in our state, president of the local 'Orrible users group, and so on. Backed up by some external conslutants who'd been working with the company for years before I even got there, and were total gurus.

        To this day, I don't know whether they ever got it to work.

        But yeah, similar things happened with other bits of software the company spent 7-figures on...

  • by itwerx ( 165526 ) on Saturday July 16, 2005 @03:55PM (#13082862) Homepage
    I did that a while back for a data warehouse and reporting system with remote data collection points (half a TB of data, 50 million queries a day, and a couple of gigs to integrate every night). I did all the data validation, scrubbing, and some preprocessing remotely to use the processing power of the collection machines as much as possible, and kept the backend stored procedures strictly limited to import and reporting functions. The front-end reporting also had some functions built into the client so I could harness the power of those machines as well. YMMV in that regard, though, depending on the architecture and intended purpose.

    Tip 1 - don't use any clustering software on the back end. Every clustering model out there tries to be generic enough to handle a variety of apps, which adds unnecessary overhead and complexity. If you design the app for multiple servers in the first place, your performance and scalability will be much better, and everything will be a lot simpler, which makes troubleshooting much easier! In my app I just had a table with a list of server names. To add a server you just loaded a blank database on it, plugged the machine in, and put the name in that table. A few key tables (such as that one) were automatically replicated every few minutes, and if the schema was missing on a server it got built and populated on the fly using self-generating scripts. (Very handy if you're making changes to the structure! :)
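
    A rough sketch of the "just add the name to a table" idea, with sqlite3 standing in for the real backend (the registry table and DDL are invented for illustration): the central registry lists the servers, and a bootstrap routine builds any missing schema on a freshly plugged-in node.

import sqlite3

SCHEMA_DDL = [
    "CREATE TABLE IF NOT EXISTS servers (name TEXT PRIMARY KEY)",
    "CREATE TABLE IF NOT EXISTS facts (id INTEGER PRIMARY KEY, payload TEXT)",
]

def bootstrap(conn):
    """Build any missing tables on a node; safe to run repeatedly."""
    for ddl in SCHEMA_DDL:
        conn.execute(ddl)
    conn.commit()

def add_server(registry_conn, name):
    """Adding a node is just one row in the registry table."""
    registry_conn.execute("INSERT OR IGNORE INTO servers (name) VALUES (?)", (name,))
    registry_conn.commit()

registry = sqlite3.connect(":memory:")   # stands in for the central DB
bootstrap(registry)
add_server(registry, "node-07")

new_node = sqlite3.connect(":memory:")   # a freshly plugged-in blank database
bootstrap(new_node)                      # schema appears on the fly
print([row[0] for row in registry.execute("SELECT name FROM servers")])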

    Tip 2 - when importing, drop your indexes! It's vastly faster to bulk import into non-indexed tables and then rebuild the indexes once afterwards, when you're ready for post-processing, than it is to try to maintain the indexes for every little insert (roughly a thousand-to-one speed ratio if the data is of any size at all).
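
    A toy illustration of the drop-then-rebuild ordering, again with sqlite3 (real engines have their own bulk-load paths, so treat this purely as a sketch):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txns (id INTEGER PRIMARY KEY, account TEXT, amount REAL)")
conn.execute("CREATE INDEX idx_txns_account ON txns (account)")

rows = [(i, f"acct-{i % 100}", i * 0.01) for i in range(100_000)]

# 1. Drop the secondary index before the bulk import...
conn.execute("DROP INDEX idx_txns_account")

# 2. ...load everything in one shot inside a single transaction...
with conn:
    conn.executemany("INSERT INTO txns VALUES (?, ?, ?)", rows)

# 3. ...then rebuild the index once, when you're ready for post-processing.
conn.execute("CREATE INDEX idx_txns_account ON txns (account)")
print(conn.execute("SELECT COUNT(*) FROM txns").fetchone()[0])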

    Tip 3 - like tip 2, this is more general database design advice: base your normalization on the nature of the data and the queries, not necessarily on the technical granularity. In other words, a little bit of de-normalization is okay if it reduces the complexity of your queries enough. In our case we had an insane number of places where many tables could be cross-related in different ways. Based on the reporting we needed to do, we purposely duplicated a few fields (mostly key columns) here and there between tables. This cut the memory utilization of some of the more complex queries by up to 50%. When some of the joins can result in gigabyte-sized intermediate results, that's a significant savings!

    Tip 4 - try to spread the load. For the complex queries above we would of course try to break them down into smaller queries that could be run in sequence, but that often resulted in the final report taking too long. The answer was to move the aggregation to the client: have the client pass the different parts of the report off to multiple servers to run in parallel, and then pull the results together on the client side.
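
    One way to picture the client-side aggregation, using a thread pool with sqlite3 connections standing in for the per-server report queries (the data partitioning here is invented for illustration):

import sqlite3
from concurrent.futures import ThreadPoolExecutor

def make_server(seed):
    """Stand-in for one backend server holding a slice of the data."""
    conn = sqlite3.connect(":memory:", check_same_thread=False)
    conn.execute("CREATE TABLE sales (amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?)", [(seed * i,) for i in range(1, 1001)])
    return conn

servers = [make_server(s) for s in (1, 2, 3)]

def partial_total(conn):
    """Each server computes its own piece of the report."""
    return conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]

# Fan the sub-queries out in parallel, then pull the results together client-side.
with ThreadPoolExecutor(max_workers=len(servers)) as pool:
    grand_total = sum(pool.map(partial_total, servers))

print(grand_total)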

    That's all I can think of off the top of my head. If you want me to consult further just email me, I'm not terribly expensive. :)
    • And that is to base your normalization on the nature of the data and the queries, not necessarily on the technical granularity.

      Be careful with denormalization - it can trigger a shitstorm of updates if one of the denormalized columns changes. Rather than changing one value in a table that is joined to another in a query, you may end up updating many millions of records in a denormalized table.

      It is OK to denormalize tables for query performance, or for report-based tables, but if you are changing data in th
    • I have been doing several large-scale DWH projects, especially in the telecommunications field, and the previous post contains good advice. I would also suggest looking into a decent ETL toolkit; especially newer-generation technology like DataStage PX and Ab Initio works well in distributed scenarios and is specifically built for this type of job. Also, in many cases, they are significantly faster for this type of workload. Of course, you might also want to look into real-time or near-time solutions, esp. when you have large
    • Assuming you're building a custom system (as opposed to finding something off-the-shelf):

      Tip 5ish - Communication lines fail. Power lines go down at your remote sites. If something fails at the wrong time, your nightly process might not get the data in time to finish. I recommend a "trickle" approach, where the transaction processing software on your remote nodes hands off the resulting data to a separate "central update" program. (This can be as simple as setting a "complete for transfer" flag in your loc
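
      A minimal sketch of that trickle hand-off, assuming a local table with a transfer-state flag (the column names and the send_to_central stub are invented): the transaction code only marks rows ready, and a separate pass ships them whenever the link happens to be up.

import sqlite3

local = sqlite3.connect(":memory:")
local.execute("""
    CREATE TABLE txns (
        id         INTEGER PRIMARY KEY,
        payload    TEXT,
        xfer_state TEXT DEFAULT 'pending'  -- pending -> ready -> sent
    )
""")

def record_transaction(payload):
    """Transaction processing just writes the row and marks it ready."""
    local.execute("INSERT INTO txns (payload, xfer_state) VALUES (?, 'ready')", (payload,))
    local.commit()

def send_to_central(row):
    """Stub for the real upload; return True only on a confirmed ack."""
    return True

def trickle_once():
    """Separate 'central update' pass: ship whatever is ready, whenever it can."""
    for row_id, payload in local.execute(
            "SELECT id, payload FROM txns WHERE xfer_state = 'ready'").fetchall():
        if send_to_central((row_id, payload)):
            local.execute("UPDATE txns SET xfer_state = 'sent' WHERE id = ?", (row_id,))
    local.commit()

record_transaction("invoice 42")
trickle_once()
print(local.execute("SELECT id, xfer_state FROM txns").fetchall())
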
  • by richg74 ( 650636 ) on Saturday July 16, 2005 @04:06PM (#13082933) Homepage
    I've implemented a couple of database applications across multiple geographical locations (in "Wall Street" contexts), and I think the one thing that you must be sure of is that you really understand all the ways in which the data is used in and impacts the real-world business.

    You mention that this is for an accounting application. You suggest a scheme with local databases that periodically send updates to the central facility. If the books and records are in a strict hierarchy (each node is contained in exactly one "higher" node), this can work fine for some purposes, like financial reporting.

    In the situations I worked on, people initially agreed that this sort of setup would be fine. However, as we dug a little deeper, we realized it wouldn't work, because financial reporting wasn't the only real need. In the context of an investment bank, the capacity to take risk is a (finite) asset, and the amount of capacity that's available at any time is a function of all the trades that have been done anywhere -- since everything is ultimately reflected back to the parent's accounts. Had we not realized that, we would have built a fine system that we would have started rebuilding as soon as we'd finished it (cf. There's never time to do it right, but always time to do it over).

    I won't belabor things like redundancy, backups, and so on; I assume you have taken careful thought for those.

    Finally, IMO, the trickiest thing to get right in a distributed DB system is maintaining the referential and logical integrity of the database. I personally know of one well-known DB vendor's initial replication solution (names omitted to protect the guilty) that had a subtle but potentially very serious logical flaw. Moral: do your own homework.

  • by Anonymous Coward
    I can't stress this enough: the purpose of a database is to guarantee data integrity. Be sure you understand that: storing/retrieving/manipulating data is part of it too, but without data integrity, you're lost.

    Unfortunately the state of the database industry (both vendors and users) is so fucked up right now I'm not even sure people know what "data integrity" means. I'll give you one definition: it means that every possible combination of values in your database has a legal business meaning. Does it make
  • Not economical (Score:3, Informative)

    by duffbeer703 ( 177751 ) on Saturday July 16, 2005 @07:19PM (#13083913)
    We looked into doing something like this a year ago, and determined that with the cost of networking and clusters of cheap computers being the way they are, a centralized solution was far cheaper.

    My scenario was a distributed environment with approximately 3,000 remote sites. If you have fewer remote sites, distributed processing may be cost-effective, but you need to use extreme caution, as distributed solutions can escalate in cost quickly.
  • by j()nty ( 741565 )
    I'm in the same situation, writing an application with about 500 users spread over 30 offices, a local database in each office, a central database, and overnight updates from the offices to the central database.

    I'm using globally unique IDs (GUIDs) for all records. These are generated by the database system I'm using and are guaranteed to be different across all the offices, so there are no conflicts between new records created at different offices. And all I have to do is use an existing feature of the database system.
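
    If the database doesn't generate them for you, the standard library will; a small sketch with Python's uuid module (random version-4 IDs), which gives the same no-coordination-between-offices property:

import uuid

def new_record_id():
    """Random 128-bit ID; collisions across offices are astronomically unlikely."""
    return str(uuid.uuid4())

# Two offices create records independently, with no central counter to consult.
office_a_record = {"id": new_record_id(), "office": "A", "amount": 125.00}
office_b_record = {"id": new_record_id(), "office": "B", "amount": 99.50}
print(office_a_record["id"], office_b_record["id"])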

  • by Anonymous Coward
    Since you mentioned this application is being rolled out to a bunch of PCs and Macs, it sounds to me like you might be deploying to a large installation base where you do NOT have centralized management of the machines...

    At some point in the future, you probably will be upgrading the system to support new features / new data.

    You'll want to design the interfaces between the remote sites and "the mothership" so that as migrations are partially completed, the system continues to run for the older and the ne
  • AfterHours == Bad (Score:4, Informative)

    by iwadasn ( 742362 ) on Saturday July 16, 2005 @10:07PM (#13084595)

    Be very careful about trying to "fix everything" after hours. That's how we do some things (and did the same thing at my previous place), and it's a nightmare. If anything is wrong in the morning, it's hard to go back and re-run the previous night's batch processing: often it takes too long, often you would just corrupt the data further, and so on. Very often this means we simply have incorrect data for a day and then get it right the next day. In addition, due to global expansion, we only have two hours each day in which to run this stuff. If a market were to stay open an extra little while, or if we tried to expand somewhere in the forbidden slot, I don't know what would become of us.

    You are far better off trying to get everything as close to real time as possible, building in mechanisms for the system to recover from corruption, and doing batching incrementally throughout the day. Do NOT rely on the fact that the system goes offline for 8 hours each night; that's an easy way to produce a system that costs you your job.
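
    A bare-bones sketch of the "incrementally do batching throughout the day" idea: transactions go onto a queue as they happen, and a background worker drains it to the central store whenever it can, retrying instead of failing hard (the apply_to_central stub and the in-memory "central" list are invented stand-ins):

import queue
import threading
import time

outbox = queue.Queue()
central = []   # stand-in for the central database

def apply_to_central(txn):
    """Stub for the central update; raise to simulate a transient failure."""
    central.append(txn)

def worker():
    while True:
        txn = outbox.get()
        if txn is None:          # shutdown sentinel
            break
        try:
            apply_to_central(txn)
        except Exception:
            time.sleep(1)        # back off and retry rather than losing the txn
            outbox.put(txn)
        finally:
            outbox.task_done()

threading.Thread(target=worker, daemon=True).start()

for i in range(5):
    outbox.put({"txn": i})       # posted as they happen, not at 2 a.m.
outbox.join()
print(len(central), "transactions applied")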

  • You are entering an area of database technology where angels fear to tread. If this were easy, we wouldn't be paying top DBAs hundreds of thousands of dollars to do it and keep it working with acceptable uptime. There is a reason why companies that operate at the scale of Amazon have IT expenses in the millions and millions, and why they do their darnedest to hire and keep the best talent in the country. You're entering the big leagues now. You're going to need a professional team to handle t
    • This is just fear-mongering, with no primary data to substantiate the claim that the task is difficult, only hearsay that there are people who spend a lot to do something that sounds, in a two-sentence description, vaguely similar.
      From the original description, it sounded to me like the central DB is purely for mining purposes, to monitor what is going on and perhaps to learn from patterns in account activity. As such, the decoupled, do-it-yourself approach IS the right thing, and spending big bucks will do
    • It all comes down to exactly what the original poster meant by "Remote locations are all self-contained, in that no one will be posting transactions to another remote location."

      Is the data self-contained, or just the users? If it's just the users, I agree with you that he's underestimating the difficulty. If it's the data, then it's pretty easy to do, and I've done it before: he doesn't have to solve the remote locking or consistency problems, he just has to get his partitioning right.
    • If your manager can't agree to this, then you need to tell him this next part: "I cannot implement what you need unless you give me the tools and resources to do it...

      Be up-front, bold, and clear as heck with your manager, and don't try to hide anything from them.


      As an IT professional I certainly agree with your take on this; however, I would take a different approach to informing my manager. Instead of saying "I cannot", I would couch the dilemma in terms of risk. Explain the difficulty of the problem
    • and you may decide you will need a mainframe - don't write it off too soon.

      Yes, let's see which companies are surplusing their old '70s-era equipment so we can implement a brand new system on outdated hardware. You do realize the reason mainframes are still in use today is that the software sitting on them was so expensive to develop that it is cheaper to maintain it there than to write new code on a modern platform. There is virtually no new mainframe development, just upkeep programming. If you need someth
  • Hi,

    In case you are working with an OO language (Java or C#?), you may want to look into existing frameworks that provide object-oriented replication.

    That would relieve you from building features like the following yourself:
    - Universally Unique Identifiers (UUIDs)
    - Cascaded replication with object member traversal
    - Conflict resolution at object level providing callbacks

    The following open source object database engine provides such functionality and you may even consider it as an alternative to relatio
  • I constantly have to remind myself as well:
    1. Find the simplest solution that could possibly work.
    2. Design for testing. If you can't easily test and debug the system, then your design is bad.
    3. Avoid distribution and threading if possible (see #2).
    That being said... there aren't simple solutions to all problems. Try to find a base to build from with source code and a license that you can live with.
    • Good advice...

      I'd suggest adding some more basic points:
      4. Design for change, but don't go overboard. If new requirements or a sub-system design failure comes along, be ready to go to plan B without starting over.
      5. Design for performance auditing. When the system gets big, the question "why is X so slow?" WILL come up. It shouldn't be a 2 man-week project by senior staff to answer that question. After data integrity, performance is likely your biggest headache, especially if you add near-realtime funct

  • Distributed (Score:3, Insightful)

    by JoeCommodore ( 567479 ) <larry@portcommodore.com> on Sunday July 17, 2005 @04:52PM (#13088330) Homepage
    I've not been trained in building distributed systems but have learned on my own from the school of hard knocks.

    As mentioned, a good record ID strategy is a must; I used a packed timestamp (base 60, using upper/lower-case characters) with a node/user identifier. Besides keeping the records unique, you can also use the ID to track down where and when problems might have occurred.
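
    A rough sketch of that style of ID (the post says base 60; a base-62 alphabet of digits plus upper- and lower-case letters is assumed here, and the node code is made up): pack a high-resolution timestamp into a short string and prefix it with the node identifier, so the ID itself tells you roughly where and when the record was created.

import time
import string

ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase  # base 62

def pack(n):
    """Encode a non-negative integer into the compact alphabet."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, rem = divmod(n, len(ALPHABET))
        out.append(ALPHABET[rem])
    return "".join(reversed(out))

def record_id(node):
    """Node/user identifier plus packed microsecond timestamp."""
    return f"{node}-{pack(time.time_ns() // 1000)}"

# Add a per-process counter if two records can share a microsecond on one node.
print(record_id("N07"))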

    Also time stamps and delete logs!!! You need to know what data is new or old or should have been deleted.

    Work on a way to automate updates: you are going to be coding not for one computer but for all your nodes on the network, so set something up that can keep your data and apps in check and updated. A big stumbling block is getting the data transfer structure and protocol worked out.

    Look at the importance and frequency of the data and make sure your system can handle it. If nightly is OK, then nightly is fine, but there will be times when your satellites are down; don't paint yourself into a corner by expecting every transfer to work right every time.

    Choose your apps wisely; since you mentioned Macs, that narrows your choice of what you can use. We had used FoxBase+/Mac, but 4th Dimension is cross-platform and looks very flexible. It could be done using LAMP at the nodes, but you will have to work out your own distributed data import/export scripts. (DBF transfer was so easy.)

    I find you have to implement a lot more validation at the remote end, since you expect the data to already be clean when it comes in from the other offices.

    Also things happen to different sites, and they may have to restore from backups etc. Work out how to re-sync the sites when such events occur.

    You are looking at a very complex system to manage. (those thousand or so lines of spaghetti code will keep you employed)

    Normalize and clean up your database as much as possible now, and include planned data elements (even if you hide them for now); making structure changes once you're distributed can be a real pain.

    With that said, after years of being distributed, technology here in our rural community now makes it possible to centralize, and I'm relieved that I can make the switch (though I may keep bits distributed: field laptops and such). It's a good skill to learn; it adds a whole new level of paranoia when you're coding. :-)
  • by Kefaa ( 76147 ) on Monday July 18, 2005 @11:47AM (#13094231)
    Here is my short list based on real life experiences:

    *Read "Been there, done that" above.
    *If your system is geographically dispersed, there is no night time. If you can handle the transactions at night, you can handle them "near realtime" by sending the transactions into a queue to be processed when bandwidth, CPU, etc. are available.
    *Assume the Central database will not respond for three days (a tornado, etc. at the site) and see what your model will do versus what the business wants.
    *Unplug the network during a transfer to the Central site. What do you expect it to do when an admin pulls the cable for a second to clean behind the server?
    *Identify units of work: those transactions the business believes must occur together. These are often different from DB units of work.
    *Expect bad data. People will tell you that it does not happen, but if you send transactions via a secondary process, it will. A record that was good on the remote box will be unusable on the Central site, or vice versa. Now what should happen?
    *Compress all transactions being sent to and fro (see the sketch after this list). While some will be tiny and hardly seem worth it, a consistent approach makes life at 3:00 a.m. a lot easier.
    *XML is your best friend and worst enemy. If you choose to use it, know why it is good for your application.
    *Avoid any product, template, approach or design built for the future. If you are not going to use it in the initial production roll-out, stay away. If you are that good at reading the future, buy lottery tickets. Unless this is a very small/fast project, your options twelve months from now will be very different and you will have spent the time/money/energy building a solution without a use.
    *Consider what areas are most likely to change, based on past business experience, and make them easy to change. (No, this does not conflict with the previous bullet.)
    *New tools and additional bodies will not help when you are late. Add more people early and your life will be much easier. Rapid Development by S. McConnell is still the bible for PM.
    *Create a lab where you connect the Central DB to a set of remote servers (also in the lab). They should be blazing fast as they are connected locally, and they give you the ability to create "what if" cases in-house.
    *Create a step-by-step process for rebuilding a server from scratch, from the time the hardware is turned on to the time the server can be released. This is a painful exercise and will take a week or more of full-time work, but it will make every additional site easy to implement. Have a standard for what is dynamic per server (e.g. Step 26: change ##siteName## to the local host name) so people type "www.abcCompany.com" instead of the literal siteName. Then have someone not involved in writing the documentation try it.
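
    A small sketch of the compress-everything rule from the list above, using gzip and json from the standard library so that every transfer uses the same envelope whether it holds three transactions or three million:

import gzip
import json

def pack_batch(transactions):
    """Same envelope for every transfer, however small: JSON, then gzip."""
    return gzip.compress(json.dumps(transactions).encode("utf-8"))

def unpack_batch(blob):
    return json.loads(gzip.decompress(blob).decode("utf-8"))

batch = [{"txn": i, "amount": i * 1.5} for i in range(1000)]
blob = pack_batch(batch)
print(len(json.dumps(batch)), "bytes raw ->", len(blob), "bytes compressed")
assert unpack_batch(blob) == batch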

    Hope this helps.
    Cheers.
    • Okay, thanks for these - this was one of the most useful posts I received for my original question.

      One follow-on question: why do you say "XML is your best friend and worst enemy"? I had planned on using XML to move data into, out of, between, and across program elements. I have also only just started working with XML, so your comment filled me with some dread.

      Thanks for your insight -
  • by ecklesweb ( 713901 ) on Monday July 18, 2005 @02:28PM (#13096136)
    I'm in a situation where we have a few thousand retail locations that send data up at night. A few points of pain:

    1. Use an extensible format for your batch uploads.
    What you think you want to upload today will be different from (and probably less than) what you want to upload tomorrow. Use an extensible format for your batch uploads (XML is awfully nice in this regard; it just eats up bandwidth). That way, when you make upgrades to part of the chain, your server software can gracefully handle differences in the data being sent up from different locations. If you try to "bleed the turnip" in terms of bandwidth and go with something like a fixed-width or comma-separated file, you're going to pay for it in maintenance down the road.
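
    The extensibility point is mostly about tagging fields by name instead of by position so that old and new store versions can coexist; a sketch using JSON lines rather than XML, purely for illustration (the field names are invented):

import json

# An older store sends fewer fields; a newer store sends more.
old_store_row = {"store": "0042", "sku": "A-17", "qty": 3}
new_store_row = {"store": "0099", "sku": "A-17", "qty": 5, "promo_code": "SUMMER"}

upload = "\n".join(json.dumps(r) for r in (old_store_row, new_store_row))

def ingest(line):
    """The server reads fields by name and tolerates ones it hasn't seen before."""
    rec = json.loads(line)
    return rec["store"], rec["sku"], rec["qty"], rec.get("promo_code")

for line in upload.splitlines():
    print(ingest(line))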

    2. Keep a backup of the data at the remote location until the remote location receives a well-formed acknowledgement from the central location. Just because the remote location sends up the data doesn't mean it's captured; a lot of things can go wrong, and it makes life much easier if you can recover the data at the remote location later. Furthermore, PLAN to miss some uploads; it's just going to happen sometimes, and if your system is designed to handle that fact, you'll be much better off.

    3. Document your data formats - the format at the remote location and the central location (presumably those are relational databases and can be documented with detailed ER diagrams) and the format used for transmission. Troubleshooting is a pain in the ass if you are unsure about how to map a transmission field to a remote or central DB field. Seems like a no-brainer, but if you don't do it from the beginning you're not ever going to do it.

    4. You might want to give your remote locations random upload times within the overall upload window so as not to "slashdot" yourself. If you want the remote locations to upload between 1 AM and 4 AM, write a script, scheduled in cron, that triggers the upload at some random time during that period. If all your remote locations start uploading at 01:00:00.00, you're going to have trouble scaling.
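
    A tiny sketch of the jitter idea: cron still fires at the start of the window on every site, but a wrapper sleeps a random number of seconds before kicking off the upload (the start_upload stub is invented, and the window length is an assumption):

import random
import time

WINDOW_SECONDS = 3 * 60 * 60   # cron fires at 01:00; uploads spread out over 3 hours

def start_upload():
    """Stub for the real upload kick-off."""
    print("uploading now")

delay = random.uniform(0, WINDOW_SECONDS)
print(f"sleeping {delay:.0f}s before uploading")
time.sleep(delay)   # when testing, shrink WINDOW_SECONDS so this returns quickly
start_upload()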

    This kind of thing isn't rocket science, but once you start down the wrong path, it's easy to get entrenched to the point where it's easier to live with your mistakes than to take corrective action.
  • I work with a retail POS system that pushes updates around at night. I've noticed that there is a problem in knowing whether or not the remote servers are up to date. So, for a new development, I would recommend that part of your update scripts send some sort of table statistics, like
    count(*) where date > yesterday
    so that you have some idea whether everything is in sync. In a similar vein, mark records as "dead" somehow rather than trying to push around record deletions -- it's a lot easier to tro
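
    A sketch of both ideas together, with sqlite3 standing in for the POS database (the schema is invented): the nightly update ships a simple count of recently updated rows so the two ends can be compared, and deletions are just a flag.

import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE items (
        id      INTEGER PRIMARY KEY,
        updated TEXT NOT NULL,              -- ISO date of last change
        dead    INTEGER NOT NULL DEFAULT 0  -- soft-delete flag instead of DELETE
    )
""")

today = date.today().isoformat()
conn.executemany("INSERT INTO items (updated) VALUES (?)", [(today,)] * 4)

# "Delete" a record by marking it dead; the row (and its history) sticks around.
conn.execute("UPDATE items SET dead = 1 WHERE id = 1")

# Cheap sync check shipped along with the nightly update: count of recent rows.
yesterday = (date.today() - timedelta(days=1)).isoformat()
recent = conn.execute(
    "SELECT COUNT(*) FROM items WHERE updated > ?", (yesterday,)).fetchone()[0]
print("rows updated since yesterday:", recent)
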
  • I say that 'standard' databases are 'deathtraps' for the simple reason that the data is stored in a (usually) proprietary format that cannot be easily reverse engineered in case of database corruption, whether from hardware failure (a hard disk going bad) or some sort of software failure (a virus infecting the software and causing it to introduce corrupt records into the database).

    Why not use structured, totally readable flat ASCII files (meaning all binary numbers are stored as bulky, readable ASCII s
    • This was actually my first thought. Then I thought of using an XML file. Then I thought "I'm new at this - there has to be some reason NOT to do that. Post to /. and see what people with more experience think."

      I like your idea. My question: does your suggestion come from field work, or are you, like me, looking for a better solution yourself and maybe, also like me, unaware of the drawbacks?

      Thanks for the reply, BTW.
