When is Database Muscle Too Much?

DBOrNotDB asks: "At some of the places I've worked in the past, there have been DBAs who generally insisted that given accurate specifications and enough hardware and software, you could stuff nearly anything you wanted into a database, manipulate it, and pull it back out again in a reasonable time. The feeling at my current workplace seems to be that very few projects lend themselves to database usage and that a customized one-off data storage solution should be developed for each project. This seems like a violation of many major software engineering principles (e.g. reuse) to me. My question is: what kind of success or horror stories does the community have about trying to put different projects into databases? Numbers (# of rows, tables, total data storage, cost, etc.) would be nice, but even just anecdotes would be helpful."
  • by dacarr ( 562277 ) on Tuesday October 29, 2002 @05:41PM (#4559191) Homepage Journal
    Slightly off topic as it were, but I've noticed that a lot of people seem to think that Excel works very nicely as a database. In some cases this might be true, but the bigger you get the more problems you have, and I just can't seem to convince those of a less-than-technical mind (read: management) otherwise.
    • My company is the same way! EVERYTHING is in an Excel spreadsheet; it's absolutely ridiculous. After showing them what Access was and how it wasn't too difficult to use, they thought I was brilliant. Of course, a few key people thought it was too "complicated" and back to spreadsheets we go
    • by leviramsey ( 248057 ) on Tuesday October 29, 2002 @06:02PM (#4559370) Journal

      Doesn't Google run off of a huge Excel spreadsheet?

    • I used to work at the pension co for a big name company. They had an Excel spreadsheet for each person. Each spreadsheet was 9 MB compressed with WinZip, and something like 15 MB or so uncompressed, I think. I thought they were sorta stupid for not using databases, too, and it seems like I was right.
    • In my experience this occurs in outfits where the people who hold the purse strings are not also the people who understand computers. They may have no experience of what real systems can do for them.
      I used to work at a place where mission critical data was stored in a massive Paradox database to which all these crazy jarheads had admin rights. Just stupid. You ended up with thousands of copies of tables and no-one knew which one was the right one to use. Then they decided to go SQL Server and what did they do? Remodel the data structure? No. They just copied the damn Paradox tables one for one.
      They used to back up the c:\ drives of every machine in the office every lunchtime in that place.
      Dickheads.
    • Well, actually the ability of Excel to act as a primitive database was one of the reasons it has wound up being so popular.

      Joel Spolsky worked at MS on the Excel team and points out that in the user studies they did, the ability of Excel to record data in this way was important to its adoption. Check out the chapter [joelonsoftware.com] from his excellent book User Interface Design for Programmers and search for Excel.

    • but I've noticed that a lot of people seem to think that Excel works very nicely as a database.

      Just the opposite of what we have here. Folks here seem to think that Access is just a wonderful spreadsheet program. (4 databases, 1 table each, 50+ columns)
  • One company I worked for contracted out an application we had to build to a provider who brought in this crazy, disheveled, brilliant Russian database engineer. I remember that with every issue that would come up he would say, "Is database application" and go off muttering to himself. Content management software needed? "Is database application." File system problems? "Is database application." New mouse drivers? "Is database application." What to order for lunch? "Is database application."

    The moral of the story? Any computer application is built most quickly and easily using a database solution.
    • "When all you have is a hammer, everything looks like a nail"
    • It has less to do with databases than with tables. A database is just a very structured and very good way to handle tables.

      If you can describe things in tables, put it in a database.

      I think that the main reason people use Excel is that you need no programmer in between. If you want to use a database, then you need to create new interfaces for every table and view that you add. This calls for someone who understands databases and interface design, and knows how to get information out of people, etc...

      Btw, isn't the S/400 filesystem a DB/2 database?

      • "Btw, isn't the S/400 filesystem a DB/2 database ?"

        Well, to oversimplify, yes. But only part of it.

        AS400 / I-Series computers actually have multiple file systems, arranged hierarchically under what they call an "Integrated File System". Some newer file systems will hold any old blob of data in a file, whereas the original file system is basically inherited from the old S38 library system.

        Basically, under ROOT there's a branch that follows OS/2 or NT file naming conventions; under QOPENSYS, a branch that emulates the Unix file naming system; etc. These have directories and files, as you'd expect in most OS filesystems. But QSYS.LIB is special. It's a library, not a directory. On the AS400, libraries are quite a bit more structured. Among other things, a library can act as a database "collection".

        Some objects in the libraries are DB2 tables, views, indexes, and transaction journals. Others are stored procedures, queries and application programs.

        All other libraries live in QSYS.LIB, and no other library can contain another library.

        Now, since the S38 predated the DB2 product, and for that matter SQL, there are many database tools on an AS400 that no other DB2 platform has. And these tools are useful. There used to be a couple of logical file tricks (indexed views) you could do with DDS that SQL couldn't, for example. But most tools from the mainframe and unix versions have been ported to the I-Series.

        Neat boxes, those AS400. Single-sourced as all hell, though.
  • afaik... (Score:3, Interesting)

    by pizza_milkshake ( 580452 ) on Tuesday October 29, 2002 @05:52PM (#4559269)
    i used to be a big php fan (still a fan, but not a big one) and i was always surprised/dismayed that everyone wanted to store images for web-based applications (catalog-style images, user-submitted icons/graphics, banners, etc) in a database.

    i was always under the impression that the filesystem was a better place for this, assuming the directory structure was simple or fixed (i.e. you wouldn't be creating thousands of subdirs dynamically). why store all your banners in a table as BLOBs when you can simply have a web-accessible directory and store them there?

    i never really found a clear reason on which was better performance-wise, though i suspect the filesystem-based way is. i also found it to be less of a hassle to implement. any intelligent thoughts?

    • I've heard anecdotal evidence saying that the filesystem is faster and less processor intensive than storing binary data in the database. That said, I usually store images in my databases. :-)
    • Filesystem is faster. Mysql's page about optimization also says this. I guess database makers don't want you to know this because they want their database used more.
    • Re:afaik... (Score:4, Insightful)

      by Xunker ( 6905 ) on Tuesday October 29, 2002 @06:12PM (#4559455) Homepage Journal
      I'm running a moderately large site that deals heavily with images, and at first I thought it would be great to store all the images in the database along with the incidentals. Happily I had pause for thought before the site went big, and decided to go with filesystem storage for the images instead, and I'm glad I did:

      * Size: I didn't anticipate the user would upload 5 gig of images
      * Access: You need some sort of extraneous code to pull those images from the database
      * Communication: your code must know how to fake being a proper image (right headers, creation/change times, etc)
      * Size 2: A lot of databases can't store high-res images (read: large images) in a database without serious penalties (like chopping them up into little bits to fit into a MySQL bigblob)

      Speed can be addressed in the perl world via FastCGI or mod_perl, and in similar ways on other platforms, but you'll still have to do disk reads to get the data, and you'll pay a price for having one more often-used script in memory.

      Storage size can be counteracted with clever tricks or "professional grade" (read: expensive) DB Engines like Oracle and DB2 which have binary data storage as one of their features, but you'll need to pay money and have a big muscly machine to run 'em.

      For my money, 90% of image serving can and should be done from the filesystem, because that's what web servers are made for. The other 10% are weird meta things like this [picpix.com] that could use the file system but are designed to use a database.

      My big thing is using your head and asking yourself: what will be easier in the long run? Sure, binary data slung around with Perl DBI sounds convenient, but how convenient is it to run "isamchk" on a 100 GB database, eh?
      • Huh? As far as I know, picpix stores the images on the filesystem (and not in the database). The database is used to track/manage the images, though..

        Same goes for similar applications like Gallery [sourceforge.net], whose newest version (v2.0, in progress) will use SQL to drive the backend and will rely on your filesystem/webserver to serve the images.
        • Ah, I could've been wrong - I haven't followed fotobilder dev for quite a while, and I know in the early, early days a pure database version was on the table, but it makes sense, with the number of images it's designed to store, that it uses the file system for the actual binary data.
      • Happily I had pause for thought before the site went big, and decided to go with filesystem storage for the images instead, and I'm glad I did

        Why not do both? Oracle has a datatype called BFILE, in which the actual data is stored in the filesystem, and the row in the table contains a pointer to it. You have the best of both worlds, filesystem access to the image if you want it, or database access, and you can very easily integrate the image with the rest of your relational data.
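
        To make that concrete, here is a minimal sketch of the BFILE idea (Python with the cx_Oracle driver; the table, directory, and connect string are invented for illustration, not taken from the post):

            # Hedged sketch: the row stores only a BFILE locator; the image
            # bytes stay on the filesystem. All names here are illustrative.
            import cx_Oracle

            conn = cx_Oracle.connect("scott/tiger@localhost/orcl")
            cur = conn.cursor()

            # Point Oracle at the directory the web server already serves from.
            cur.execute("CREATE OR REPLACE DIRECTORY img_dir AS '/var/www/images'")
            cur.execute("""
                CREATE TABLE images (
                    id      NUMBER PRIMARY KEY,
                    caption VARCHAR2(200),
                    img     BFILE        -- pointer to a file, not the bytes
                )""")

            # The row holds a locator; banner.png itself never enters the DB.
            cur.execute("INSERT INTO images VALUES "
                        "(1, 'front-page banner', BFILENAME('IMG_DIR', 'banner.png'))")
            conn.commit()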
        • Re:afaik... (Score:2, Funny)

          by Xunker ( 6905 )
          Ahem.. you assume I can AFFORD Oracle! I had to sell a kidney to buy winter tires for my car this year!
    • Re:afaik... (Score:2, Interesting)

      Yes, this is definitely bad. Images are typically stored in the db as Binary Large OBjects (BLOBs), which no database system on the planet is good at retrieving quickly. Also, updating them in the database can take longer.

      The real killer in database performance comes in two places, large complex joins and full table scans. Eliminate these two things in your db and you should never have scaling problems. To do this, watch your long running queries and make sure they have the proper indexes on the tables. And make sure you keep your statistics updated. But please, for the love of God, don't listen to those moron college profs who say normalization is key. That's all good and fun until you have a million records in each of four tables you need to join to provide a solution. That, or your data model charts take up 300 sq. ft of wall space.
      • Re:afaik... (Score:4, Informative)

        by Rick the Red ( 307103 ) <Rick.The.Red@gma i l .com> on Tuesday October 29, 2002 @06:31PM (#4559621) Journal
        But please, for the love of God, don't listen to those "real world experience" morons who say denormalization is key. That's all well and good until you've got the same information replicated to hell and gone and records start to disagree: Is the billing address 123 North Main or 1313 Mockingbird Lane? Half the invoices for this customer show one, half the other.

        If your normalized tables take a performance hit, buy a bigger box. If you munge the data with replication, you're screwed.

        • Re:afaik... (Score:3, Insightful)

          by GLHMarmot ( 124846 )
          But please, for the love of INSERT DEITY, don't listen to those polarized, my-way-is-the-only-way INSERT INSULTs.

          There is always a case for normalization or denormalization. I have developed many different databases of various sizes (up to 10 terabytes), and as a rule I try to be as normalized as possible. I have worked on some data conversions that were rather messy due to denormalization. However, I can't think of a single database where there wasn't some type of denormalization for various reasons, from speeding up query results to the client's demands.

          Yes, I could have recommended that my clients buy bigger hardware but when a summary table can be used instead of spending $X thousand dollars, what do you think you would do?

          • Yes, I could have recommended that my clients buy bigger hardware but when a summary table can be used instead of spending $X thousand dollars, what do you think you would do?


            I would use an Oracle Materialized View and get the speed increase without the extra dollars and without needing the extra hardware. Work smarter, not harder.
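
            For illustration, a hedged sketch of that approach (Oracle SQL issued through Python's cx_Oracle; the view and table names are invented):

                # The database maintains the aggregate itself; no hand-rolled
                # summary-table batch job. All names here are illustrative.
                import cx_Oracle

                conn = cx_Oracle.connect("scott/tiger@localhost/orcl")
                cur = conn.cursor()
                cur.execute("""
                    CREATE MATERIALIZED VIEW monthly_sales
                        BUILD IMMEDIATE
                        REFRESH COMPLETE ON DEMAND
                        ENABLE QUERY REWRITE
                    AS
                    SELECT TRUNC(order_date, 'MM') AS month,
                           SUM(amount)             AS total
                    FROM   orders
                    GROUP  BY TRUNC(order_date, 'MM')""")

                # Re-run the aggregation when the base table has changed:
                cur.callproc("DBMS_MVIEW.REFRESH", ["MONTHLY_SALES"])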

    • Sometimes, well-thought-out denormalisation can make a huge, positive impact on an application. Yes, it can be difficult to make sure you don't have any anomalies in your data, but with a rigorous design and development methodology, these problems can be minimised.

          Don't denormalise for the sake of denormalising - the trick is to know when to break the rules and to do so very carefully.

          Denormalisation is only one tool that can be used to improve the performance of a system, and of course, other options like more memory, faster CPUs and better code should be addressed first.
        • I work on a program with a quite denormalized database. The thing is simple. To retrieve some kinds of data you need to do some fairly long SELECTs, so that data is duplicated. This is things like the last vendor who shipped a product, and the last date of shipment, for example. Individually this isn't that slow, of course, but we need to generate this information for every product sometimes. All this data can be easily recalculated, and I have a program exactly for that. So really there's no consistency issue unless there's a bug, and a bug fix and a pass of the program should take care of that.
      • But please, for the love of God, don't listen to those moron college profs who say normalization is key. That's all good and fun until you have a million records in each of four tables you need to join to provide a solution. That, or your data model charts take up 300 sq. ft of wall space.

        Good DBMSes can break complex joins down, catching the criteria piece by piece. You can also create run-time extracts which are used by real-time / near-real-time systems for read access. However, your advice is simply terrible. Once you lose normal form you lose the associative law on your table algebra. That means join operations are not defined independently of the order they are performed in, and that is very bad. Rick mentions an example of this in terms of addresses, but it can get far worse.

      • Why are there so many anti-university posts on slashdot?
        In our database course we were taught very clearly why and why not to normalise, and how much.

        Are US universities really as bad as most /.'ers make out? Move to england if so :)

        I think the reason is that you have
        a) The person who is good at computers, so skips uni and goes to work
        b) The person who isn't that good, goes to uni then goes to work
        c) The person who is good at computers, goes to uni, and then goes to work.

        It seems ppl in group A keep meeting ppl in group B. Perhaps this might be because ppl in group C work in jobs that ppl in group A don't, so they never meet.

        Thoughts?
      • But please, for the love of God, don't listen to those moron college profs who say normalization is key. That's all good and fun until you have a million records in each of four tables you need to join to provide a solution. That, or your data model charts take up 300 sq. ft of wall space.

        Skip normalising your database at your own peril!

        Learn how to properly normalise a database (3NF is usually good enough) and then learn how to write decent queries and tune your indexes appropriately. In some circumstances it may be worthwhile caching some data in extra tables which are refreshed periodically. In even rarer circumstances it may be necessary to denormalise the database, but always normalise it first, and only denormalise if you encounter some intractable performance issues.

        We have some incredibly complex queries at my current client and we have managed to gain performance improvements by re-writing certain queries. Splitting complex queries up and using temp tables, derived tables, sub-queries, etc can help and you will have less of a problem with data anomalies which can creep into a denormalised structure.
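
        As a minimal illustration of that advice, a sketch using Python's bundled sqlite3 module (the customer/invoice schema and names are invented):

            import sqlite3

            db = sqlite3.connect(":memory:")
            db.executescript("""
                -- 3NF: the billing address lives in exactly one place,
                -- never repeated on each invoice row.
                CREATE TABLE customers (
                    id      INTEGER PRIMARY KEY,
                    name    TEXT NOT NULL,
                    address TEXT NOT NULL
                );
                CREATE TABLE invoices (
                    id          INTEGER PRIMARY KEY,
                    customer_id INTEGER NOT NULL REFERENCES customers(id),
                    total       REAL NOT NULL
                );
                -- Index the join column so the normalized join stays fast.
                CREATE INDEX idx_invoices_customer ON invoices(customer_id);
            """)

            # The join that denormalization would "save" is cheap when indexed.
            rows = db.execute("""
                SELECT c.name, c.address, i.total
                FROM invoices i JOIN customers c ON c.id = i.customer_id
                WHERE c.id = ?""", (42,)).fetchall()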
        • Amen. What the parent poster didn't realize is that when you're dealing with 3 tables, one of which has 52 million records in it, joins are the non-option. 4-day queries are not permitted. We get around it by denormalizing to certain degrees (certainly not rigorously all the way to 3NF), using tmp tables, being as religious as possible about indexing... all with mysql, baby.

          > a million records

          That's chump change! Try 50 million and then we'll talk database ;)
          • What the parent poster didnt realize is that when youre dealing with 3 tables, one of which has 52 million records in it, joins are the non-option. 4 day queries are not permitted.

            Just a quick check of one of our systems shows a table with 1.5 million records. Not a large table, but not trivial either. Due to the nature of the data and the structure of the database, it is necessary to routinely execute queries containing 10-12 self-joins. One report requires 16 joins in total. The slowest of these queries takes about 30 seconds; most are well under 2 seconds.

            Sure, we could denormalise our data to make the queries simpler (multi-table joins scare DBAs for some reason) but we would lose a lot of the flexibility that our design affords us. We have no need to denormalise this database because we wrote more efficient queries and have paid careful attention to our indexes.

            The largest table in our production database is under 10 million rows. We have simulated much larger sizes in our test environments though, and found our joins are not a problem. The performance with 10 times the data is still well within acceptable range for our requirements. (Most stored procedures execute under 5 seconds with this size.) If you have problems with a three table join, head back to school and learn how to do it properly! ;-)

            You say you are as religious as possible about indexing, but what about the structure of your queries? Are you indexing the right columns? You are using temp tables? Try getting rid of them - in many cases they slow things down. Make sure your queries are sargable. Make sure you are limiting the intermediate result-sets to be as small as possible. Understand the query execution plan. Make sure your server has enough memory! Are your server's settings tuned for your particular application? There are many, many things to pay attention to, but you should not be having any problems with a 52 million row table in a three-table join!

            When you start doing real joins with 10 or more tables (each with >1 million rows) then we'll talk about how to denormalise the data to improve performance.

            8-)
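
            A small sketch of the "sargable" point above, using sqlite3's EXPLAIN QUERY PLAN (table and column names invented):

                import sqlite3

                db = sqlite3.connect(":memory:")
                db.executescript("""
                    CREATE TABLE orders (id INTEGER PRIMARY KEY, ship_date TEXT);
                    CREATE INDEX idx_orders_ship_date ON orders(ship_date);
                """)

                # Non-sargable: wrapping the column in a function hides it from
                # the index, so the planner falls back to a full table scan.
                print(db.execute(
                    "EXPLAIN QUERY PLAN SELECT * FROM orders "
                    "WHERE strftime('%Y', ship_date) = '2002'").fetchall())

                # Sargable: a bare range predicate on the column uses the index.
                print(db.execute(
                    "EXPLAIN QUERY PLAN SELECT * FROM orders "
                    "WHERE ship_date >= '2002-01-01' "
                    "AND ship_date < '2003-01-01'").fetchall())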
          • Been there, done that. Joins aren't a problem with the right indices.
      • But please, for the love of God, don't listen to those moron college profs who say normalization is key.

        This is a horrible approach to analyzing any problem. Normalized and denormalized data both have their place in today's RDBMS-driven world. Take any one of the large ERP packages available today: Oracle Financials, PeopleSoft, SAP, etc.

        When working with specific data, for example Accounts Payable data, you really don't want to duplicate all of that customer data again and again and again for each row in the database, hence you normalize it. Yes, you pay a bit of a speed penalty when joining against CUSTOMERS, CUSTOMER_ADDRESSES, INVOICES and INVOICE_LINES. In reality, that difference is never larger than a few seconds for large tables (read: 10 million+ records) when using a properly optimized (read: good index scheme) set of normalized tables.

        RDBMSs work very well when finding the 10 rows out of those 10 million that fit your search request. Where they puke is trying to manipulate 50% or more of the data contained in multiple tables: the Data Warehouse/Data Mart.

        Those same Accounts Payable tables make reporting a real pain when they're normalized, so you go through a denormalization (or summarization) procedure to fill out your reporting infrastructure. Pre-summarize your data into a single row with multiple 'buckets' for every strange way you want to view the numbers. Duplicate data on every row and get those 10+ million records per month down to a few hundred thousand at most.

        If you have to write custom programs in C, Java, Perl or your language of choice in order to operate on an exported version of the data, so much the better; the database won't perform as well when acting on every row in the database as a program optimized to summarize it. Once that's done, load it back in.

        By using this approach, you're able to use the best of both the normalized and denormalized approaches and satisfy both the data entry clerks -- because the data is entered quickly -- and the managers -- because they can get virtually any report in under 30 seconds.

        Sweeping comments that suggest that one method or another is bad in all cases just screams: "Look at me! I don't know what I'm doing but I'm going to tell you that what you're doing is wrong, anyway!"
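
        To make the summarize-for-reporting step above concrete, a hedged sketch (Python's sqlite3; the schema and bucket choice are invented):

            import sqlite3

            db = sqlite3.connect(":memory:")
            db.executescript("""
                CREATE TABLE invoice_lines (
                    invoice_id INTEGER, customer_id INTEGER,
                    month TEXT, amount REAL
                );
                -- One pre-aggregated row per customer per month: duplicated,
                -- denormalized, and exactly the shape the report needs.
                CREATE TABLE ap_monthly_summary (
                    customer_id INTEGER, month TEXT,
                    total REAL, line_count INTEGER
                );
            """)

            # The periodic summarization pass: run it off-hours, then report
            # against the small table instead of the multi-million-row one.
            db.execute("""
                INSERT INTO ap_monthly_summary
                SELECT customer_id, month, SUM(amount), COUNT(*)
                FROM invoice_lines
                GROUP BY customer_id, month""")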

    • i never really found a clear reason on which was better performance-wise, though i suspect the filesystem-based way is. i also found it to be less of a hassle to implement. any intelligent thoughts?

      Serving images from the filesystem is always faster than serving them from a database, since the database is also on the filesystem. (If you think that the DB could cache the results and therefore be faster, just serve your most frequently requested images from a filesystem in RAM (a RAM-disk) -- of course you can't cache more than what would fit in RAM, no matter if you do it yourself or with the DB.)

      Using a DB for serving images can make sense when you serve different images in different cases, like banners, where you want to have simpler CGI scripts (it's a matter of convenience, not performance). But using a database doesn't have to mean storing the images in the database.

      If I found out that serving some of my images would be easier (and that means less error-prone, better to maintain, etc.) if I used a database, I would do something like this: I'd store the digests of my images in the DB (a binary MD5, for example -- just 16 bytes per image). Then, when my CGI script gets the digest, it would use it to find the image on the filesystem.

      So, if it gets 9743a66f914cc249efca164485a19c5c it serves /images/97/43/a66f/914cc249efca164485a19c5c.png (how you split the digest into directory levels would depend on the filesystem and the number of images, of course; this is just an example. One could also use fewer than 128 bits of the digest -- 96 or 64, depending on the number of images -- to get shorter paths). The .png suffix could also be stored in the database, to allow easy use of a few different image formats.

      (The DB could store more human-friendly paths instead of message digests of the images, but that would need more human interaction -- it's probably a matter of taste).

      This way you can still have a cluster of very simple static image-serving servers in the future, while having the benefits of a database. The database traffic is also much, much lower. The only difference in your scripts is that you send a redirect instead of the actual content, which is even easier.

      The most important benefit, however, is that you can have static images, database and CGI frontends split onto three independent machines, or even groups of machines, when your traffic becomes too high for an all-in-one solution. Because with DB-only BLOB images you had better have lots of money for a database cluster (and DB-CGI bandwidth for the redundant internal traffic). Using database-stored images when all you need is easier searching with SQL queries is, in my opinion, just using the database as an expensive filesystem.
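
      A runnable sketch of the digest-to-path scheme described above (Python; the split depths are illustrative, as the post itself notes):

          import hashlib

          def image_path(data: bytes, suffix: str = ".png") -> str:
              digest = hashlib.md5(data).hexdigest()  # 32 hex chars = 128 bits
              # Split the digest into directory levels: 97/43/a66f/<rest>
              return "/images/{}/{}/{}/{}{}".format(
                  digest[:2], digest[2:4], digest[4:8], digest[8:], suffix)

          # The digest from the post, 9743a66f914cc249efca164485a19c5c, maps to
          # /images/97/43/a66f/914cc249efca164485a19c5c.png under this scheme.
          print(image_path(b"raw image bytes here"))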

    • any intelligent thoughts?


      They are preparing for the new SQL based FS from MS....
    • PostgreSQL supports BLOBs, but the project itself suggests storing the names in a table and the binary objects in the filesystem.

    • The only reason why I could see a RDBMS advantage to image storage is that you can (usually pretty easily) change the block/page size for storage/retrieval to improve fetching of large chunks of data. Usually you can also have finer-grained memory control to ensure they are forced in memory.

      That said, I think it is far wiser and much more cost effective to store 'dumb' (non-relational) content like images on the web/app server.
    • i never really found a clear reason on which was better performance-wise, though i suspect the filesystem-based way is. i also found it to be less of a hassle to implement. any intelligent thoughts?
      The only intelligent thought I can add to this is: sometimes performance isn't the only factor. Yes, of course--all things being equal-- the filesystem is faster, but filesystems can't enforce data integrity constraints. I agree that it is ludicrous to store non-critical images in the database, such as standard website graphics, photo albums, etc... BUT, if you are creating a document management system, and you want to make sure that scanned document X cannot be deleted unless user Y has viewed it, then a database is often the most sensible solution. Otherwise, your code will have to enforce constraints in two places. And what happens if the IT dept wants another application to play with the same data? If it's all in the database, and your constraints are in the database, then you sleep a little easier.
    • That's why your database is an excellent place to store the metadata - such as captions, which page they appear on, what the thumbnail is, and so forth - with a pointer to a location on the file system where the image resides. Best of both worlds.
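
      A minimal sketch of that metadata-plus-pointer arrangement (Python's sqlite3; the schema is invented):

          import sqlite3

          db = sqlite3.connect(":memory:")
          db.execute("""
              CREATE TABLE images (
                  id         INTEGER PRIMARY KEY,
                  path       TEXT NOT NULL,  -- where the web server finds the file
                  thumb_path TEXT,
                  caption    TEXT,
                  page       TEXT
              )""")
          db.execute(
              "INSERT INTO images (path, thumb_path, caption, page) VALUES (?, ?, ?, ?)",
              ("/images/cats/tabby.png", "/thumbs/tabby.png", "Our tabby", "/gallery/3"))

          # Queries stay relational; the image bytes never touch the database.
          (path,) = db.execute(
              "SELECT path FROM images WHERE caption LIKE ?", ("%tabby%",)).fetchone()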

  • You obviously are working with morons. Very few data-oriented applications need to write their own data-stores. Almost anything you can imagine (complex relational data, object-oriented data, xml stuff, photos, video footage, 3/4D spatial data, etc, etc..), someone has written database software tuned for it. Use it and be happy.
  • A custom DB will be faster than a general purpose DB (by definition).

    A relational DB offers great flexibility (pull any data from the database, add most any index to greatly improve the performance, etc).

    I would base the decision mostly on the interfaces to the application. I've worked on applications where the company preferred to access the SQL database directly for reporting, and I've worked on projects where the only interface to the application was via HTTP/XML. In the latter case, no one cared how we stored the data, so we dumped XML into a filesystem.

    Joe
    • by joto ( 134244 ) on Tuesday October 29, 2002 @06:30PM (#4559617)
      Experience tells me exactly the opposite. A custom DB will perform better, until you actually start to fill it with lots of data. When that happens, you will find that the many man-years spent developing the expensive proprietary DBMS systems actually resulted in something better than what you could quickly hack together over a few days.

      That being said, there are still lots of valid reasons not to use a real DBMS for every small project. The most important is simplicity. Bringing in hundreds of megabytes of third-party software to store a few kilobytes of data is not only overkill, it's also a maintenance nightmare!

      • The maintenance nightmare is the best point; why spend x man-hours writing a custom data store when there are thousands of man-hours of experience in Oracle/PostgreSQL/MySQL/MS SQL/whatever? You have access to tried & tested code under a huge variety of circumstances.

        Take the time saved by using an out of the box solution (i.e. the DB) and buy a bigger box to handle any inefficiencies you get by using a DB.

        • Because then you'd have to understand Oracle/PostgreSQL/MySQL/MS SQL/whatever to be able to fix it. Furthermore, you need it to be available to deploy it. And you rely on continued support from the company/individuals in question for it to last more than a few years. If it can be done with a simple text-file, or as a hierarchical database using the file-system, that will make the product simpler, smaller, and less reliant on third-party software. Sometimes reinventing the wheel is a good thing, if the wheel is sufficiently simple...
          • Having played with each of the methods you describe above, I can't imagine how to make data storage simpler than SQL (except for a better relational language, perhaps). Hierarchical data storage is a disaster waiting to happen. (And I mean XML, too).

            I mean, of course anything can be argued to the point of stupidity. If you just want to store a few flat lists of items, then sure: use a text file, but once you go beyond that (and you always will), standardization is your friend.

            And we're not talking about hundreds of megabytes of 3rd-party software. Really, the PostgreSQL install footprint is a few megabytes. Or try Interbase/Firebird, which is only about a 3 MB download these days. Both of these DBMS's use ANSI standard SQL, so if you just learn SQL once, you pretty much can use the basics anywhere.

            Of course, different DBMS's have different advanced/complex capabilities, but by the time you start needing advanced capabilities, it is still quicker to learn a DBMS than create your own advanced data storage.
          • I'd rather rely on the availability of Oracle DBAs than the availability of a contractor who hacked together their own DB substitute.

            If the problem is sufficiently simple that a full RDBMS is overkill, some db variant is perfect and ships on every *ix.

  • I have to agree that database reuse is among the most essential parts of running a profitable business. I've worked with all sorts of RDBMSes, from MS to Oracle to PostgreSQL, on everything from the lowliest hand-me-down Linux server to top-of-the-line Big Iron, and I can tell you that any modern database is going to be able to take whatever you can throw at it. I like being able to whip out whatever data we have, shove it in and pull it out again, repeatedly and at a moment's notice. It's this kind of flexibility that makes us keep coming back to database systems in the first place.
  • When you work in an environment where dog + uncle is either an Oracle DBA, MCSE or VP programmer, you wind up with projects like this: import random length string data from an embedded device into Oracle table, write some really convoluted stored procedures to parse the string data, write a nifty VB program to fetch the results for display to the users, who don't edit the data, just view it.

    In short, you get some expensive licenses and a lot of work to manage what could be done with a short shell script.

    Anybody ever hear about: grep, cut, sed, awk, sort, uniq... God forbid Perl?

    When your only tool is a hammer...
  • by toybuilder ( 161045 ) on Tuesday October 29, 2002 @06:12PM (#4559459)
    Sure, there are times when writing an RDBMS-based solution seems like a big overhead. But there's a good reason for using RDBMS on projects that are likely to mutate and add new features over time, and/or have to interoperate with other programs and systems.

    On the other hand, if you just want to store a small array of data that fits in a 100-line text file, and the program is completely closed and self-contained, there's no need for the flexibility of an RDBMS.

    Imagine a business that has to "send and receive stuff"...

    If you're moving two or three little packages to nearby local area businesses, only, you can get by with a small car.

    But imagine you regularly ship objects large and small to locations local and international... Then you need an intermodal transportation system. Sure, your interface might be "the shipping guy", but the backbone of the transportation is heavy duty...

  • by ComputerSlicer23 ( 516509 ) on Tuesday October 29, 2002 @06:18PM (#4559515)
    I write a screen scraping application that downloads lots of pages (100K+ web pages a week). We write absolutely every page into the database. 5GB is enough to hold about two days' worth of pages, so we can't keep it all in our production database, especially because 95% of the data doesn't need to be stored except for auditing; but we keep it around in case a mistake is found or a new trend in the pages appears.

    The reason we use the database instead of the filesystem is dead simple. The database is god-like. I can do point-in-time recovery, and guarantee that the database is completely consistent with the recovery point. If I had all that in a filesystem it'd be harder. It means I have hot-rollover capability from server to server, without having to duplicate my filesystem from machine to machine; I just let the hot archive logs deal with that. It means I have one backup system and one failure recovery plan. It means all I have to monitor is the Oracle tablespace, to ensure I don't run out of space. It means when I say "commit", I can hold Oracle accountable for ensuring the data is there, rather than having myself held accountable by the management.

    If I was a good little boy and swallowed all the kool-aid, I'd use iFS (Oracle's Internet Filesystem) and it'd be all good. However, I don't; I just use a huge array of BLOBs in my Oracle database.

    Now that said, I have a remote filesystem that all of this data gets spooled to. Once spooled there, it gets written to CD. Once the CDs are written, they are used to find, compare, and, if they match, delete the blobs out of the database. The files are then deleted out of the spool. Duplicate the CD, compare the two, send one offsite.

    The other reason we use the database is that it's easier to deal with in our application, because writing a join against the filesystem is tricky...

    Kirby

    • The other reason we use the database is that it's easier to deal with in our application, because writing a join against the filesystem is tricky.

      I think this gets to the point.

      Namely, the hierarchical tree model of filesystems pales in comparison to the relational model of modern databases (first generation databases, way back in the day, were hierarchical).

      -- p
  • That's the first question that you should ask yourself when looking at going outside of the database for persistent storage. I would imagine that the data is quite important, and unless you want to write all the necessary functionality into your one-off data storage solution (which may or may not be slower than an RDBMS or ODBMS solution), you already have a pretty good argument for why a real database should be used. It sounds like you're working with a pretty uninformed technical staff, to be honest.
  • Perhaps it's just that I can't think "outside the square" or something, but I can't really think of ANY application I've worked on where you couldn't make a good case for storing the data in a database. In some cases I've used XML files (when there was only a very small amount of data to be stored) but anything bigger than that I've always stored in a database. Perhaps under some circumstances (that I can't think of right now) you _might_ want to roll your own storage system... but I think these sorts of projects would be the exception (handled of course!) rather than the rule.
    • There are some types of data that just can't be adequately stored in an RDBMS without serious kludging, and even then it won't function optimally. Bibliographical data is an example; some people I know wrote a book cataloguing app, and from all the research they did they found out an RDBMS wouldn't cut it. There's a data format specially created for this purpose and that's what they used, though I can't really remember what it was.
  • In one scenario, you maintain business logic, info storage, display, and all sorts of other crap. In the other, you only maintain logic and display. It's easier to force your customers into lockin with some proprietary mish mash, but there are others (like myself) who will turn you down flat.

    Internal stuff is similar. Do you want to maintain EVERYTHING, or just half of it?

  • Pluses: the database takes care of synchronization, and nearly takes care of backup/recovery. Very nice. Some DB rigor may rub off on your designers.

    Minuses: the DBM is large (in MB, in install/config requirements, and in CPU usage) and your customer may not be running the DBM brand/version which you have tested your app with. Supporting multiple DB vendors is a pain. SQL is sort of standard, but the table definitions tend to vary. Ick.
  • I have found from the projects I worked on for my company that business applications almost always need some sort of database. Databases are just easy to work with and are fast.

    For example, a recent project required storing 500,000 records (relating to cotton, btw) of around 15 columns each. Now think of that number and how long it would take to find a particular row or change a set of data. The MS SQL database we used, on modest hardware, was damn quick; quicker than could be imagined, in fact.

    Anyway, SQL is like so useful. With a properly structured SQL query you can do some amazing stuff, very quickly and with very little effort.
  • If the data is not critical or can be easily re-created, then a filesystem will suffice.
    But if the data is critical to the business, and/or not easily be re-created,
    the data should go into a real DB that is Managed Properly(tm).
  • Databases are more than just storage. Disregarding storage and retrieval, a good database has good design.

    If the database won't be designed properly (as in many just-get-the-job-done small businesses) then a specific application may be better. But, if someone will spend the time doing design, the database forces logic and structure onto the system. While this may be an annoyance to sloppy coders, this helps ease usage (because of strict guidelines) and understanding. Yes, understanding. There are times that data is to some extent known, but to a lesser extent understood. A decent database layout increases understanding as the objects and relations must be logical.

    I say this as a DBA. And, as a DBA, for good or for bad, there is hardly a project that wouldn't benefit from clear data definitions.
  • It really depends on what data you are storing. How much data, how critical, what are you doing with it, etc...

    Sometimes, Excel is good enough. Or XML. Or plain text files. Or a custom file format. etc...

    If the people working on your project have only ever worked with databases, they will want to use databases for everything. Most stuff will fit into a database, but sometimes it is not appropriate to do so - as others have mentioned here, storing images in a database is not always a good idea, but you would probably want to store the location of the images in a database.

    If you need to be able to ship the data around to different machines/offices/clients/over the net/etc, then maybe an XML file will be best. Custom file formats may be appropriate in some cases too. (Though I'd lean towards a more open file format.)
  • by Samrobb ( 12731 ) on Wednesday October 30, 2002 @12:17AM (#4561645) Journal
    The feeling at my current workplace seems to be that very few projects lend themselves to database usage and that a customized one-off data storage solution should be developed for each project.

    Huh? Do you create a custom C library for every application as well? How about a custom UI toolkit? Custom preprocessor/compiler?

    Sounds kind of silly, doesn't it.

    So why do these folks think a "customized one-off data storage solution" sounds any better? It's the same problem - you can either use something that's already been debugged, tested, and tweaked for performance, or you can spend your own time and effort to create it yourself. That's time and effort that could go towards coding and testing the final product, but is instead spent elsewhere (probably because someone thinks that using a database for storage would make the application "bloated").

    I think the problem is probably that when you mention using a "database", most people equate that term with "general purpose database server" (Oracle, SQL Server, Postgres, MySQL, etc.) There are libraries available that were specifically designed to offer programs lightweight database access without the pain of using a full-fledged RDBMS. Search Google for embedded database [google.com], xbase library [google.com], or open source database library [google.com] to start... there are any number of toolkits that will allow you to create a very customized storage solution without having to create "one-off" code for each and every project.
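
    As one example of the embedded idea (an editorial sketch; SQLite is one such library, used here through Python's bundled sqlite3 module):

        import sqlite3

        # The whole "database" is a single file shipped with the application;
        # full SQL, but no server process to install or administer.
        db = sqlite3.connect("app_data.db")
        db.execute("CREATE TABLE IF NOT EXISTS settings (key TEXT PRIMARY KEY, value TEXT)")
        db.execute("INSERT OR REPLACE INTO settings VALUES ('theme', 'dark')")
        db.commit()
        print(db.execute("SELECT value FROM settings WHERE key = 'theme'").fetchone())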

    • It really depends on what data is being stored and how much. For many, many applications a simple flat file using random access methods works just fine. For managing more data there are libraries such as you mention. And for some projects a full fledged RDBMS is the right way to go.

      The proper way to go about this is to analyse what needs to be stored and choose the solution that provides the best match based upon needed functionality, system use, and programming time.

      Sean.
      • The proper way to go about this is to analyse what needs to be stored and choose the solution that provides the best match based upon needed functionality, system use, and programming time.

        Absolutely. The way the question was worded, though, makes me think that their typical case is probably beyond the scope of doing anything easily with just flat files... otherwise, why the debate? You'd have to be a truly hardcore DB fanatic to argue that a DB is the ideal solution in all circumstances.

  • I would just like to point out that the filesystem is a particular case of specialized database, so this whole fs-versus-db debate does not make much sense. Some of its implementations may suck, but it's still a database that has a well-known, familiar interface (access mechanism). One can build a file store using a filesystem and replace it with a database-backed "file system" later if need be.

    And we should also keep in mind that while some databases do give killer performance, you have to pay a lot for it, while the ol' filesystem comes with the OS already.

    Now, for any other use than filesystem-related stuff, it does not make sense to try to invent your own small-scale storage mechanism when there are so many good, cheap/free database servers out there well-suited for the job.

  • I didn't see this mentioned anywhere, but just because you have a multi-proc system, it doesn't mean you have to run MSSQL on all the procs. You can purchase a single processor license of sql server, but be running it on a quad proc box. To be compliant, you just need to be sure you are setting the processors the service can use (er, it's on the 'Processor' tab under the sql server properties).

    This isn't necessarily a bad thing to do either. When you are having to be conservative with your cash, a lot of the times these boxes have to serve multiple purposes. Having the sql server running on only procs 3 & 4 would leave 1 & 2 available to do 'other stuff' (web services? perl scripts?).

    With SQL2k you can even have the development and the production sql server be the same system and generally not affect each other performance-wise when you are thrashing the procs. You just need to setup multiple instances of the service and assign each to separate processors (of course, they won't be completely autonomous since they *are* on the same box, but at least you won't get competition for the processor)

    In any case, I'd go with the quad proc box. Only get one CPU if you want. You can always add to it later and purchase further licenses *if* you need them.
  • by Karora ( 214807 ) on Wednesday October 30, 2002 @03:49AM (#4562492) Homepage
    I have worked as an application developer / designer with DBMS backed applications for the last 17 years. There are reasons for not choosing a database, but not usually very good ones.

    When you want speed and flexibility and scalability and reliability and extendability and particularly developer productivity you will undoubtedly end up shooting yourself in the foot later if you avoid some form of DBMS up front.

    Where you have a particularly well-defined, narrow functionality, and performance in a small footprint is a requirement, an RDBMS may not be such a good choice, but DB libraries like berkeley db can still be very useful.

    And with PostgreSQL [postgresql.org], Firebird [sourceforge.net] , MySQL and so many other free, open-source projects out there covering such a broad spectrum of needs for a database, why would you not use that expert work?

  • When you can imagine querying the data you are entering - you can't easily query images, or other binary data (although I guess there must be someone working on this problem somewhere...). If you can't query it, you should usually find a better place to store it - NAS is usually fine - and maintain a pointer to it (e.g. a filename). Yes, it's something that can get un-synced, but most databases suck when it comes to actually dealing with binary data, and you can use that capacity a lot more effectively elsewhere.

    When the structure of the data is likely to remain stable. If your application deals with well-understood entities whose properties are unlikely to change over time, a database is a great solution. Databases are, however, relatively change-resistant - it's typically a pain in the backside to change the datatype of a column, remove columns, etc. So, if you're working in a domain where you continuously learn new things about your core entities, or if your development processes are highly iterative, you might be better off using an alternative data storage mechanism.

    When more than a single user is likely to access the data - yes, you can create locking mechanisms yourself. You can also take your own garbage to the local dump. It's usually not a good use of your time, and the cost of not dealing with the issues involved is high, both for garbage and for concurrent access to shared data.

    When you require consistency across transactions - the good old ACID (atomicity, consistency, isolation, durability) principles, which become important for many non-trivial applications.

    If you care about enforcing rules of referential integrity - do you want to ensure that all the tracks in your record collection can be tied back to a recording? Do all orders have to have a customer? Those things are far simpler to implement with an RDBMS than in code.

    There are instances where using an RDBMS is not appropriate. Ones that spring to mind are:
    - your business domain is not well understood or is liable to rapid change. In this case, the cost of change for database objects is likely to be a problem - consider storing data in a self-describing format like XML.
    - the application domain doesn't lend itself to being described in relational terms - image manipulation tools, word processors, etc., which deal with mainly binary information, probably should not use a relational model for their core data structures.

    Alternatives exist - Object Oriented databases are becoming more and more popular. I have way too little experience with these to comment on their use.
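
    As a sketch of the referential-integrity point above (Python's sqlite3; foreign-key enforcement is off by default and must be switched on per connection, and the schema is invented):

        import sqlite3

        db = sqlite3.connect(":memory:")
        db.execute("PRAGMA foreign_keys = ON")
        db.executescript("""
            CREATE TABLE recordings (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
            CREATE TABLE tracks (
                id           INTEGER PRIMARY KEY,
                recording_id INTEGER NOT NULL REFERENCES recordings(id),
                name         TEXT NOT NULL
            );
        """)

        try:
            # No recording 99 exists, so the database itself rejects the
            # orphan track; no application code has to police this.
            db.execute("INSERT INTO tracks (recording_id, name) VALUES (99, 'orphan')")
        except sqlite3.IntegrityError as e:
            print("rejected:", e)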
