When is Database Muscle Too Much? 93
DBOrNotDB asks: "At some of the places I've worked in the past, there have been DBAs who generally insisted that given accurate specifications and enough hardware and software, you could stuff nearly anything you wanted into a database, manipulate it, and pull it back out again in a reasonable time. The feeling at my current workplace seems to be that very few projects lend themselves to database usage and that a customized one-off data storage solution should be developed for each project. This seems like a violation of many major software engineering principles (e.g. reuse) to me. My question is, what kind of success or horror stories does the community have about trying to put different projects into databases? Numbers (# of rows, tables, total data storage, cost, etc.) would be nice, but even just anecdotes would be helpful."
More of a "dilbert" story (Score:3, Funny)
Re:More of a "dilbert" story (Score:1)
Re:More of a "dilbert" story (Score:4, Funny)
Doesn't Google run off of a huge Excel spreadsheet?
Re:More of a "dilbert" story (Score:2)
No, they use pigeons! [google.com]
Re:More of a "dilbert" story (Score:1)
Re:More of a "dilbert" story (Score:2)
I used to work at a place where mission critical data was stored in a massive Paradox database to which all these crazy jarheads had admin rights. Just stupid. You ended up with thousands of copies of tables and no-one knew which one was the right one to use. Then they decided to go SQL Server and what did they do? Remodel the data structure? No. They just copied the damn Paradox tables one for one.
They used to back up the c:\ drives of every machine in the office every lunchtime in that place.
Dickheads.
Lol, I think I worked in the same place! (Score:1)
Lol, probably not but it's amazing what you see in the "real world" of corporate IT departments.
Re:Lol, I think I worked in the same place! (Score:2)
Re:More of a "dilbert" story (Score:2)
Joel Spolsky worked with MS on the Excel team and points out that in the user studies they did, Excel's ability to record data that way was important to its adoption. Check out the chapter [joelonsoftware.com] from his excellent book User Interface Design for Programmers and search for Excel.
Re:More of a "dilbert" story (Score:2)
Just the opposite of what we have here. Folks here seem to think that Access is just a wonderful spreadsheet program. (4 databases, 1 table each, 50+ columns)
"Is database application" (Score:2)
The moral of the story? Any computer application is built most quickly and easily using a database solution.
Re:"Is database application" (Score:2)
Re:"Is database application" (Score:1)
It has less to do with databases than with tables. A database is just a very structured and very good way to handle tables.
If you can describe things in tables, put it in a database.
I think the main reason people use Excel is that you need no programmer in between. If you want to use a database, then you need to create new interfaces for every table and view that you add. That calls for someone who understands databases and interface design, and knows how to get information out of people, etc...
Btw, isn't the AS/400 filesystem a DB2 database?
Re:"Is database application" (Score:1)
Well, to oversimplify, yes. But only part of it.
AS/400 / iSeries computers actually have multiple file systems, arranged hierarchically under what they call an "Integrated File System". Some newer file systems will hold any old blob of data in a file, whereas the original file system is basically inherited from the old S/38 library system.
Basically, under ROOT there's a branch that follows OS/2 or NT file naming conventions, under QOPENSYS a branch that emulates the Unix file naming system, etc. These have directories and files, as you'd expect in most OS filesystems. But QSYS.LIB is special. It's a library, not a directory. On the AS/400, libraries are quite a bit more structured. Among other things, a library can act as a database "collection".
Some objects in the libraries are DB2 tables, views, indexes, and transaction journals. Others are stored procedures, queries and application programs.
All other libraries live in QSYS.LIB, and no library can contain another library.
Now, since the S/38 predated the DB2 product, and for that matter SQL, there are many database tools on an AS/400 that no other DB2 platform has. And these tools are useful. There used to be a couple of logical file tricks (indexed views) you could do with DDS that SQL didn't offer, for example. But most tools from the mainframe and Unix versions have been ported to the iSeries.
Neat boxes, those AS/400s. Single-source as all hell, though.
afaik... (Score:3, Interesting)
I was always under the impression that the filesystem was a better place for this, assuming the directory structure was simple or fixed (i.e. you wouldn't be creating thousands of subdirs dynamically). Why store all your banners in a table as BLOBs when you can simply have a web-accessible directory and store them there?
I never really found a clear answer on which was better performance-wise, though I suspect the filesystem-based way is. I also found it to be less of a hassle to implement. Any intelligent thoughts?
Re:afaik... (Score:2)
Re:afaik... (Score:1)
Re:afaik... (Score:4, Insightful)
* Size: I didn't anticipate the user would upload 5 gig of images
* Access: You need some sort of extraneous code to pull those images from the database
* Communication: your code must know how to fake being a proper image (right headers, creation/change times, etc)
* Size 2: A lot of databases can't store high-res images (read: large images) in a database without serious penalties (like chopping them up into little bits to fit into a MySQL bigblob)
Speed can be addressed in the Perl world via FastCGI or mod_perl, and in similar ways on other platforms, but you'll still have to do disk reads to get the data and you'll pay a price for having one more often-used script in memory.
Storage size can be counteracted with clever tricks or "professional grade" (read: expensive) DB engines like Oracle and DB2, which have binary data storage as one of their features, but you'll need to pay money and have a big muscly machine to run 'em.
For my money, 90% of image serving can and should be done from the filesystem, because that's what web servers are made for. The other 10% are weird meta things like this [picpix.com] that could use the file system but are designed to use a database.
My big thing is using your head and asking yourself what will be easier in the long run. Sure, binary data slung around with Perl DBI sounds convenient, but how convenient is it to run "ismcheck" on a 100 GB database, eh?
Re:afaik... (Score:2)
Same goes for similar applications like Gallery [sourceforge.net], whose newest version (v2.0, in progress) will use SQL to drive the backend, and will rely on your filesystem/webserver to serve the images.
Re:afaik... (Score:1)
Re:afaik... (Score:2)
Why not do both? Oracle has a datatype called BFILE, in which the actual data is stored in the filesystem, and the row in the table contains a pointer to it. You have the best of both worlds, filesystem access to the image if you want it, or database access, and you can very easily integrate the image with the rest of your relational data.
Re:afaik... (Score:2, Funny)
Re:afaik... (Score:2, Interesting)
The real killers in database performance come in two places: large complex joins and full table scans. Eliminate these two things in your db and you should never have scaling problems. To do this, watch your long-running queries and make sure they have the proper indexes on the tables. And make sure you keep your statistics updated. But please, for the love of God, don't listen to those moron college profs who say normalization is key. That's all good and fun until you have a million records in each of the four tables you need to join to provide a solution. That, or your data model charts take up 300 sq. ft. of wall space.
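The tuning loop described above (find the long-running query, check its plan, add the missing index) can be sketched in miniature with SQLite, which ships with Python. The table and column names are invented for illustration; every serious RDBMS has its own EXPLAIN equivalent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, i * 1.5) for i in range(1000)],
)

query = "SELECT * FROM orders WHERE customer_id = 42"

# Without an index on customer_id, SQLite has to scan the whole table.
plan_before = conn.execute("EXPLAIN QUERY PLAN " + query).fetchone()[3]
print(plan_before)   # e.g. "SCAN orders"

# Add the index the WHERE clause needs; the plan becomes an index search.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = conn.execute("EXPLAIN QUERY PLAN " + query).fetchone()[3]
print(plan_after)    # e.g. "SEARCH orders USING INDEX idx_orders_customer ..."
```

The same workflow scales up: run your vendor's EXPLAIN on the slowest queries and check that each WHERE/JOIN column is actually covered by an index.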
Re:afaik... (Score:4, Informative)
If your normalized tables take a performance hit, buy a bigger box. If you munge the data with replication, you're screwed.
Re:afaik... (Score:3, Insightful)
There is always a case for normalization or denormalization. I have developed many different databases of various sizes (up to 10 terabytes), and as a rule I try to be as normalized as possible. I have worked on some data conversions that were rather messy due to denormalization. However, I can't think of a single database where there wasn't some type of denormalization for various reasons, from speeding up query results to meeting the client's demands.
Yes, I could have recommended that my clients buy bigger hardware, but when a summary table can be used instead of spending X thousand dollars, what do you think you would do?
Re:afaik... (Score:2)
I would use an Oracle Materialized View and get the speed increase without the extra dollars and without needing the extra hardware. Work smarter, not harder.
Denormalisation (Score:2)
Don't denormalise for the sake of denormalising - the trick is to know when to break the rules and to do so very carefully.
Denormalisation is only one tool that can be used to improve the performance of a system, and of course, other options like more memory, faster CPUs and better code should be addressed first.
Re:afaik... (Score:1)
3rd normal form (Score:2)
Good DBMSes can break up complex joins, applying the criteria piece by piece. You can also create run-time extracts which are used by real-time / almost-real-time systems for read access. However, your advice is simply terrible. Once you lose normal form you lose the associative law on your table algebra. That means join operations are not defined independently of the order they are performed in, and that is very bad. Rick mentions an example of this in terms of addresses, but it can get far worse.
Re:afaik... (Score:1)
In our database course we were taught very clearly why and why not to normalise, and how much.
Are US universities really as bad as most people here make out?
I think the reason is that you have
a) The person who is good at computers, so skips uni and goes to work
b) The person who isn't that good, goes to uni, then goes to work
c) The person who is good at computers, goes to uni, and then goes to work
It seems people in group A keep meeting people in group B. Perhaps this is because people in group C work in jobs that people in group A don't, so they never meet.
Thoughts?
Re:afaik... (Score:1)
Actually our university scored the highest in terms of research. It means that our lecturers are (generalising) not that good at lecturing, but are strong on knowledge. That suits me okay though.
Normalisation (Score:2)
Neglect to normalise your database at your own peril!
Learn how to properly normalise a database (3NF is usually good enough) and then learn how to write decent queries and tune your indexes appropriately. In some circumstances it may be worthwhile caching some data in extra tables which are refreshed periodically. In even rarer circumstances it may be necessary to denormalise the database, but always normalise it first and only denormalise if you encounter some intractable performance issue.
We have some incredibly complex queries at my current client and we have managed to gain performance improvements by re-writing certain queries. Splitting complex queries up and using temp tables, derived tables, sub-queries, etc can help and you will have less of a problem with data anomalies which can creep into a denormalised structure.
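Splitting a complex query with a temp table, as suggested above, looks roughly like this in miniature (SQLite via Python purely for illustration; the table and figures are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES ('east', 100), ('east', 50), ('west', 200);
""")

# Materialise the intermediate aggregation in a temp table instead of
# nesting it inside one giant query; the final query then only has to
# deal with the small intermediate result-set.
conn.execute("""
    CREATE TEMP TABLE region_totals AS
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
""")

rows = conn.execute(
    "SELECT region FROM region_totals WHERE total > 160"
).fetchall()
print(rows)  # [('west',)]
```

Derived tables and sub-queries achieve the same decomposition without the explicit intermediate table; which one wins is a per-query, per-engine question.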
Re:Normalisation (Score:2)
> a million records
That's chump change! Try 50 million and then we'll talk databases.
Re:Normalisation (Score:2)
Just a quick check of one of our systems shows a table with 1.5 million records. Not a large table, but not trivial either. Due to the nature of the data and the structure of the database, it is necessary to routinely execute queries containing 10-12 self-joins. One report requires 16 joins in total. The slowest of these queries takes about 30 seconds; most are well under 2 seconds.
Sure, we could denormalise our data to make the queries simpler (multi-table joins scare DBAs for some reason) but we would lose a lot of the flexibility that our design affords us. We have no need to denormalise this database because we wrote more efficient queries and have paid careful attention to our indexes.
The largest table in our production database is under 10 million rows. We have simulated much larger sizes in our test environments though, and found our joins are not a problem. The performance with 10 times the data is still well within acceptable range for our requirements. (Most stored procedures execute under 5 seconds with this size.) If you have problems with a three table join, head back to school and learn how to do it properly!
You say you are as religious as possible about indexing, but what about the structure of your queries? Are you indexing the right columns? You are using temp tables? Try getting rid of them - in many cases they slow things down. Make sure your queries are sargable. Make sure you are limiting the intermediate result-sets to be as small as possible. Understand the query execution plan. Make sure your server has enough memory! Are your server's settings tuned for your particular application? There are many, many things to pay attention to, but you should not be having any problems with a 52 million row table in a three-table join!
When you start doing real joins with 10 or more tables (each with >1 million rows) then we'll talk about how to denormalise the data to improve performance.
8-)
Re:Normalisation (Score:2)
Re:Normalisation (Score:2)
Re:afaik... [And that's not very far...] (Score:1)
This is a horrible approach to analyzing any problem. Normalized and denormalized data both have their place in today's RDBMS-driven world. Take any one of the large ERP packages available today: Oracle Financials, PeopleSoft, SAP, etc.
When working with specific data, for example Accounts Payable data, you really don't want to duplicate all of that customer data again and again for each row in the database, hence you normalize it. Yes, you pay a bit of a speed penalty when joining against CUSTOMERS, CUSTOMER_ADDRESSES, INVOICES and INVOICE_LINES. In reality, that difference is never larger than a few seconds for large tables (read: 10 million+ records) when using a properly optimized (read: good index scheme) set of normalized tables.
RDBMSs work very well at finding the 10 rows out of those 10 million that fit your search request. Where they puke is trying to manipulate 50% or more of the data contained in multiple tables: the Data Warehouse/Data Mart.
Those same Accounts Payable tables make reporting a real pain when they're normalized, so you go through a denormalization (or summarization) procedure to fill out your reporting infrastructure. Pre-summarize your data into a single row with multiple 'buckets' for every way you want to slice the numbers. Duplicate data on every row and get those 10+ million records per month down to a few hundred thousand at most.
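The bucket-per-period summarization described above can be sketched like so (SQLite via Python; the AP-style table, vendor names, and amounts are all invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ap_detail (vendor TEXT, month INTEGER, amount REAL);
    INSERT INTO ap_detail VALUES
        ('acme', 1, 100), ('acme', 1, 50), ('acme', 2, 75),
        ('globex', 1, 200);
""")

# Collapse many detail rows into one summary row per vendor, with one
# 'bucket' column per month that the reports slice by.
conn.execute("""
    CREATE TABLE ap_summary AS
    SELECT vendor,
           SUM(CASE WHEN month = 1 THEN amount ELSE 0 END) AS jan,
           SUM(CASE WHEN month = 2 THEN amount ELSE 0 END) AS feb
    FROM ap_detail
    GROUP BY vendor
""")

rows = conn.execute(
    "SELECT vendor, jan, feb FROM ap_summary ORDER BY vendor"
).fetchall()
print(rows)
```

The reporting queries then hit the small, wide summary table instead of joining and aggregating the detail tables on every request.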
If you have to write custom programs in C, Java, Perl or your language of choice in order to operate on an exported version of the data, so much the better; the database won't perform as well when acting on every row in the database as a program optimized to summarize it. Once that's done, load it back in.
By using this approach, you're able to use the best of both the normalized and denormalized approaches and satisfy both the data entry clerks -- because the data is entered quickly -- and the managers -- because they can get virtually any report in under 30 seconds.
Sweeping comments that suggest that one method or another is bad in all cases just screams: "Look at me! I don't know what I'm doing but I'm going to tell you that what you're doing is wrong, anyway!"
Database AND filesystem combo (Score:1)
Serving images from the filesystem is always faster than serving them from a database, since the database is also on the filesystem. (If you think that the DB could cache the results and therefore be faster, just serve your most frequently requested images from a filesystem in RAM (a RAM-disk) -- of course you can't cache more than what would fit in RAM, no matter if you do it yourself or with the DB.)
Using a DB for serving images can make sense when you serve different images in different cases, like banners, where you want simpler CGI scripts (it's a matter of convenience, not performance). But using a database doesn't have to mean storing images in the database.
If I found out that serving some of my images would be easier (and that means less error-prone, better to maintain, etc.) if I used a database, I would do something like this: I'd store the digests of my images in the DB (it could be a binary MD5, for example -- just 16 bytes per image). Then, when my CGI script gets the digest, it would use it to find the image on the filesystem.
So, if it gets 9743a66f914cc249efca164485a19c5c it serves /images/97/43/a66f/914cc249efca164485a19c5c.png
(Splitting the digest into directories of varying depth would depend on the filesystem and the number of images, of course; this is just an example. One could also use fewer than 128 bits of the digest, if 96 or 64 were enough (depending on the number of images), to get shorter paths.)
The .png suffix could also be stored in the database, to allow easy use of a few different image formats.
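A minimal sketch of that digest-to-path mapping (Python here just for illustration; the 2/2/4 directory split matches the example above and is a tunable assumption, not a rule):

```python
import hashlib

def digest_to_path(digest_hex: str, ext: str = "png") -> str:
    # Split the 32-char hex digest into 2/2/4-char directory levels,
    # leaving the rest of the digest as the filename.
    d = digest_hex
    return f"/images/{d[:2]}/{d[2:4]}/{d[4:8]}/{d[8:]}.{ext}"

# The digest itself would be computed once, when the image is stored:
digest = hashlib.md5(b"...image bytes...").hexdigest()

print(digest_to_path("9743a66f914cc249efca164485a19c5c"))
# /images/97/43/a66f/914cc249efca164485a19c5c.png
```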
(The DB could store more human-friendly paths instead of message digests of images, but would need more human interaction -- it's probably a matter of taste).
This way you can still have a cluster of very simple static image-serving servers in the future, while having the benefits of a database. Also, the database traffic is much, much lower. The only difference in your scripts is that you send a redirect instead of the actual content, which is even easier.
The most important benefit, however, is that you can have static images, database and CGI frontends split into three independent machines, or even groups of machines, when your traffic becomes too high for an all-in-one solution. Because with DB-only BLOB images, you had better have lots of money for a database cluster (and DB-CGI bandwidth for the redundant internal traffic). Using database-stored images when all you need is easier searching with SQL queries is, in my opinion, just using the database as an expensive filesystem.
Digest collisions (Re:Database AND filesystem...) (Score:1)
Well, yes and no. You're right that the 64-bit parts of MD5 digests are not unique, but neither are the full 128-bit digests. Any n-bit digest, provided it's randomly distributed, will match another given digest once every 2^n times on average, since there are only 2^n different results. Now it's up to you whether you think 281474976710656 (48 bits) different digests is enough for you, or you need 590295810358705651712 (69 bits) or the full MD5 340282366920938463463374607431768211456 (128 bits).
You have to use enough bits to make sure (well, you never can be sure, just as you can't be sure that you won't win a lottery 1000 times in a row -- you get the idea) that two files having the same digest is practically impossible (because it's always theoretically possible, however unlikely). It depends on the number of files you have. For an n-bit digest and m files there are 2^(nm) different outcomes and (2^n)!/(2^n - m)! good outcomes (i.e. those without collisions).
So, the probability of not having any collisions is (2^n)!/(2^(nm) (2^n - m)!), but calculating (2^128)! directly is not what you want to do (trust me -- a 1000-teraflop supercomputer would need half a million times the age of our universe, provided it had that much RAM and could handle numbers that long, which I don't even dare estimate). Better to compute it from the command line; here's a little Perl one-liner I just hacked up out of boredom -- yes, I know, I should take my medicine and get some sleep:
perl -le'($n,$m)=@ARGV; for($w=$z=2**$n,++$_;$m;--$m,--$z){$_*=$z/$w} print' n m
It will compute (2^n)!/(2^(nm) (2^n - m)!) (rounded to your floating point resolution), i.e. it will give you the probability of not having any collisions using n-bit digests with m files (0 means it's impossible and 1 means it's certain, or so close to certain that it rounds to 1). If anyone asks how it works -- it's magic. Copyright © 2002 alfaiomega [slashdot.org]. This program is free software; you can redistribute it and/or modify it under the same terms as Perl [perl.com] itself. There is NO warranty; not even for MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE or READABILITY.
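For the Perl-averse, the same product can be computed with a few lines of Python (this mirrors the one-liner's loop; a sketch, not a substitute for a proper birthday-bound analysis):

```python
def p_no_collision(n_bits: int, m_files: int) -> float:
    # Probability that m_files randomly distributed n_bit digests are
    # all distinct: the product of (2^n - i) / 2^n for i = 0 .. m-1,
    # which is exactly what the Perl one-liner accumulates.
    space = 2 ** n_bits
    p = 1.0
    for i in range(m_files):
        p *= (space - i) / space
    return p

print(p_no_collision(1, 2))  # 0.5: two 1-bit digests collide half the time
print(p_no_collision(2, 5))  # 0.0: five files can't fit 4 digest values without a collision
```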
Great, I thought no one would read my comment with Score:1. That's good to hear that someone liked it more than the moderators.
Re:afaik... (Score:1)
They are preparing for the new SQL based FS from MS....
Re:afaik... (Score:1)
PostgreSQL supports BLOBs, but the project suggests storing the names in a table and the binary objects in the filesystem.
Re:afaik... (Score:2)
That said, I think it is far wiser and much more cost effective to store 'dumb' (non-relational) content like images on the web/app server.
Re:afaik... (Score:2)
Re:afaik... (Score:2)
Get a new job (Score:2)
You obviously are working with morons. Very few data-oriented applications need to write their own data-stores. Almost anything you can imagine (complex relational data, object-oriented data, xml stuff, photos, video footage, 3/4D spatial data, etc, etc..), someone has written database software tuned for it. Use it and be happy.
performance v flexibility (Score:2)
A relational DB offers great flexibility (pull any data from the database, add most any index to greatly improve the performance, etc).
I would base the decision mostly on the interfaces to the application. I've worked on applications where the company preferred to access the SQL database directly for reporting, and I've worked on projects where the only interface to the application was via HTTP/XML. In the latter case, no one cared how we stored the data, so we dumped XML into a filesystem.
Joe
Re:performance v flexibility (Score:4, Interesting)
That being said, there are still lots of valid reasons not to use a real DBMS for every small project. The most important is simplicity. Bringing in hundreds of megabytes of third-party software to store a few kilobytes of data is not only overkill, it's also a maintenance nightmare!
Re:performance v flexibility (Score:2)
Take the time saved by using an out of the box solution (i.e. the DB) and buy a bigger box to handle any inefficiencies you get by using a DB.
Re:performance v flexibility (Score:2)
Re:performance v flexibility (Score:2)
I mean, of course anything can be argued to the point of stupidity. If you just want to store a few flat lists of items, then sure: use a text file. But once you go beyond that (and you always will), standardization is your friend.
And we're not talking about hundreds of megabytes of 3rd-party software. Really, the PostgreSQL install footprint is a few megabytes. Or try Interbase/Firebird, which is only about a 3 MB download these days. Both of these DBMSs use ANSI-standard SQL, so if you learn SQL once, you can pretty much use the basics anywhere.
Of course, different DBMSs have different advanced/complex capabilities, but by the time you start needing advanced capabilities, it is still quicker to learn a DBMS than to create your own advanced data storage.
Re:performance v flexibility (Score:2)
If the problem is sufficiently simple that a full RDBMS is overkill, some dbm variant is perfect and ships with every *nix.
i agree with the poster (Score:2)
Oracle versus Perl (Score:1)
In short, you get some expensive licenses and a lot of work to manage what could be done with a short shell script.
Anybody ever hear about: grep, cut, sed, awk, sort, uniq... God forbid Perl?
When your only tool is a hammer...
Understand your needs and then decide... (Score:4, Insightful)
On the other hand, if you just want to store a small array of data that fits in a 100-line text file, and the program is completely closed and self-contained, there's no need for the flexibility of an RDBMS.
Imagine a business that has to "send and receive stuff"...
If you're moving two or three little packages to nearby local area businesses, only, you can get by with a small car.
But imagine you regularly ship objects large and small to locations local and international... Then you need an intermodal transportation system. Sure, your interface might be "the shipping guy", but the backbone of the transportation is heavy duty...
We use the DB for nearly everything (Score:5, Interesting)
The reason we use the database instead of the filesystem is dead simple. The database is god-like. I can do point-in-time recovery and guarantee that the database is completely consistent with the recovery point. If I had all that in a filesystem it'd be harder. It means I have hot-rollover capability from server to server without having to duplicate my filesystem from machine to machine; I just let the hot archive logs deal with that. It means I have one backup system and one failure recovery plan. It means all I have to monitor is the Oracle tablespace, to ensure I don't run out of space. It means when I say "commit", I can hold Oracle accountable for ensuring the data is there, rather than having myself held accountable by management.
If I was a good little boy and swallowed all the kool-aid, I'd use iFS (Oracle's Internet Filesystem) and it'd be all good. However, I don't; I just use a huge array of BLOBs in my Oracle database.
Now, that said, I have a remote filesystem that all of this data gets spooled to. Once spooled there, it gets written to CD. Once the CDs are written, they are used to find, compare, and (if they match) delete the BLOBs out of the database. The CD job then deletes the files out of the spool. Duplicate the CD, compare the two, send one offsite.
The other reason we use the database is that it's easier to deal with in our application, because writing a join against the filesystem is tricky...
Kirby
Re:We use the DB for nearly everything (Score:2)
I think this gets to the point.
Namely, the hierarchical tree model of filesystems pales in comparison to the relational model of modern databases (first generation databases, way back in the day, were hierarchical).
-- p
Re:We use the DB for nearly everything (Score:5, Interesting)
I've submitted a request into the Oracle iTAR (Technical Assistance Request) system, at 3:00 AM, gotten a call back by 3:10AM, and a resolution by 3:45AM on a relatively esoteric bug. When I say, I get to hold Oracle accountable, I mean, I can hold Oracle accountable to get me up and running pronto with as much data as I'm going to get. It costs an arm and a leg, but the only people I've ever heard of who have lost their data using Oracle just didn't do backups properly. Oracle is pretty serious about keeping your data around. If you stay with stable tested versions, you'll be fine with Oracle.
I've had Oracle help me on de-supported platforms, using non-standard configurations, doing craziness of my own making. They always help me when I contact them for support. For production machines I stay on the tried and true, but Oracle has never let me down in weird situations.
Oracle, as a general rule, has *never* told me to buzz off, it's your fault. They stayed with me, found it was my fault, and then showed me the docs where it says I'm doing something that won't work. For all their faults, Oracle has *NEVER* failed me in any way when it comes to support. When it's my fault, I stand there and take it on the chin. However, as a backing store, Oracle will whoop anything I write eight ways to Sunday for speed, reliability, portability, quality, documentation and support. Hands down. If you can afford it, there is no excuse for writing your own custom storage manager. If you can't afford it, try PostgreSQL. Oracle or PostgreSQL *WILL* be better for 99.9% of the cases out there. Google is one of the few examples of a situation where writing your own is probably a good idea.
For the record, out of the ~200 million records I've processed, I've lost 1, count 'em, 1 record using Oracle that was Oracle's fault (I've lost any number of them when Linux crashed, but that's my fault, not Oracle's). Even then, Oracle clearly identified which one it was and when it happened, so it was easy to recover.
Kirby
Re:We use the DB for nearly everything (Score:2)
"If Oracle cannot substantially correct a breach of Oracle's warranties in a commercially reasonable manner, you may end your program license, technical support, or other services and recover the license fees, technical support fees or other services fees paid to Oracle under this agreement..."
This no-nonsense agreement was a selling point for my company. If Oracle can't fix the fuck-up, they'll give you your money back. Period. In the contract. No verbal agreements. It's in writing.
I have never lost data using Oracle. Not once. I've had bad sectors on different hard drives crop up and Oracle has ALWAYS been recoverable. Yes, it's more expensive than a solid-gold custom-molded toilet seat with a free midget butt-wiper, but for data integrity you just can't beat it.
Can the data be lost? (Score:1)
RDBMSs Rock (Score:2)
Re:RDBMSs Rock (Score:1)
Maintenance (Score:2)
Internal stuff is similar. Do you want to maintain EVERYTHING, or just half of it?
some pluses, minuses (Score:2)
Minuses: the DBMS is large (in MB, in install/config requirements, and in CPU usage) and your customer may not be running the DBMS brand/version which you have tested your app with. Supporting multiple DB vendors is a pain. SQL is sort of standard, but the table definitions tend to vary. Ick.
My Experience (Score:1)
For example, a recent project required storing 500,000 records (relating to cotton, btw) of around 15 columns each. Now think of that number and how long it would take to find a particular row or change a set of data: the MS SQL database we used on modest hardware was damn quick, quicker than could be imagined in fact.
Anyway, SQL is just so useful. With a properly structured SQL query you can do some amazing stuff, very quickly and with very little effort.
Can you afford to lose the data? (Score:2)
But if the data is critical to the business, and/or not easily re-created, it should go into a real DB that is Managed Properly(tm).
It isn't just storage (Score:1)
If the database won't be designed properly (as in many just-get-the-job-done small businesses) then a specific application may be better. But, if someone will spend the time doing design, the database forces logic and structure onto the system. While this may be an annoyance to sloppy coders, this helps ease usage (because of strict guidelines) and understanding. Yes, understanding. There are times that data is to some extent known, but to a lesser extent understood. A decent database layout increases understanding as the objects and relations must be logical.
I say this as a DBA. And, as a DBA, for good or for bad, there is hardly a project that wouldn't benefit from clear data definitions.
Storing data... (Score:2)
Sometimes, Excel is good enough. Or XML. Or plain text files. Or a custom file format. etc...
If the people working on your project have only ever worked with databases, they will want to use databases for everything. Most stuff will fit into a database, but sometimes it is not appropriate to do so - as others have mentioned here, storing images in a database is not always a good idea, but you would probably want to store the location of the images in a database.
If you need to be able to ship the data around to different machines/offices/clients/over the net/etc, then maybe an XML file will be best. Custom file formats may be appropriate in some cases too. (Though I'd lean towards a more open file format.)
Why not do the whole project as a one-off? (Score:3, Informative)
Huh? Do you create a custom C library for every application as well? How about a custom UI toolkit? Custom preprocessor/compiler?
Sounds kind of silly, doesn't it.
So why do these folks think a "customized one-off data storage solution" sounds any better? It's the same problem - you can either use something that's already been debugged, tested, and tweaked for performance, or you can spend your own time and effort to create it yourself. That's time and effort that could go towards coding and testing the final product, but is instead spent elsewhere (probably because someone thinks that using a database for storage would make the application "bloated").
I think the problem is probably that when you mention using a "database", most people equate that term with "general purpose database server" (Oracle, SQL Server, PostgreSQL, MySQL, etc.) There are libraries available that were specifically designed to offer programs lightweight database access without the pain of using a full-fledged RDBMS. Search Google for embedded database [google.com], xbase library [google.com], or open source database library [google.com] to start... there are any number of toolkits that will allow you to create a very customized storage solution without having to create "one-off" code for each and every project.
Re:Why not do the whole project as a one-off? (Score:2)
The proper way to go about this is to analyse what needs to be stored and choose the solution that provides the best match based upon needed functionality, system use, and programming time.
Sean.
Re:Why not do the whole project as a one-off? (Score:1)
Absolutely. The way the question was worded, though, makes me think that their typical case is probably beyond the scope of doing anything easily with just flat files... otherwise, why the debate? You'd have to be a truly hardcore DB fanatic to argue that a DB is the ideal solution in all circumstances.
filesystem is a database (Score:1)
I would just like to point out that the filesystem is a particular case of specialized database so this whole debate fs versus db does not make much sense. Some of its implementations may suck, but it's still a database that has a well-known, familiar interface (access mechanism). One can build a file store using a filesystem and replace it with a database-backed "file system" later if need be.
And we should also keep in mind that while some databases do give killer performance, you have to pay a lot for it, while the ol' filesystem comes with the OS already.
Now, for any other use than filesystem-related stuff, it does not make sense to try to invent your own small-scale storage mechanism when there are so many good, cheap/free database servers out there well-suited for the job.
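The swap-it-later idea above only works if the storage sits behind one small interface. A sketch, with made-up class and method names, of a filesystem-backed store that a database-backed implementation could later replace without the callers noticing:

```python
import os
import tempfile

# Minimal storage interface: put/get by key. A DB-backed class with
# the same two methods could be dropped in later.
class FileStore:
    def __init__(self, root):
        self.root = root

    def put(self, key, data):
        # Each key becomes a file under the root directory.
        with open(os.path.join(self.root, key), "wb") as f:
            f.write(data)

    def get(self, key):
        with open(os.path.join(self.root, key), "rb") as f:
            return f.read()

store = FileStore(tempfile.mkdtemp())
store.put("report.txt", b"quarterly numbers")
print(store.get("report.txt"))  # → b'quarterly numbers'
```

Code written against `put`/`get` never learns whether the bytes live in files or in a database, which is exactly the flexibility the post describes.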
Go for the extra procs. (Score:1)
This isn't necessarily a bad thing to do either. When you have to be conservative with your cash, a lot of the time these boxes have to serve multiple purposes. Having the sql server running on only procs 3 & 4 would leave 1 & 2 available to do 'other stuff' (web services? perl scripts?).
With SQL2k you can even have the development and the production sql server be the same system and generally not affect each other performance-wise when you are thrashing the procs. You just need to set up multiple instances of the service and assign each to separate processors (of course, they won't be completely autonomous since they *are* on the same box, but at least you won't get competition for the processor).
In any case, I'd go with the quad proc box. Only get one CPU if you want. You can always add to it later and purchase further licenses *if* you need them.
Argh. Posted to the wrong DB Thread! (Score:1)
Embedded applications (Score:4, Insightful)
When you want speed, flexibility, scalability, reliability, extensibility, and particularly developer productivity, you will undoubtedly end up shooting yourself in the foot later if you avoid some form of DBMS up front.
Where you have a particularly well-defined, narrow functionality, and performance in a small footprint is a requirement, an RDBMS may not be such a good choice, but DB libraries like berkeley db can still be very useful.
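For a taste of the berkeley db style of library, Python's standard `dbm` module gives a persistent key-value store in the same mold: small footprint, no SQL parser, no server process. A minimal sketch with made-up keys:

```python
import dbm
import os
import tempfile

# A dbm database is a lightweight on-disk hash table: bytes in,
# bytes out, opened like a file.
path = os.path.join(tempfile.mkdtemp(), "cache")
with dbm.open(path, "c") as db:   # "c" = create if missing
    db[b"user:42"] = b"alice"

# Reopen read-only; the mapping persisted to disk.
with dbm.open(path, "r") as db:
    name = db[b"user:42"]
print(name)  # → b'alice'
```

For well-defined, narrow lookup workloads this is often all the "database" an embedded application needs.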
And with PostgreSQL [postgresql.org], Firebird [sourceforge.net] , MySQL and so many other free, open-source projects out there covering such a broad spectrum of needs for a database, why would you not use that expert work?
Re:DB based GUI + Satellite = Horror (Score:2)
Just because there's a database in the center of everything, doesn't mean the clients can't cache data locally....
When to use a relational database: (Score:2)
When you can imagine querying the data you are entering - you can't easily query images, or other binary data (although I guess there must be someone working on this problem somewhere...). If you can't query it, you should usually find a better place to store it - NAS is usually fine - and maintain a pointer to it (e.g. a filename). Yes, it's something that can get un-synced, but most databases suck when it comes to actually dealing with binary data, and you can use that capacity a lot more effectively elsewhere.
When the structure of the data is likely to remain stable. If your application deals with well-understood entities, whose properties are unlikely to change over time, a database is a great solution.
Databases are, however, relatively change-resistant - it's typically a pain in the backside to change the datatype of a column, remove columns etc. So, if you're working in a domain where you continuously learn new things about your core entities, or if your development processes are highly iterative, you might be better off using an alternative data storage mechanism.
When more than a single user is likely to access the data - yes, you can create locking mechanisms yourself. You can also take your own garbage to the local dump. It's usually not a good use of your time, and the cost of not dealing with the issues involved is high, both for garbage and for concurrent access to shared data.
When you require consistency across transactions - the good old ACID (atomicity, consistency, isolation, durability) principles, which become important for many non-trivial applications.
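Atomicity, the A in ACID, can be sketched with sqlite3: move money between two accounts so that either both updates commit or neither does. The table and figures are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 0)])
conn.commit()

try:
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute(
            "UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
        raise RuntimeError("crash mid-transfer")  # simulate a failure
        # never reached: the matching credit to bob
except RuntimeError:
    pass

# The half-done debit was rolled back; no money vanished.
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # → {'alice': 100, 'bob': 0}
```

Implementing that guarantee yourself on top of flat files, surviving crashes at any point, is exactly the wheel the poster is warning against reinventing.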
If you care about enforcing rules of referential integrity - do you want to ensure that all the tracks in your record collection can be tied back to a recording ? Do all orders have to have a customer ? Those things are far simpler to implement with an RDBMS than in code.
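The record-collection example above translates directly into a foreign key constraint. A sketch with sqlite3 (which needs foreign keys switched on explicitly); the titles are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite disables FKs by default
conn.execute("CREATE TABLE recordings (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("""CREATE TABLE tracks (
    id INTEGER PRIMARY KEY,
    recording_id INTEGER NOT NULL REFERENCES recordings(id),
    name TEXT)""")

conn.execute("INSERT INTO recordings VALUES (1, 'Kind of Blue')")
conn.execute("INSERT INTO tracks VALUES (1, 1, 'So What')")  # valid parent

try:
    # An orphan track: the database itself refuses it,
    # with no application code required.
    conn.execute("INSERT INTO tracks VALUES (2, 99, 'Nowhere')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # → True
```

One declaration in the schema replaces validation code scattered through every program that ever touches the table.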
There are instances where using an RDBMS is not appropriate. Ones that spring to mind are:
- your business domain is not well understood or liable to rapid change. In this case, the cost of change for database objects is likely to be a problem - consider storing data in a self-describing format like XML.
- the application domain doesn't lend itself to being described in relational terms - image manipulation tools, word processors etc. which deal with mainly binary information probably should not use a relational model for their core data structures.
Alternatives exist - Object Oriented databases are becoming more and more popular. I have way too little experience with these to comment on their use.