How Do You Organize Your Experimental Data?
Posted by timothy from the can't-remember-where-I-put-my-memory dept.
digitalderbs writes "As a researcher in the physical sciences, I have generated thousands of experimental datasets that need to be sorted and organized, a problem which many of you have had to deal with as well, no doubt. I've sorted my data with an elaborate system of directories and symbolic links to directories that sort by sample, pH, experimental type, and other qualifiers, but I've found that through the years, I've needed to move, rename, and reorganize these directories and links, which has left me with thousands of dangling links and a heterogeneous naming scheme. What have you done to organize, tag, and add metadata to your data, and how have you dealt with redirecting thousands of symbolic links at a time?"
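On the narrow question of dangling and redirected links, here is a minimal Python sketch, assuming a POSIX filesystem and that the links broke because a directory prefix was renamed. The function names `find_dangling_links` and `retarget_links` are made up for illustration, and this is only one possible approach, not something the submitter describes using.

```python
import os

def find_dangling_links(root):
    """Return paths of symbolic links under root whose targets no longer exist."""
    dangling = []
    # os.walk does not follow symlinked directories by default
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() looks at the link itself; exists() follows it to the target
            if os.path.islink(path) and not os.path.exists(path):
                dangling.append(path)
    return sorted(dangling)

def retarget_links(root, old_prefix, new_prefix):
    """Repoint every symlink under root whose target starts with old_prefix."""
    changed = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                continue
            target = os.readlink(path)
            if target.startswith(old_prefix):
                new_target = new_prefix + target[len(old_prefix):]
                os.remove(path)               # removes the link, not the target
                os.symlink(new_target, path)  # recreate it with the new target
                changed.append(path)
    return sorted(changed)
```

Running `find_dangling_links` before and after `retarget_links` gives a quick check of how many links were actually repaired.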
Use databases! (Score:4, Insightful)
Subj.
If you have something more complex than a flat file, then use relational databases. Even Access databases are better than a collection of text files.
sqlite (Score:4, Insightful)
http://www.sqlite.org/ [sqlite.org] a "replacement for fopen()" -- http://www.sqlite.org/about.html [sqlite.org]
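To make the suggestion concrete, here is a rough sketch of a tag catalog using Python's built-in sqlite3 module: the data files stay where they are, and sample, pH, and experiment type become rows rather than directories and links. The table names, column names, and helper functions are all hypothetical, chosen only for illustration.

```python
import sqlite3

def open_catalog(path=":memory:"):
    """Open (or create) a catalog with one row per dataset and one row per tag."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS dataset (
            id   INTEGER PRIMARY KEY,
            path TEXT UNIQUE NOT NULL
        );
        CREATE TABLE IF NOT EXISTS tag (
            dataset_id INTEGER NOT NULL REFERENCES dataset(id),
            tag_key    TEXT NOT NULL,
            tag_value  TEXT NOT NULL,
            UNIQUE (dataset_id, tag_key)
        );
    """)
    return conn

def add_dataset(conn, path, **tags):
    """Register a data file and attach key/value tags to it."""
    cur = conn.execute("INSERT INTO dataset (path) VALUES (?)", (path,))
    rows = [(cur.lastrowid, key, str(value)) for key, value in tags.items()]
    conn.executemany(
        "INSERT INTO tag (dataset_id, tag_key, tag_value) VALUES (?, ?, ?)", rows)
    conn.commit()

def find_datasets(conn, **tags):
    """Return paths of datasets that carry every given key/value tag."""
    query = "SELECT path FROM dataset"
    clauses, params = [], []
    for key, value in tags.items():
        clauses.append(
            "id IN (SELECT dataset_id FROM tag WHERE tag_key = ? AND tag_value = ?)")
        params += [key, str(value)]
    if clauses:
        query += " WHERE " + " AND ".join(clauses)
    query += " ORDER BY path"
    return [row[0] for row in conn.execute(query, params)]
```

With this layout, moving a file means updating one `path` value, and a call like `find_datasets(conn, sample="lysozyme", ph="7.0")` stands in for a directory of links.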
I used to be anal about organization... (Score:2, Insightful)
...but then Google came along and taught me that it's not about knowing where things are, but rather about being able to find them. My email, for instance, is "organized" by the year in which it arrives, and I use the search function of my email client to find things. No big folder structure, no moving messages around, and I haven't had problems finding any email I need. Oh yes, I keep them all... good fodder for "on x/x/xxx you said..." retorts.
For files, then, the key is to have descriptive file names that provide readily searched text. Including the date somewhere in the name (I tend to use this format because it sorts well: 20100815) makes it easier to sort through multiple versions.
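As a small illustration of that naming idea, here is a Python sketch that builds a searchable file name with a leading YYYYMMDD stamp. The fields (sample, experiment, pH) are borrowed from the original question, and the exact layout of the name is just an assumption.

```python
from datetime import date

def dataset_filename(day, sample, experiment, ph, ext="dat"):
    """Build a descriptive name such as 20100815_lysozyme_nmr_pH7.0.dat."""
    # YYYYMMDD sorts chronologically under a plain alphabetical sort
    stamp = day.strftime("%Y%m%d")
    return f"{stamp}_{sample}_{experiment}_pH{ph:.1f}.{ext}"

names = [
    dataset_filename(date(2010, 8, 15), "lysozyme", "nmr", 7.0),
    dataset_filename(date(2009, 12, 1), "lysozyme", "nmr", 5.5),
]
# Sorting the names puts the 2009 run before the 2010 run
names.sort()
```

Because every qualifier is plain text in the name, an ordinary file search or `grep` over a directory listing can find a run by sample, experiment type, or pH.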
Then, you can spend quality time figuring out how to reliably back up all that stuff.... :)
Re:Use databases! (Score:3, Insightful)
If you have something more complex than a flat file, then use relational databases. Even Access databases are better than a collection of text files.
That really depends on what your intended use for them is. I mean I don't know this particular fellow's situation for data collection or what tools he uses for reporting and visualization but perhaps, for him, it's a much better idea to store them in flat files. Me? I have been using flat files for all my data collection about local crime (see here [lazylightning.org], here [lazylightning.org], here [lazylightning.org], and here [lazylightning.org]) for several reasons:
1. I script it all with awk/sed to scrape the data and then put it in a CSV for summary with MySQL.
2. Yes, I could use MySQL for it all but I like to easily see it in its raw format on another remote machine. I also like to use Excel to do ad-hoc pivots and this is the easiest way for me to do that.
3. I upload the data to Google Docs and use their gadgets to make charts for my dashboards and maps. If I were to store it solely in MySQL I would have to make the CSV, pipe it into the MySQL, convert it back out to CSV and then upload it. An additional step for nothing.
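For the flat-file route described in this list, a quick tally over a CSV needs no database at all. A minimal Python sketch follows; the column names and rows are invented for illustration and are not the poster's data, and this stands in for the pivot step rather than reproducing the awk/sed and MySQL pipeline.

```python
import csv
import io
from collections import Counter

def count_by_column(csv_text, column):
    """Count how often each value appears in one column of CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[column] for row in reader)

# Invented sample rows, standing in for a scraped CSV file
sample = "date,category\n20100801,theft\n20100802,vandalism\n20100803,theft\n"
counts = count_by_column(sample, "category")
```

The same CSV can still be opened in a spreadsheet or uploaded elsewhere unchanged, which is the convenience the list above is arguing for.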
Hey, no method is perfect for everyone and every project is a little different and while it's hard for me, based on the information provided, to give this guy any help, automatically suggesting that he needs a relational database to do his data storage might be just a little shortsighted.
YMMV.
Re:Use databases! (Score:3, Insightful)
The question he is really asking is probably more like: 'Is there anything like Windows Live Photo Gallery that will allow me to tag and sort all of my data in a variety of ways, like WLPG lets me sort my pictures?'
Try using a scientific workflow system (Score:3, Insightful)
What about a wiki? (Score:3, Insightful)
Use tags in Apple OS X (Score:3, Insightful)
Re:Relational Databases won't do! (Score:1, Insightful)
You can just pipe the output from the SQL client to a text file (or export the results to a CSV file if you use a Query Browser).
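The same export can be scripted. Here is a minimal sketch using Python's sqlite3 and csv modules, assuming SQLite rather than any particular SQL client; the function name `export_query_to_csv` is made up for illustration.

```python
import csv
import sqlite3

def export_query_to_csv(conn, query, out_path):
    """Run a query and write its column names and rows to a CSV file."""
    cur = conn.execute(query)
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        # cursor.description holds one tuple per result column; item 0 is the name
        writer.writerow([col[0] for col in cur.description])
        writer.writerows(cur)
```

The resulting file is plain text again, so existing scripts that read flat files keep working.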
Re:Use databases! (Score:3, Insightful)
Translation: I am not a DB guru, but I deal with massive amounts of complex data and need a DB guru, but I have no intention of hiring one.
Seriously, hire a DB wizard in the DB software of your choice for a couple of days. Have him set up the data and optimize it. You'll save yourself a lot of headaches, AND put yourself in a good position for future data maintenance. Imagine that your project gets a lot of attention in the future, and you suddenly get a lot of funding and the money to hire more people. Or imagine that you'd like to share data with, or incorporate data from, outside sources or other researchers. If you're using something "standard" like a relational DB, it will be much easier to hire a DB wizard than to find a programmer who can piece together a lot of mismatched files and convoluted organization schemes.
This is what databases are designed to do. Just because you're not an expert at setting them up, and there's a performance hit to setting them up wrong, doesn't mean that they aren't still the right tool.
You need a LIMS (Score:2, Insightful)
A Laboratory Information Management System [wikipedia.org] will help you store, organize, analyze and data-mine your data.
Re:Use databases! (Score:3, Insightful)
MySQL: click the install button (use default options, they're all fine)
phpMyAdmin: put in document root, configure ("click the install button")
And you're set. How was that hard...?
Some software is, in fact, difficult to set up and maintain. As a scientist with an unusually large sample collection, learning to use a database is probably a good idea. Many scientists are taught MATLAB, and setting up a WAMP stack is much, much easier than learning MATLAB.
I'm amused by all the /.ers suggesting the physics nerd set up a SQL database. It ain't super easy and I can't imagine anyone who isn't a programmer of some sort doing so.
Scientists are pretty smart. He should learn to use a database.
Consistently (Score:4, Insightful)
Word up. I'd say the first goal is to store your raw, bulk data consistently. Then you can have several sets of post processing scripts that all draw from the same raw data set.
You want this data format to be well-documented, but I wouldn't bother meticulously marking it up with XML tags and other metadata or whatever. You just want to be able to read it fast, and have other scripts be able to convert it into other formats that would be useful for analysis, be it MATLAB, Octave, CSV, or some tediously marked-up XML. You do want to be able to grep and filter the data pretty easily, so keep that in mind when you're designing the format. It will likely end up being pretty repetitive, but that's OK, since you'll likely store it compressed. That can improve performance when reading it, since the storage medium you're pulling the data from is often slower than the processor doing the decompression... and it also provides some data integrity / consistency checking. Oh, and of course, you can store more raw data if it's compressed.
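A minimal sketch of that idea in Python, assuming tab-separated text lines stored with gzip; the record layout and function names are hypothetical. The gzip format carries a CRC-32 of the uncompressed data, which is where the integrity check mentioned above comes from: Python's gzip module raises an error if the check fails when the file is read to the end.

```python
import gzip

def write_records(path, records):
    """Write each record as one tab-separated line in a gzip-compressed file."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write("\t".join(str(value) for value in record) + "\n")

def read_records(path):
    """Yield each line of a gzip-compressed file as a list of string fields."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n").split("\t")
```

The files stay greppable with tools such as `zgrep`, and post-processing scripts can convert the fields into whatever format an analysis needs.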
Re:Use databases! (Score:4, Insightful)
Re:Relational Databases won't do! (Score:4, Insightful)
I hereby humbly suggest that you are, instead, wrong. Here is why:
Scientists are not software developers and never want to be that. They want to run their experiments and analyse their data. The latter requires recording and processing of numerical data. This is where computers enter their workflow - as number crunching tools that have to be easy to use and utterly flexible.
At times, my work consisted of writing lots of one-off C and Python programs to process data in ever new ways in order to get an idea what I was actually looking at. And I had to write them myself because these weren't your run of the mill analysis steps. Many of these programs were not run again once I had their results. During all this time, I as a scientist was looking to get the data in and out of the programs in ways that are easy to code without getting distracted from what I wanted to achieve scientifically. My head was full of theory and formulas, not data structures and good software design.
In that particular state of mind, writing SQL isn't one of the things that I would have wanted to spend any time on. The inherent complexities are a distraction and a big one at that. And, hell, I'm one of the guys who actually *know* SQL. Most scientists actually don't. Hell, many of them barely know how to use their favorite language's core libs to their advantage. They don't care and - may I say - rightly so.
Besides, the code would get more bloated. If I want to output three values that belong together I write a print statement that places them on the same line of text in the output file and I'm done. That's a one-liner that takes me about 20 seconds to type in. In the worst case, I need to open a file beforehand and close it afterwards instead of piping it into stdout. That's maybe 3 lines of code. Now tell me: how many lines of code do I need to write to place these values in a database? That is, provided that a table already exists to hold that data.
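For a rough sense of scale, here are the two options side by side as a Python sketch, assuming SQLite through the built-in sqlite3 module and a table that already exists. The table and column names (`measurement`, `temperature`, `pressure`, `volume`) are invented for illustration, and other languages or database servers would look different.

```python
import sqlite3
import sys

def save_as_text(values, out=sys.stdout):
    """The plain-text route: one print call puts the values on one line."""
    print(*values, file=out)

def save_to_db(conn, values):
    """The database route: one parameterized INSERT plus a commit."""
    conn.execute(
        "INSERT INTO measurement (temperature, pressure, volume) VALUES (?, ?, ?)",
        values)
    conn.commit()
```

The insert itself is short here; what the sketch leaves out is opening the connection, creating the table, and keeping the schema in step with whatever the next one-off program wants to record.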
My point is: relational databases don't do the job for scientists. Instead, they get in the way. And you and anyone else here who is arguing in favor of them probably lack the related experience to understand that - no offense intended. The points you make are derived from pure theory. Respect the needs of the users as well, please.
Maybe there is a middle ground here: hire a software developer who builds and maintains the DB and a nice, convenient-to-use wrapper library around it for you. That'll take a while and someone will have to foot the bill for it.
Great comments (Score:3, Insightful)
Reading these comments has changed my thoughts on data storage a little bit, but has reinforced my idea that databases are a bad idea for this sort of thing.
The main issues I have with using databases are file size (I store and convert text files that are 10-100MB zipped), and mutability (generated data doesn't typically change, I just add new experiments on top of other data). A secondary issue is that for plain-text data files (or plain-text convertible data files), writing code is easier when you don't have to bother about a database middleman.
So, if I were to do [another] large research project in the future, here are my thoughts on what I would consider an appropriate approach:
My most common uses for old data are re-running analyses (generating new data as results), and sending data to someone else. It helps to be able to make those things as quick as possible.