
How Do You Organize Your Experimental Data? 235

Posted by timothy
from the can't-remember-where-I-put-my-memory dept.
digitalderbs writes "As a researcher in the physical sciences, I have generated thousands of experimental datasets that need to be sorted and organized, a problem which many of you have had to deal with as well, no doubt. I've sorted my data with an elaborate system of directories and symbolic links to directories that sort by sample, pH, experimental type, and other qualifiers, but I've found that through the years, I've needed to move, rename, and reorganize these directories and links, which has left me with thousands of dangling links and a heterogeneous naming scheme. What have you done to organize, tag and add metadata to your data, and how have you dealt with redirecting thousands of symbolic links at a time?"
  • by mangu (126918) on Sunday August 15, 2010 @12:16PM (#33257126)

    In my experience, the best thing is to let the structure stand as it was the first time you stored the data.

    Later, when you discover more and more relationships around that data, you may create, change, and destroy those symbolic links as you wish.

    I usually refrain from moving the data itself. Raw data should stand untouched, or you may delete it by mistake. Organize the links, not the data.
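
    For the "thousands of dangling links" part of the question, here is a minimal Python sketch of the "organize the links, not the data" idea. It assumes the links store a path prefix that changed when a directory was renamed. The function names `find_dangling_links` and `retarget_links` are made up for this illustration and are not from any existing tool.

```python
import os
from pathlib import Path


def find_dangling_links(root):
    """Return every symbolic link under root whose target no longer exists."""
    dangling = []
    for dirpath, dirnames, filenames in os.walk(root):
        # os.walk lists symlinks to directories in dirnames and all other
        # symlinks (including broken ones) in filenames, so check both.
        for name in dirnames + filenames:
            path = Path(dirpath) / name
            if path.is_symlink() and not path.exists():
                dangling.append(path)
    return sorted(dangling)


def retarget_links(links, old_prefix, new_prefix):
    """Repoint links whose stored target starts with old_prefix.

    Only the links are rewritten; the data they point to is not touched.
    Returns the number of links that were changed.
    """
    fixed = 0
    for link in links:
        target = os.readlink(link)
        if target.startswith(old_prefix):
            new_target = new_prefix + target[len(old_prefix):]
            link.unlink()
            os.symlink(new_target, link)
            fixed += 1
    return fixed
```

    Running `find_dangling_links` first and reading the list before calling `retarget_links` keeps the operation reviewable, which matters when thousands of links are involved.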

  • by Rui Lopes (599077) on Sunday August 15, 2010 @12:30PM (#33257202) Homepage
    The present (and the future) of experimental data organisation, repurposing, re-analysing, etc. is being shifted towards Linked Data [linkeddata.org] and supporting graph [openrdf.org] data [franz.com] stores [openlinksw.com]. Give it a spin.
  • Re:Interns. (Score:4, Informative)

    by spacefight (577141) on Sunday August 15, 2010 @12:37PM (#33257244)
    Yeah right, let the interns do the job. Not. Interns use new tools no one understands, finish the project during their term, then move on and leave the most likely buggy or unfinished project behind. Pity the person who has to clean up the mess. Better to do the job on your own and know the tools, or hire someone permanently for the whole department.
  • by gmueckl (950314) on Sunday August 15, 2010 @12:49PM (#33257306)

    To everybody here suggesting relational databases: you are on the wrong track, I'm afraid. Relational databases handle large sets of completely homogeneous data well if you can be bothered to write software for all the I/O around them. This is where it all falls apart:

    1. Many lab experiments don't give you the exact same data every time. You often don't do the same experiment over and over. You vary it and the data handling tools have to be flexible enough to cope with that. Relational databases aren't the answer to that.

    2. Storing and fetching data through database interfaces is vastly more difficult than just using standard input/output or plain text files. I've written hundreds of programs for processing experimental data and I can tell you: nothing beats plain old ASCII text files! The biggest boon is that they are compatible with almost any scientific software you can get your hands on. Your custom database likely is not. Or how would you load the contents of your database table into gnuplot, Xmgrace or Origin, just to name a few tools that come to mind right now?
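
    One possible answer to that closing question, sketched under assumptions: if the data lived in an SQLite file, a few lines of Python could dump a table back to whitespace-separated ASCII that plotting tools can read. The helper name `export_table` and the table and column names in the usage note are hypothetical. This does not remove the extra step the comment objects to; it only shows how small that step can be.

```python
import sqlite3


def export_table(conn, table, columns, out_path):
    """Write the chosen columns of a table as whitespace-separated text.

    Returns the number of data rows written.
    """
    # Identifiers cannot be bound as SQL parameters, so quote them by hand.
    col_sql = ", ".join('"' + c.replace('"', '""') + '"' for c in columns)
    table_sql = '"' + table.replace('"', '""') + '"'
    rows = conn.execute(f"SELECT {col_sql} FROM {table_sql}")
    count = 0
    with open(out_path, "w") as out:
        # gnuplot treats lines starting with '#' as comments.
        out.write("# " + " ".join(columns) + "\n")
        for row in rows:
            out.write(" ".join(str(value) for value in row) + "\n")
            count += 1
    return count
```

    With a hypothetical table `titration(ph, absorbance)`, calling `export_table(conn, "titration", ["ph", "absorbance"], "titration.dat")` would produce a file that `plot "titration.dat" using 1:2` can read in gnuplot.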

    I wish I had a good answer to the problem. At times I wished for one myself, but I fear the best reply might still be "shut up and cope with it".

  • by ericbg05 (808406) on Sunday August 15, 2010 @01:10PM (#33257402)
    Others have already mentioned SQLite. Let me briefly expound on the features that are likely the most important to you, assuming (if you'll permit me) that you don't have much experience with databases:
    0. The basic idea here is that you are replacing this whole hierarchy of files and directories by a single file that will contain all your data from an experiment. You figure out ahead of time what data the database will hold and specify that to SQLite. Then you create, update, read, and destroy records as you see fit, pretty much as many records as you want. (I personally have created billions of records in a single database, though I'm sure you could make more.) Once you have records in your database, you can define with great flexibility which result sets you want from the data. SQLite will compute the result sets for you.
    1. SQLite is easy to learn and use properly. This is as opposed to other database management systems, which require you to do lots of computery things that are probably overkill for you.
    2. Your entire data set sits in a single file. If you're not in the middle of using the file, you can back up the database by simply copying the file somewhere else.
    3. Transactions. You can wrap a large set of updates into a single "transaction". These have some nice properties that you will want:
      3.1. Atomic. A transaction either fully happens or (if e.g. there was some problem) fully does not happen.
      3.2. Consistent. If you write some consistency rules into your database, then those consistency rules are always satisfied after a transaction (whether or not the transaction was successful).
      3.3. Isolated. (Not likely to be important to you.) If you have two programs, one writing a transaction to the database file while the other reads it, then the reader will either see the WHOLE transaction or NONE of it, even if the writer and reader are operating concurrently.
      3.4. Durable. Once SQLite tells you the transaction has happened, it never "un-happens".
      These properties hold even if your computer loses power in the middle of the transaction.
    4. Excellent scripting APIs. You are a physical sciences researcher; in my experience this means you have at least a little knowledge of basic programming. Depending on what you're doing, this might greatly help you to get what you need out of your data set. You may have a scripting language that you prefer; if so, it likely has a nice interface to SQLite. If you don't already know a language, I personally recommend Tcl: it's an extremely easy language to get going with, and has tremendous support directly from the SQLite developers.
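
    The atomicity and consistency points can be sketched with Python's standard `sqlite3` module (Python is used here instead of the Tcl recommended above only because it ships with an SQLite binding). The table layout, the pH range check, and the helper names `add_batch` and `count_rows` are assumptions made up for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a real project would pass a file path
conn.execute(
    "CREATE TABLE measurement ("
    " sample TEXT NOT NULL,"
    " ph REAL NOT NULL CHECK (ph BETWEEN 0 AND 14),"  # a consistency rule
    " value REAL NOT NULL)"
)


def add_batch(conn, rows):
    """Insert all rows in one transaction; return True if it committed."""
    try:
        # Used as a context manager, the connection commits if the block
        # succeeds and rolls back if the block raises.
        with conn:
            conn.executemany(
                "INSERT INTO measurement (sample, ph, value) VALUES (?, ?, ?)",
                rows,
            )
        return True
    except sqlite3.IntegrityError:
        return False


def count_rows(conn):
    """Return how many measurements are currently stored."""
    return conn.execute("SELECT COUNT(*) FROM measurement").fetchone()[0]
```

    If any row in a batch violates the pH check, `add_batch` returns False and none of that batch's rows are stored, including the valid ones that came before the bad row.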

    Good luck and enjoy!

  • Lab book (Score:2, Informative)

    by PeterKraus (1244558) <peter.kraus@member.fsf.org> on Sunday August 15, 2010 @01:14PM (#33257430) Homepage

    After a couple of months of working in an actual chemical company, as opposed to uni, I've realised how good it is to keep a lab book. Obviously, if you handle thousands of data sets daily, it doesn't help that much, but it helps to have a written trace of everything you do in the lab.

    Also, keep the actual data intact and work on a copy. We use Excel at work, but something using SQL tailored for your needs might be even better.

  • How CMS sorts data (Score:2, Informative)

    by toruonu (1696670) on Sunday August 15, 2010 @01:24PM (#33257484)

    Well, CMS is one of the large experiments at the LHC. The data produced should reach petabytes per year; add to it the simulated data and we have a helluva lot of data to store and address. What we use is a logical filename (LFN) format. We have a global "filesystem" where different storage elements have files in a filesystem organized in a hierarchical subdirectory structure. As an example: /store/mc/Summer10/Wenu/GEN-SIM-RECO/START37_V5_S09-v1/0136/0400DDE2-F681-DF11-BA13-00215E21DC1E.root

    /store -> the beginning marker of the logical filename region, which different sites can map differently (some use NFS, some use http, etc.)
    /mc/ -> it's Monte Carlo data
    /Summer10/ -> the data was produced during the summer of 2010
    /Wenu/ -> it's a simulation of W decaying to electron and neutrino
    /GEN-SIM-RECO/ -> the data generation steps that have been done
    /START37_.../ -> the detector conditions that have been used (the actual full description of the conditions is in some central database)
    /0136/ -> the serial number (actually I'm not 100% sure, but it's related to the production workflow etc.)
    /0400DDE2-F681-DF11-BA13-00215E21DC1E.root -> the actual filename; the hash is there because the process has to make sure there are no conflicts in filenames

    Another example: /store/data/Run2010A/MinimumBias/RECO/Jul16thReReco-v1/0000/0018523B-D490-DF11-BF5B-00E08178C111.root

    This file is real data, taken during the first run of 2010 and filtered to the MinimumBias primary dataset (related to event trigger content). The datafiles in there contain RECO content and were done during the re-reconstruction process on July 16th. Then there's again the serial number (block edges define new serial numbers) and then the filename.

    You could use a similar structure to differentiate the datafiles that you actually use. The good thing is that you can map such filenames separately everywhere as long as you change the prefix according to the protocol used (we use for example file:, http:, gridftp:, srm: etc). You can also easily share data with other collaborating sites, as long as everyone uses a similar structure. No need for special databases etc. If you need some lookup functionality, then one option is a simple find (assuming you have filesystem access), or you could build a database in parallel and use the LFN structure to index things etc.
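
    A rough Python sketch of how such a naming scheme could be parsed and mapped to a site-specific location. This is not CMS software: the field names in `LogicalFileName` are labels chosen here from the description above, and `parse_lfn`, `to_physical`, and the prefix in the usage note are hypothetical.

```python
from typing import NamedTuple


class LogicalFileName(NamedTuple):
    kind: str        # "mc" for simulation, "data" for real data
    period: str      # e.g. "Summer10" or "Run2010A"
    dataset: str     # e.g. "Wenu" or "MinimumBias"
    tier: str        # processing steps, e.g. "GEN-SIM-RECO"
    conditions: str  # conditions or reprocessing tag
    serial: str
    filename: str


def parse_lfn(lfn):
    """Split a /store/... logical filename into its labelled components."""
    parts = lfn.strip("/").split("/")
    if len(parts) != 8 or parts[0] != "store":
        raise ValueError(f"not a recognised logical filename: {lfn!r}")
    return LogicalFileName(*parts[1:])


def to_physical(lfn, site_prefix):
    """Prepend a site-specific prefix (protocol plus mount point) to an LFN."""
    return site_prefix.rstrip("/") + lfn
```

    For instance, `to_physical(lfn, "file:/mnt/storage")` keeps the logical name unchanged and only swaps the prefix, which is what lets each site map the same name differently. Because the parsed fields are named, they could also serve as columns if a lookup database is built in parallel.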

  • by wolf87 (989346) on Sunday August 15, 2010 @01:25PM (#33257496)
    I have seen these kinds of situations happen a lot (I'm a statistician who works on computationally-intensive physical science applications), and the best solution I have seen was a BerkeleyDB setup. One group I work with had a very, very large number of ASCII data files (order of 10-100 million) in a directory tree. One of their researchers consolidated them to a BerkeleyDB, which greatly improved data management and access. CouchDB or the like could also work, but I think the general idea of a key-value store that lets you keep your data in the original structure would work well.
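
    A minimal sketch of the key-value idea, using Python's standard `dbm` module as a stand-in (it is not BerkeleyDB, and which backend it picks depends on the platform). The relative file path becomes the key, so the original structure survives in the keys. The names `consolidate` and `fetch` are made up for this example, and the database file is assumed to live outside the data tree.

```python
import dbm
from pathlib import Path


def consolidate(data_root, db_path):
    """Copy every file under data_root into one key-value database.

    The key is the file's path relative to data_root, so the original
    directory structure is preserved in the keys. Returns the file count.
    """
    root = Path(data_root)
    stored = 0
    with dbm.open(str(db_path), "c") as db:
        for path in sorted(root.rglob("*")):
            if path.is_file():
                key = path.relative_to(root).as_posix()
                db[key.encode("utf-8")] = path.read_bytes()
                stored += 1
    return stored


def fetch(db_path, key):
    """Return the stored bytes for one relative path."""
    with dbm.open(str(db_path), "r") as db:
        return db[key.encode("utf-8")]
```

    After consolidation, `fetch(db_path, "sampleA/ph7/run1.txt")` returns the same bytes the file held, without the filesystem having to manage millions of small files.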
  • by RandCraw (1047302) on Sunday August 15, 2010 @01:28PM (#33257514)

    First I would lay out your data using meaningful labels, like a directory named for the acquisition date + machine + username. Never change this. It will always remain valid and allow you to later recover the data if other indexes are lost. Then back up this data.

    Next build indexes atop the data that semantically couple the components in the ways that are meaningful or accessible. This may manifest as indexed tables in a relational database, duplicate flat files linked by a compound naming convention, unix directory soft links, etc.

    If you're processing a lot of data, your choice of indexes may have to optimize your data access pattern rather than the data's underlying semantics. Optimize your data organization for whatever is your weakest link: analysis runtime, memory footprint, index complexity, frequent data additions or revisions, etc.

    In a second repository, maintain a precise record of your indexing scheme, and ideally, the code that automatically re-generates it. This way you (or someone else) can rebuild lost databases/indexes without repeating all your design and data cleansing decisions, and domain expertise. This info is often stored in a lab notebook (or nowadays in an 'electronic lab notebook').

    I'd emphasize that if you can't remember how your data is laid out or pre-conditioned, your analysis of it may be invalid or unrepeatable. Be systematic, simple, obvious, and keep records.
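
    The "code that automatically re-generates the index" could look something like the following Python sketch, assuming the index is a tree of soft links and the catalog is a plain dictionary (which could be kept as a JSON file in the second repository). The function name `rebuild_index`, the catalog layout, and the directory names are assumptions for illustration only.

```python
import os
import shutil
from pathlib import Path


def rebuild_index(raw_root, index_root, catalog):
    """Recreate the symlink index from scratch; raw data is never modified.

    catalog maps a raw directory name to its attributes, for example
    {"2010-08-15_nmr1_alice": {"sample": "lysozyme", "pH": "7.0"}}.
    Returns the number of links created.
    """
    raw_root = Path(raw_root).resolve()
    index_root = Path(index_root)
    if index_root.exists():
        # rmtree removes the links themselves and does not follow them,
        # so the raw directories they point to are left alone.
        shutil.rmtree(index_root)
    created = 0
    for name, attributes in sorted(catalog.items()):
        target = raw_root / name
        if not target.is_dir():
            raise FileNotFoundError(f"catalog entry without raw data: {name}")
        for attribute, value in attributes.items():
            link_dir = index_root / attribute / str(value)
            link_dir.mkdir(parents=True, exist_ok=True)
            os.symlink(target, link_dir / name, target_is_directory=True)
            created += 1
    return created
```

    Because the whole index is thrown away and regenerated, renaming a qualifier means editing the catalog and rerunning the function, and no stale links are left behind. The index directory must be kept separate from the raw data directory for this to be safe.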

  • Re:Use databases! (Score:2, Informative)

    by BrokenHalo (565198) on Sunday August 15, 2010 @02:38PM (#33257856)
    If the dude works at a research institute of minor size and up, they should have IT staff who can do that initial setup for him.

    I'm not quite sure why your post is rated as funny; scientists are not necessarily the best people to be left in charge of setting up databases. I've seen all sorts of atrocities constructed in the name of science, from vast flat files to cross-linked ISAM/VSAM files, and I remember many late nights (with complaints from wife) spent sorting them out when a subscript went out of range.
  • by rsborg (111459) on Sunday August 15, 2010 @03:22PM (#33258104) Homepage
    Document Management Systems [wikipedia.org] are great - they combine (some of) the benefits of source control, file systems, and email (collaboration).

    I would recommend just downloading a VM or cloud image of something like KnowledgeTree or Alfresco (I personally prefer Alfresco), and running it on the free VMware Player or a real VM solution if you have one.

    I recently set up a demo showing the benefits of such a system. In about one day I was able to download and set up Alfresco, expose a CIFS interface (i.e., \\192.168.x.x\documents) and just dump a portion of my entire document base into the system. After digestion, the system had all the documents full-text-indexed (yes, even Word docs and Excel files, thanks to OpenOffice libraries), and I could go about changing the directory structure, moving and renaming files, etc., and the source control would show me the changes. In fact, I could go into the backend and write SQL queries if I wanted detailed reports of how things were on date X or Y revisions ago. It was quite sweet. All the while, the users still saw the same Windows directory structure, and modifications they made there would be versioned and recorded in Alfresco's database.

    Here is a bitnami VM image [bitnami.org], will save you days of configuration. If the solution works for you, but is slow, just DL the native stack and migrate or re-import.

  • by mikehoskins (177074) on Sunday August 15, 2010 @04:56PM (#33258602)

    Yes, agreed, a combination is good (SQL + NoSQL + filesystem).

    There is no one-size-fits-all scenario, here.

    However, there is utility in a NoSQL database over a raw filesystem. One feature is indexed search. Another is versioning. Another is the fact that it is extremely multiuser (proper record locking, even if there are multiple writes to the same record). Also, many NoSQL databases (especially MongoDB) have built-in replication, sharding, Map-Reduce, and horizontal scaling.

    MongoDB's GridFS (especially with FUSE support) marries many of these features together. MongoDB does have some SQL DB features (such as indexing/searching and transactions) but not others.

    Check out the whole stack here:
        http://www.mongodb.org/ [mongodb.org]
        http://www.mongodb.org/display/DOCS/GridFS [mongodb.org]
        http://github.com/mikejs/gridfs-fuse [github.com]
