
How Do You Organize Your Experimental Data?

digitalderbs writes "As a researcher in the physical sciences, I have generated thousands of experimental datasets that need to be sorted and organized — a problem which many of you have had to deal with as well, no doubt. I've sorted my data with an elaborate system of directories and symbolic links to directories that sort by sample, pH, experimental type, and other qualifiers, but I've found that through the years, I've needed to move, rename, and reorganize these directories and links, which has left me with thousands of dangling links and a heterogeneous naming scheme. What have you done to organize, tag and add metadata to your data, and how have you dealt with redirecting thousands of symbolic links at a time?"
  • Here (Score:2, Funny)

    by Xamusk ( 702162 )
    I store them in first posts.
  • Use databases! (Score:4, Insightful)

    by Cyberax ( 705495 ) on Sunday August 15, 2010 @12:13PM (#33257106)

    Subj.

    If you have something more complex than a flat file, then use relational databases. Even Access databases are better than a collection of text files.

    • Re: (Score:3, Insightful)

      by garcia ( 6573 )

      If you have something more complex than a flat file, then use relational databases. Even Access databases are better than a collection of text files.

      That really depends on what your intended use for them is. I mean I don't know this particular fellow's situation for data collection or what tools he uses for reporting and visualization but perhaps, for him, it's a much better idea to store them in flat files. Me? I have been using flat files for all my data collection about local crime (see here [lazylightning.org], here [lazylightning.org], here [lazylightning.org],

      • I agree. It depends.

        Yes, relational databases store and retrieve well-defined data very, very well. Do you have referential integrity needs? If that's your situation, use SQLite (small data and very simple types but little referential integrity), MySQL (medium to large data), or PostgreSQL (medium to very large data or more complex data types) and don't look back. SQL queries, relationships, and referential integrity are very powerful.

        If not, then I'd look at MongoDB with GridFS. I'd even go further a

        • by tibit ( 1762298 )

          One can mix-n-match: use flat files to store raw data, post-processed stuff, etc., but use a database to keep track of everything. The latter can even be handled by your OS from the get-go. On OS X you could write a Spotlight plugin or two for your data files, and as long as the file format allows storing metadata within the files, Spotlight will index it. Same goes for Windows Search. You could also use native mechanisms for adding metadata to files - those exist, IIRC, on both Windows 7 and OS X.

          • Re: (Score:3, Informative)

            by mikehoskins ( 177074 )

            Yes, agreed, a combination is good (SQL + NoSQL + filesystem).

            There is no one-size-fits-all scenario, here.

            However, there is utility in a NoSQL database over a raw filesystem. One feature is indexed search. Another is versioning. Another is the fact that it is extremely multiuser (proper record locking, even if there are multiple writes to the same record). Also, many NoSQL databases (especially MongoDB) have built-in replication, sharding, Map-Reduce, and horizontal scaling.

            MongoDB's GridFS (especially

      • Re:Use databases! (Score:4, Interesting)

        by mobby_6kl ( 668092 ) on Sunday August 15, 2010 @02:50PM (#33257916)

        Certainly it depends, YMMV, and all that. Still, I think that some of the points that you bring up are not actually arguing against a relational database, perhaps just for a slight reorganization of your processes.

        1. I don't know where you get the data from, but anything awk/sed can do, Perl can do too. And from Perl (or PHP, if you're lame) it's very easy to load the data into a database (a sketch follows below).
        2. It's easy to connect to an SQL server from the remote machine and either dump everything or just select what you need. You'll need more than notepad.exe to do this, but it's not rocket science. Pivots in Excel can be really useful, but Excel can easily connect to the same database and query the data directly from there and use it for your charts/tables.
        3. Since by this point you'll already have all the data in the db, exporting it to CSV would be trivial. Or you could even skip Google Docs entirely and generate your charts with tools which can automatically query your database.

        I agree with your final point, though: we really have no idea what would be best for the submitter within the limitations of skills, budget, time, etc. Perhaps flat files are really the best solution, or maybe stone tablets are.
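
        For concreteness, a minimal sketch of points 1-3 using only Python's built-in csv and sqlite3 modules (the runs.csv columns, file names, and the pH cutoff are made up for illustration, not anyone's actual setup):

          import csv
          import sqlite3

          conn = sqlite3.connect("results.db")
          conn.execute("""CREATE TABLE IF NOT EXISTS runs
                          (sample TEXT, ph REAL, temperature REAL, result REAL)""")

          # 1. Load parsed measurements (here from a hypothetical runs.csv) into the database.
          with open("runs.csv", newline="") as f:
              rows = [(r["sample"], float(r["ph"]), float(r["temperature"]), float(r["result"]))
                      for r in csv.DictReader(f)]
          conn.executemany("INSERT INTO runs VALUES (?, ?, ?, ?)", rows)
          conn.commit()

          # 2. Query just the slice you need instead of grepping flat files.
          acidic = conn.execute(
              "SELECT sample, result FROM runs WHERE ph < 7 ORDER BY result").fetchall()

          # 3. Export the subset back to CSV for Excel, gnuplot, Google Docs, etc.
          with open("acidic_runs.csv", "w", newline="") as f:
              writer = csv.writer(f)
              writer.writerow(["sample", "result"])
              writer.writerows(acidic)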

    • Re: (Score:3, Insightful)

      by Idiomatick ( 976696 )
      I'm amused by all the /.ers suggesting the physics nerd set up a SQL database. It ain't super easy and I can't imagine anyone that isn't a programmer of some sort doing so.

      The question he is really asking is probably more like: 'Is there anything like Windows live photo gallery that will allow me to tag and sort all of my data in a variety of ways like wlpg lets me sort my pictures?'
      • The question he is really asking is probably more like: 'Is there anything like Windows live photo gallery that will allow me to tag and sort all of my data in a variety of ways like wlpg lets me sort my pictures?'

        That is what Nepomuk is supposed to do: allow you to build semantic meaning into your data. The problem I have with it as it currently stands is that the Nepomuk processes suck the life out of my computers, so I disable Nepomuk entirely.

      • Re: (Score:2, Funny)

        by grub ( 11606 )
        If the dude works at a research institute of minor size and up, they should have IT staff who can do that initial setup for him.
        • Re: (Score:2, Informative)

          by BrokenHalo ( 565198 )
          If the dude works at a research institute of minor size and up, they should have IT staff who can do that initial setup for him.

          I'm not quite sure why your post is rated as funny; scientists are not necessarily the best people to be left in charge of setting up databases. I've seen all sorts of atrocities constructed in the name of science, from vast flat files to cross-linked ISAM/VSAM files, and I remember many late nights (with complaints from wife) spent sorting them out when a subscript went out of
          • Re:Use databases! (Score:4, Insightful)

            by bunratty ( 545641 ) on Sunday August 15, 2010 @03:15PM (#33258066)
            I've helped my wife, who is a research scientist, by writing scripts to process her data. The IT departments at the companies where she's worked have no idea about what her work is or how she does it, and perhaps have even less interest in helping her. Their function is to keep the infrastructure (networks, file servers, email servers, etc.) working and install software packages onto the computers. They aren't of any use in helping individual users with their work. They will install SAS and S-Plus but will not help by writing SAS and S-Plus code, for example. From your comment about your wife, it sounds like you might be in the same situation as I am.
            • by oneiros27 ( 46144 ) on Monday August 16, 2010 @07:41AM (#33262188) Homepage

              The problem is, most IT people have no idea what to do with science data -- it'd be like going to a dentist because you're having a heart attack. They might be able to give general advice, but have no clue what specifics need to be done. Likewise, IT people might be really good at diagnosing hardware but suck at writing code. Not all IT specialists are cross-trained in enough topics to deal with this issue effectively (data modeling, UIs, database admin, programming, and the science discipline itself).

              There's a field out there called 'Science Informatics'. It's not a very large group, but there's a number of us who specialize in helping scientists organize, find, and generally manage data. Think of us as librarians for science data.

              Most of us would even be willing to give advice to people outside our place of work, as the better organized science data is in general, the more usable and re-usable it is. There's even a number of efforts to have people publish data, so it can be shared, verified, etc. And most of us have a programming background, so we might be able to share code with you, as we try to make it open source where we can, so we don't all have to re-solve the same problems.

              Because each discipline varies so much, both in how it thinks about its data and what its ultimate needs are, we tend to be specialists, but there are a number of different groups out there.

              There's also bioinformatics, health/medical informatics, chemical informatics, etc. Plug your science discipline + 'informatics' into your favorite search engine, and odds are you'll find a group or person you can write to for more info and advice.

              NSF recently funded a few more groups to try to build out systems and communities: DataOne [dataone.org] and the Data Conservancy [dataconservancy.org], and I believe there's some more money still to be awarded.

        • by bkaul01 ( 619795 )

          Funny indeed. Having worked at both the university and National Lab levels, I've found that the IT departments function more as pushers of restrictive security policies and corporate spyware onto everyone's systems than as enablers of researcher productivity...

          Really, this is exactly what WinFS would've been perfect for, if MS had ever gotten it working and released it. As it is, I use a hierarchical directory structure - top level is the research project, then date/

        I'm amused by all the /.ers suggesting the physics nerd set up a SQL database. It ain't super easy

        One SQL database? As in, one ginormous SQL database? Yeah, that's a headache.

        OTOH, one SQL database per "type" of data (e.g., "last summer's research.sql")? Hell yes. If it's not in a database, you need to spend the ten minutes to learn how to make one. MS Access exists for a reason, and this is it. It won't be a very pretty db, but it doesn't need to be -- it'll be indexed, searchable, and more or less protected.

        The only reason NOT to put it into a relational database is if you have some system that's

      • Re: (Score:3, Insightful)

        by BlitzTech ( 1386589 )
        Apache: click the install button (use default options, or switch to non-service mode which it very clearly explains means it only runs when you run it instead of whenever you start your computer)
        MySQL: click the install button (use default options, they're all fine)
        phpMyAdmin: put in document root, configure ("click the install button")

        And you're set. How was that hard...?

        Some software is, in fact, difficult to set up and maintain. As a scientist with an unusually large sample collection, learnin
      • Any physics nerd should be able to set up a MySQL database pretty easily-- it's not quite as easy as falling out of a tree, but it's not anywhere near as difficult as a lot of other things in physics. A great deal of data acquisition and analysis for many (if not most) physical scientists involves a bunch of custom programming, and many of the theoretical sorts do a lot of computer modeling. MySQL is pretty easy to install on just about anything, and if you have a reasonable idea of what your data will lo

      • by tibit ( 1762298 )

        Umm, what? Any decent scientist these days should be able to program. Setting up an SQL database takes next to no time, as long as you use the right tools to do it. pgAdmin comes to mind, or the OpenOffice.org database front end. Now the problem may be that many scientists and engineers still don't know how to program, but that's another story.

    I agree that this is a candidate for a database. One problem with databases for researchers is that generally one does not know the right schema beforehand and one is dealing with ad hoc contingencies a lot. Another is portability to machines you don't control or that are not easily networked. A final problem is archival persistence. I can't think of a single database scheme that has lasted five, let alone ten, years and still functions without being maintained. Files can do that.

      So if you want some banda

        I agree that this is a candidate for a database. One problem with databases for researchers is that generally one does not know the right schema beforehand and one is dealing with ad hoc contingencies a lot. Another is portability to machines you don't control or that are not easily networked. A final problem is archival persistence. I can't think of a single database scheme that has lasted five, let alone ten, years and still functions without being maintained. Files can do that.

        The submitter didn't actually ask for a way to organize data; he asked for a way to organize files. He could easily create a database that points to whatever files he wants. Populate the relevant columns, and if he wants to add another type of data to search on, add a column. It's not like you need data in every cell, you simply need data in the cells that you want to find again.

          The submitter didn't actually ask for a way to organize data; he asked for a way to organize files. He could easily create a database that points to whatever files he wants. Populate the relevant columns, and if he wants to add another type of data to search on, add a column. It's not like you need data in every cell, you simply need data in the cells that you want to find again.

          And right now he's using a complicated mess of symlinks that amounts to a db schema that's probably a huge pain to maintain. Pick one straightforward way to organize files (e.g. date, with directories by month or something) and use the db for sifting through them to pick files by pH, lunar phase, and hair color (or whatever).

    • Re:Use databases! (Score:5, Interesting)

      by rumith ( 983060 ) on Sunday August 15, 2010 @12:54PM (#33257332)
      Hello, I'm a space research guy.
      I've recently made a comparison of MySQL 5.0, Oracle 10g and HDF5 file-based data storage for our space data. The results [google.com] are amusing (the linked page contains charts and explanations; pay attention to the final chart, it is the most informative one). In short (for those who don't want to look at the charts): using relational databases for pretty common scientific tasks sucks badly performance-wise.

      Disclaimer: I'm positively not a guru DBA, and I admit that both of the databases tested could be configured and optimized better, but the thing is that I'm not supposed to have to be one. Neither is the OP. While we may do lots of programming work to accomplish our scientific tasks, being a qualified DBA is a completely separate challenge - and an unwanted one.

      So far, PyTables/HDF5 FTW. Which brings us back to the OP's question about organizing these files...
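
      For anyone wondering what the PyTables route looks like, here is a minimal sketch (the group name, array contents, and pH attribute are invented for illustration, and the current tables API is assumed):

        import numpy as np
        import tables

        # One HDF5 file per project; one group per run, with metadata stored as attributes.
        with tables.open_file("project.h5", mode="w", title="Space data") as h5:
            run = h5.create_group("/", "run_2010_08_15", "Calibration run")
            h5.create_array(run, "spectrum", np.random.rand(4096), "Raw detector counts")
            run._v_attrs.pH = 7.4                 # arbitrary metadata travels with the data
            run._v_attrs.instrument = "spectrometer_A"

        # Reading back: walk the file and filter on metadata, no symlink forest required.
        with tables.open_file("project.h5", mode="r") as h5:
            for group in h5.root._f_iter_nodes("Group"):
                if getattr(group._v_attrs, "pH", None) == 7.4:
                    print(group._v_name, group.spectrum[:10])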

      • Re: (Score:3, Insightful)

        by Dynedain ( 141758 )

        Translation: I am not a DB guru, but I deal with massive amounts of complex data and need a DB guru, but I have no intention of hiring one.

        Seriously, hire a DB wizard in the DB software of your choice for a couple of days. Have him set up the data and optimize it. You'll save yourself a lot of headaches, AND put yourself in a good position for future data maintenance. Imagine that your project gets a lot of attention in the future, and you suddenly get a lot of funding and the money to hire more people. Or imagi

        • Re:Use databases! (Score:4, Interesting)

          by rockmuelle ( 575982 ) on Sunday August 15, 2010 @02:22PM (#33257776)

          I've built LIMS systems that manage petabytes of data for companies and small-scale data management solutions for my own research. Regardless of the scale, the same basic pattern applies: databases + files + programming languages. Put your metadata and experimental conditions in the database. This makes it easy to manage and find your data. Keep the raw data in files. Keep the data in a simple directory structure (I like instrument_name/project_name/date type hierarchies, but it doesn't really matter what you do as long as you're consistent) and keep pointers to the files in the database. Use Python/Perl/Erlang/R/Haskell/C/whatever-floats-your-boat for analysis.

          Databases are great tools when used properly. They're terrible tools when you try to shoehorn all your analysis into them. It's unfortunate that so few scientists (computer and otherwise) understand how to use them. Also, for most scientific projects, SQLite is all you need for managing metadata. Anything else and you'll be tempted to do your analysis in the database. Basic database design is not difficult to learn - it's about the same as learning to use a scripting language for basic analysis tasks.

          The main points:

          1) Use databases for what they're good at: managing meta-data and relations.
          2) Use programming languages for what they're good at: analysis.
          3) Use file systems for what they're good at: storing raw data.

          -Chris
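
          For what it's worth, a minimal sketch of that metadata-in-a-database, raw-data-in-files pattern with Python's built-in sqlite3 (the table layout, paths, and column names below are illustrative, not a prescription):

            import sqlite3

            conn = sqlite3.connect("catalog.db")
            conn.execute("""CREATE TABLE IF NOT EXISTS datasets (
                path       TEXT PRIMARY KEY,   -- the raw file itself stays on disk, untouched
                instrument TEXT,
                project    TEXT,
                run_date   TEXT,
                ph         REAL,
                sample     TEXT)""")

            # Register a raw file: the data never moves, only this row describes it.
            conn.execute("INSERT OR REPLACE INTO datasets VALUES (?, ?, ?, ?, ?, ?)",
                         ("nmr_600/membrane_project/2010-08-15/run042.fid",
                          "nmr_600", "membrane_project", "2010-08-15", 6.5, "sample_17"))
            conn.commit()

            # Finding data becomes a query instead of a walk through symlink trees.
            query = "SELECT path FROM datasets WHERE ph BETWEEN 6 AND 7 AND project = ?"
            for (path,) in conn.execute(query, ("membrane_project",)):
                print(path)   # hand the path to whatever analysis script you like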

        • Depending on the size and stability of the GPs research budget, that may not be practical. I worked on a fairly large academic research team (by EE/CS standards) that had the budget to hire a few full-time staff members for certain things. After the main implementation push the project wound down a bit, and those staff moved on to other jobs, leaving the grad students to maintain the infrastructure. That was fine as it was, but could have been massively not-fine if the staff had used complex tools that r

          • by tibit ( 1762298 )

            That's where documentation comes in handy. They did document the system enough so that a less-than-specialist could still keep it running, right?

            • In my case, yes, the system was exceedingly well documented, and also made use of standard tools (Makefiles, perl and bash scripts, etc.).

              But I don't think documentation is a panacea if the tool used is particularly rarified. Perhaps the DBA in question (this is purely hypothetical now) set something up using Oracle, and then left. Now, maybe it's easy enough to use as the interface for SQL queries and the like, but what happens if there are major reorganizations that really do require specialized know
      • Re: (Score:2, Interesting)

        by clive_p ( 547409 )
        As it happens I'm also in space research. My feeling is that what approach you take depends a lot on what sort of operations you need to carry out. Databases are good at sorting, searching, grouping, and selecting data, and joining one table with another. Getting your data into a database and extracting it is always a pain, and for practical purposes we found nothing to beat converting to CSV format (comma-separated-value). We ended up using Postgres as it had the best spatial (2-d) indexing, beating My
      • Consistently (Score:4, Insightful)

        by rwa2 ( 4391 ) * on Sunday August 15, 2010 @02:37PM (#33257854) Homepage Journal

        Word up. I'd say the first goal is to store your raw, bulk data consistently. Then you can have several sets of post processing scripts that all draw from the same raw data set.

        You want this data format to be well-documented, but I wouldn't bother meticulously marking it up with XML tags and other metadata or whatever. You just want to be able to read it fast, and have other scripts be able to convert it into other formats that would be useful for analysis, be it matlab, octave, csv, or some tediously marked-up XML. You do want to be able to grep and filter the data pretty easily, so keep that in mind when you're designing the format. It will likely end up being pretty repetitive, but that's OK, since you'll likely store it compressed. That can improve performance when reading it, since the storage medium you're pulling the data from is often slower than the processor doing the decompression... and it also provides some data integrity / consistency checking. Oh, and of course, you can store more raw data if it's compressed.

      • Re: (Score:3, Interesting)

        by jandersen ( 462034 )

        ...using relational databases for pretty common scientific tasks sucks badly performance-wise.

        Well, it has never been a secret that relational databases do not perform as well as, e.g., a bespoke ISAM or hash-indexed data file, mostly due to the fact that they involve interpretation of SQL. But then the main purpose of SQL databases has never been to optimise raw performance - rather the idea is to provide maximum logical flexibility. The beauty of relational databases is that you can change both data and metadata at the wave of a hand, where in the high-performing, bespoke database you have to

    • "If you have something more complex than a flat file, then use relational databases. Even Access databases are better than a collection of text files."

      Ever tried to access big blob fields from Access (pun intended)?

      The guy has a lot of data. Therefore he needs a database. He already has one with hierarchical storage (the filesystem) that probably conveys the way data is generated (hierarchically, if only by date).

      Then he needs to access the data by different criteria. Are they relational? Then a rela

    • http://sourceforge.net/projects/vym/ [sourceforge.net]

      I think using a flat file and this would be helpful. Maybe. I don't know exactly what the data is, however, so this may not be very helpful.

    • Can I ask a dumb question as a completely database-naive person? I'm an occasional scientist, and I have a fair bit of experimental data lying around, stored as files. I've had it suggested to me before that I clearly need to get the data into a database (and that our center in fact should force people to store their raw data this way). At the same time, my daily workflow revolves around files. I use software that opens and deals with files. Some of the software expects to find the files organized in d

  • by mangu ( 126918 ) on Sunday August 15, 2010 @12:16PM (#33257126)

    In my experience, the best thing is to let the structure stand as it was the first time you stored the data.

    Later, when you discover more and more relationships around that data, you may create, change, and destroy those symbolic links as you wish.

    I usually refrain from moving the data itself. Raw data should stand untouched, or you may delete it by mistake. Organize the links, not the data.

    • by DamonHD ( 794830 )

      Absolutely agreed.

      And indeed my biggest 'data' collection, over a decade old, is all directly exposed to the Web, and other than taking care of significant errors in the original naming, I've followed the "Cool URIs Don't Change" mantra and had the presentation app tie itself in knots if need be to leave the data as-is and present it in any new ways required.

      It's like having an ancient monument: you don't shuffle it around to suit the latest whim, else you'll most likely mess it up beyond repair.

      Rgds

      Damon

  • sqlite (Score:4, Insightful)

    by sugarmotor ( 621907 ) on Sunday August 15, 2010 @12:23PM (#33257162) Homepage

    http://www.sqlite.org/ [sqlite.org] a "replacement for fopen()" -- http://www.sqlite.org/about.html [sqlite.org]

  • SQL comes in really handy. I can imagine several simple scripts + an SQLite index table. Or anything else.

  • Matlab Structures (Score:4, Interesting)

    by Anonymous Coward on Sunday August 15, 2010 @12:24PM (#33257170)

    I have to organize and analyze 100 GB of data from a single day's experiment. Raw data goes on numbered HDs that are cloned and stored in separate locations. This data is then processed and extracted into about 1 GB of info. From there the relevant bits are pulled into hierarchical MATLAB structures. Abstract everything that you can into a MATLAB structure array. Have all your analysis methods embedded in the classes that compose the array. Use SVN to maintain your analysis code base.

    • Re:Matlab Structures (Score:5, Interesting)

      by pz ( 113803 ) on Sunday August 15, 2010 @08:09PM (#33259580) Journal

      I have to organize and analyze 100 GB of data from a single day's experiment. Raw data goes on numbered HDs that are cloned and stored in separate locations. This data is then processed and extracted into about 1 GB of info. From there the relevant bits are pulled into hierarchical MATLAB structures. Abstract everything that you can into a MATLAB structure array. Have all your analysis methods embedded in the classes that compose the array. Use SVN to maintain your analysis code base.

      Yes, yes, yes.

      I have very similar data collection requirements and strategy with one exception: the data that can be made human-readable in original format are made so. Always. Every original file that gets written has the read-only bit turned on (or writeable bit turned off, whichever floats your boat) as soon as it is closed. Original files are NEVER EVER DELETED and NEVER EVER MODIFIED. If a mistake is discovered requiring a modification to a file, a specially tagged version is created, but the original is never deleted or modified.

      Also, every single data file, log file, and whatever else that needs to be associated with it is named with a YYMMDD-HHMMSS- prefix and, since experiments in my world are day-based, is put into a single directory called YYMMDD. I've used this system now for nearly 20 years and have not screwed up by using the wrong file yet. Files are always named in a way that (a) doing a directory listing with alpha sort produces an ordering that makes sense and is useful, and (b) there is no doubt as to what experiment was done.

      In addition, every variable that is created in the original data files has a clear, descriptive, and somewhat verbose name that is replicated through in the MATLAB structures.

      Finally, and very importantly, the code that ran on the data collection machines is archived with each day's data set so that when bugs are discovered we can know EXACTLY which data sets were affected. As a scientist, your data files are your most valuable possessions, and need to be accorded the appropriate care. If you're doing anything ad hoc after more than one experiment, then you aren't putting enough time into devising a proper system.

      (I once described my data collection strategy to a scientific instrument vendor and he offered me a job on the spot.)

      I also make sure that when figures are created for my papers I've got a clear and absolutely reproducible path from the raw data to the final figures that include ZERO manual intervention. If I connect to the appropriate directory and type "make clean ; make", it may take a few hours or days to complete, but the figures will be regenerated, down to every single arrow and label. For the aspiring scientist (and all of the people working in my lab who might be reading this), this is perhaps the most important piece of advice I can give. Six months, two years, five years from now when someone asks you about a figure and you need to understand how it was created, the *only* way of knowing that these days is having a fully scripted path from raw data to final figure. Anything that required manual intervention generally cannot be proven to have been done correctly.
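
      A small sketch of the naming and read-only discipline described above, in Python (the payload and label are invented; the prefix format follows the YYMMDD-HHMMSS convention):

        import os
        import stat
        import time

        def write_raw(payload: bytes, label: str) -> str:
            """Write an original data file with a YYMMDD-HHMMSS- prefix, then lock it read-only."""
            stamp = time.strftime("%y%m%d-%H%M%S")
            day_dir = time.strftime("%y%m%d")          # one directory per experiment day
            os.makedirs(day_dir, exist_ok=True)
            path = os.path.join(day_dir, f"{stamp}-{label}.dat")
            with open(path, "xb") as f:                # "x" refuses to overwrite an existing file
                f.write(payload)
            os.chmod(path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)   # clear all write bits
            return path

        print(write_raw(b"electrode_voltage,1.234\n", "electrode_A"))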

  • by JamesP ( 688957 ) on Sunday August 15, 2010 @12:25PM (#33257172)

    OK, the subject line is the short answer; here's the longer answer.

    Since experimental data usually doesn't have the same structure for all experiments, you may try something like this:

    At the deepest, most basic level, organize it using JSON or XML (I don't know what kind of experiment you do, but you would put lists of data, etc.).

    Then you store this in a NoSQL db (like CouchDB or Redis) and index it the way you like; still, if you don't index, you can always search it manually (slower, still...).

  • by ccleve ( 1172415 ) on Sunday August 15, 2010 @12:25PM (#33257174)

    Instead of trying to organize your data into a directory structure, use tagging instead. There's a lot of theory on this -- originally from library science, and more recently from user interface studies. The basic idea is that you often want your data to be in more than one category. In the old days, you couldn't do this, because in a library a book had to be on one and only one shelf. In this digital world you can put a book on more than one "shelf" by assigning multiple tags to it.

    Then, to find what you want, get a search engine that supports faceted navigation. [wikipedia.org]

    Four "facets" of ten nodes each have the same discriminatory power as a single hierarchy of 10,000 nodes. It's simpler, cleaner, faster, and you don't have to reorganize anything. Just be careful about how you select the facets/tags. Use some kind of controlled vocabulary, which you may already have.

    There are a bunch of companies that sell such search engines, including Dieselpoint, Endeca, Fast, etc.
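
    For the database-inclined, a facet scheme can be as small as one tag table plus a join; a hedged sketch in Python/SQLite (the facet names, values, and paths are invented):

      import sqlite3

      conn = sqlite3.connect(":memory:")
      conn.executescript("""
          CREATE TABLE datasets (id INTEGER PRIMARY KEY, path TEXT UNIQUE);
          CREATE TABLE tags     (dataset_id INTEGER, facet TEXT, value TEXT);
      """)

      conn.execute("INSERT INTO datasets (path) VALUES ('runs/2010-08-15/run042.dat')")
      ds = conn.execute("SELECT id FROM datasets").fetchone()[0]
      conn.executemany("INSERT INTO tags VALUES (?, ?, ?)",
                       [(ds, "sample", "protein_X"), (ds, "pH", "6.5"), (ds, "method", "NMR")])

      # Faceted query: datasets carrying BOTH requested tags (intersect the facet matches).
      rows = conn.execute("""
          SELECT d.path FROM datasets d
          JOIN tags t ON t.dataset_id = d.id
          WHERE (t.facet = 'sample' AND t.value = 'protein_X')
             OR (t.facet = 'method' AND t.value = 'NMR')
          GROUP BY d.id HAVING COUNT(DISTINCT t.facet) = 2""").fetchall()
      print(rows)   # [('runs/2010-08-15/run042.dat',)]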

    • by Yvanhoe ( 564877 )
      What ccleve said.

      I would also add that if your datasets are like mine (enormous), it may be a good idea to md5 them. Have a log file where you enter information about each file: its creation date, its md5, the nature of the data, its source, etc.

      Do not use name or path hierarchies to keep track of metadata; it is doomed to fail. If you feel it is worth the effort you can set up a database for this info, but in my opinion if you have hundreds to thousands of files, a simple flat file can be good enough.
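
      A minimal sketch of such a log in Python (one tab-separated line per dataset; the file name and field choices are just one way to do it):

        import hashlib
        import os
        import time

        LOG = "dataset_log.tsv"

        def register(path: str, nature: str, source: str) -> None:
            """Append creation date, MD5 and free-form metadata for one data file to the log."""
            md5 = hashlib.md5()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):   # hash in 1 MB chunks
                    md5.update(chunk)
            created = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(os.path.getmtime(path)))
            with open(LOG, "a") as log:
                log.write("\t".join([path, created, md5.hexdigest(), nature, source]) + "\n")

        register("runs/run042.dat", "NMR titration", "spectrometer_A")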
  • Instead of symlinking to directories,
    create directories of hard links to the files.

    Then you can move files around whenever you like,
    and you never have any dangling links.
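
    A sketch of that approach in Python (directory names are illustrative; note that hard links require the tag directories to live on the same filesystem as the data):

      import os

      def tag(data_file: str, *tags: str) -> None:
          """Hard-link data_file into one directory per tag; the file itself never moves."""
          for t in tags:
              os.makedirs(os.path.join("tags", t), exist_ok=True)
              link = os.path.join("tags", t, os.path.basename(data_file))
              if not os.path.exists(link):
                  os.link(data_file, link)    # same inode, so links never dangle

      tag("raw/run042.dat", "ph6.5", "protein_X", "nmr")
      # Even if raw/run042.dat is later renamed or removed, every tags/<tag>/run042.dat
      # still points at the same data.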

    • It depends how you update the files. Many systems, when updating a file, will write the entire new file to a temporary location, then atomically rename it on top of the old location, which would kill any hardlinks, but symlinks would still work.

      I have to agree with the database suggestions, though something NoSQL-ish may work better.

    • by vrmlguy ( 120854 )

      Instead of symlinking to directories,
      create directories of hard links to the files.

      Then you can move files around whenever you like,
      and you never have any dangling links.

      I second this. I have a big collection of photos that I've downloaded over the years, and I "tag" them via hard-links into directories. The same photo may be found under "party/jane/nyc", "party/nyc/jane", "nyc/party/jane", "nyc/jane/party", "jane/party/nyc" and "jane/nyc/party". If two people are in the photo, that's twenty-four links, but I have Perl scripts that take care of the grunt work; a picture with N tags will have "just" N! links. I don't link photos to intermediate directories, but all pictu

  • ...but then Google came along and taught me that it's not about knowing where things are, but rather about being able to find them. My email, for instance, is "organized" by the year in which it arrives, and I use the search function of my email client to find things. No big folder structure, no moving messages around, and I haven't had problems finding any email I need. Oh yes, I keep them all... good fodder for "on x/x/xxx you said..." retorts.

    For files, then, the key is to have descriptive file names that p

  • Life is too short. Get someone else to do it, under the guise of valuable field experience.

    • Re:Interns. (Score:4, Informative)

      by spacefight ( 577141 ) on Sunday August 15, 2010 @12:37PM (#33257244)
      Yeah right, let the interns do the job. Not. Interns use new tools no one understands, then finish the project during their term, then move on and leave a most probably buggy or unfinished project behind. Pity the person who has to clean up the mess. Better to do the job on your own and know the tools, or hire someone permanently for the whole department.
  • The present (and the future) of experimental data organisation, repurposing, re-analysing, etc. is being shifted towards Linked Data [linkeddata.org] and supporting graph [openrdf.org] data [franz.com] stores [openlinksw.com]. Give it a spin.
    • Yes, this! I work in bioinformatics, and while relational DBMSes are used by a large number of projects, the problem you face with experimental data is that its organisation is non-obvious and relationships between different bits of data are not apparent. This means that RDBMSes aren't a natural fit. One of my colleagues has been experimenting with graph oriented databases (specifically neo4j (http://neo4j.org/)), an approach which has interesting intersections with declarative programming. In the near futu

  • I never understood how you can have something organized or not.
    I organize my stuff at the planetary level.
    Universe > Solar System > Earth > Continent > United States > Florida > County > City > Street > House > Room > Desk > Computer > Hard Drive > Folder > File Type > Location
    I think I'm pretty well organized even though I misplace stuff all the time.

    • by TimSSG ( 1068536 )
      You forgot the Galaxy. Tim S.
      • by Rivalz ( 1431453 )

        I skip the little things... Plus I do not recognize U.S. soccer teams as a tool for organization. I had to remove Galaxy from my organizational charts when they formed the L.A. Galaxy. That little naming convention set me back 10 years' worth of organization. I became confused about which galaxy I lived in and had to seek extensive therapy. I feel a relapse coming on.

        But honestly thanks for the clarification, I actually forgot a little thing like galaxies.

  • by moglito ( 1355533 ) on Sunday August 15, 2010 @12:34PM (#33257226)
    You may want to consider a scientific workflow system. These systems handle both data storage (including metadata and provenance -- where the data came from) and the design and execution of computational experiments. If you are concerned about the complexity of the metadata (e.g., pH value) and would like to make sure you are able to sort things according to it, you may want to give "Wings" a try. You can try out the sandbox to get an idea: http://wind.isi.edu/sandbox [isi.edu].
  • "Search, don't sort".

    The size and complexity of your data management should match the size and complexity of your data set. If you have thousands of datasets, give serious consideration to a relational database. Store all of your metadata (pH, date, etc) in the database so you can query it easily. If your raw data lives in a text-based format, put it in the database too, otherwise just store the path to your file in the database and keep your files in some sort of simple date-based archive or whatever.

    No

  • by arielCo ( 995647 ) on Sunday August 15, 2010 @12:35PM (#33257236)
    $PRJ_ROOT/data/theoretical
    $PRJ_ROOT/data/fits
    $PRJ_ROOT/data/doesnt_fit
    $PRJ_ROOT/data/doesnt_fit/fixed
    $PRJ_ROOT/data/made_up
    • Re: (Score:3, Funny)

      Oh, come on! Who let the climatologists in here?

  • is to use CSV (comma/tab separated value) files to store the data. This makes it easy to import into a spreadsheet or database in the future as your needs grow.

  • by gmueckl ( 950314 ) on Sunday August 15, 2010 @12:49PM (#33257306)

    To everybody here suggesting relational databases: you are on the wrong track here, I'm afraid to tell you. Relational databases handle large sets of completely homogeneous data well if you can be bothered to write software for all the I/O around them. This is where it all falls apart:

    1. Many lab experiments don't give you the exact same data every time. You often don't do the same experiment over and over. You vary it and the data handling tools have to be flexible enough to cope with that. Relational databases aren't the answer to that.

    2. Storing and fetching data through database interfaces is vastly more difficult than just using standard input/output or plain text files. I've written hundreds of programs for processing experimental data and I can tell you: nothing beats plain old ASCII text files! The biggest boon is that they are compatible with almost any scientific software you can get your hands on. Your custom database likely is not. Or how would you load the contents of your database table into gnuplot, Xmgrace or Origin, just to name a few tools that come to my mind right now?

    I wish I had a good answer to the problem. At times I wished for one myself, but I fear the best reply might still be "shut up and cope with it".

    • Storing and fetching data through database interfaces is vastly more difficult than just using standard input/output or plain text files. I've written hundreds of programs for processing experimental data and I can tell you: nothing beats plain old ASCII text files!

      Well, I believe UTF-8 encoded text files beat them hands down. Especially for scientific research. //snarky

      • by jc42 ( 318812 )

        I can tell you: nothing beats plain old ASCII text files!

        Well, I believe UTF-8 encoded text files beat them hands down. Especially for scientific research. //snarky

        Hey, no need to be snarky about it!

        I've run into a number of tasks where we "standardized" on ASCII plain-text files, and then ran across problems like people and place names in French, Russian, Greek, and/or Chinese. It was a real relief when we officially replaced the "ASCII" with "UTF-8" throughout the docs. The only problem then was finding

    • IMHO best answer so far

      What I am looking for is multidimensional addressing - a file or database system.

      something like multiple B-trees for the content, with the possibility to add another B-tree index if required later.

      Maybe the Google people have an answer?

    • Informative?! This post is so, so terribly wrong.

      To everybody here suggesting relational databases: you are on the wrong track here, I'm afraid to tell you. Relational databases handle large sets of completely homogeneous data

      Wrong. We're not on the wrong track. Databases don't only handle "homogeneous data" sets. You just don't know how to use them flexibly.

      if you can be bothered to write software for all the I/O around them

      Wrong. Databases abstract away I/O primitives and file formats, making creating/accessing your data much simpler than using (e.g.) flat text files.

      nothing beats plain old ASCII text files!

      Wrong. A great many things beat flat text files, under a great many use cases. The capabilities of (e.g.) a sqlite database are a strict (and much larger) superset of those o

      • by gmueckl ( 950314 ) on Sunday August 15, 2010 @07:12PM (#33259296)

        I hereby humbly suggest that you are, instead, wrong. Here is why:

        Scientists are not software developers and don't want to be. They want to run their experiments and analyse their data. The latter requires recording and processing of numerical data. This is where computers enter their workflow - as number-crunching tools that have to be easy to use and utterly flexible.

        At times, my work consisted of writing lots of one-off C and Python programs to process data in ever new ways in order to get an idea what I was actually looking at. And I had to write them myself because these weren't your run-of-the-mill analysis steps. Many of these programs were not run again once I had their results. During all this time, I as a scientist was looking to get the data in and out of the programs in ways that are easy to code without getting distracted from what I wanted to achieve scientifically. My head was full of theory and formulas, not data structures and good software design.

        In that particular state of mind, writing SQL isn't one of the things that I would have wanted to spend any time on. The inherent complexities are a distraction and a big one at that. And, hell, I'm one of the guys who actually *know* SQL. Most scientists actually don't. Hell, many of them barely know how to use their favorite language's core libs to their advantage. They don't care and - may I say - rightly so.

        Besides, the code would get more bloated. If I want to output three values that belong together I write a print statement that places them on the same line of text in the output file and I'm done. That's a one-liner that takes me about 20 seconds to type in. In the worst case, I need to open a file beforehand and close it afterwards instead of piping it into stdout. That's maybe 3 lines of code. Now tell me: how many lines of code do I need to write to place these values in a database? That is, provided that a table already exists to hold that data.

        My point is: relational databases don't do the job for scientists. Instead, they get in the way. And you and anyone else here who is arguing in favor of them probably lack the related experience to understand that - no offense intended. The points you make are derived from pure theory. Respect the needs of the users as well, please.

        Maybe there is a middle ground here: hire a software developer who builds and maintains the DB and a nice, convenient to use wrapper library around it for you. That'll take a while and someone will have to foot the bill for it.

  • A lot depends on the type of data. If it is truly experimental results, then results could be easily organized in tables, and tables can be logically accessed, arranged and manipulated using standard rules of set theory. Relational databases work this way, but there are other approaches.

    If your data is derived or crunched, you may have a massive logic problem. See this: http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html [blogspot.com] , and take heed.

    The previous suggestions about leaving you

    • Sorry, I forgot to include the fact that the network dbms system does not require you to rename or re-link your directory scheme. It simply creates pointers to relevant links and then maintains the pointer logic.

  • I would just put each set of experimental data in a separate subdirectory. Within each subdirectory I'd put a file with specific name (e.g., "description.txt") in which you briefly write up exactly what the experimental data is, how it was generated (e.g., if generated by a program, give the arguments and/or pointers to input data), and some keywords to allow it to be indexed/searched. Then I'd use your standard OS search tools to find the description file(s) you're looking for, thereby allowing you to loca
  • by ericbg05 ( 808406 ) on Sunday August 15, 2010 @01:10PM (#33257402)
    Others have already mentioned SQLite. Let me briefly expound on the features that are likely the most important to you, assuming (if you'll permit me) that you don't have much experience with databases:
    0. The basic idea here is that you are replacing this whole hierarchy of files and directories with a single file that will contain all your data from an experiment. You figure out ahead of time what data the database will hold and specify that to SQLite. Then you can create, update, read, and destroy records as you see fit--pretty much as many records as you want. (I personally have created billions of records in a single database, though I'm sure you could make more.) Once you have records in your database, you can with great flexibility define which result sets you want from the data. SQLite will compute the result sets for you.
    1. SQLite is easy to learn and use properly. This is as opposed to other database management systems, which require you to do lots of computery things that are probably overkill for you.
    2. Your entire data set sits in a single file. If you're not in the middle of using the file, you can back up the database by simply copying the file somewhere else.
    3. Transactions. You can wrap a large set of updates into a single "transaction" (a short sketch follows at the end of this comment). These have some nice properties that you will want:
      3.1. Atomic. A transaction either fully happens or (if e.g. there was some problem) fully does not happen.
      3.2. Consistent. If you write some consistency rules into your database, then those consistency rules are always satisfied after a transaction (whether or not the transaction was successful).
      3.3. Isolated. (Not likely to be important to you.) If you have two programs, one writing a transaction to the database file while the other reads it, then the reader will either see the WHOLE transaction or NONE of it, even if the writer and reader are operating concurrently.
      3.4. Durable. Once SQLite tells you the transaction has happened, it never "un-happens".
      These properties hold even if your computer loses power in the middle of the transaction.
    4. Excellent scripting APIs. You are a physical sciences researcher -- in my experience this means you have at least a little knowledge of basic programming. Depending on what you're doing, this might greatly help you to get what you need out of your data set. You may have a scripting language that you prefer -- if so, it likely has a nice interface to SQLite. If you don't already know a language, I personally recommend Tcl -- it's an extremely easy language to get going with, and it has tremendous support directly from the SQLite developers.

    Good luck and enjoy!
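
    To make point 3 concrete, here is roughly what a transaction looks like from Python's built-in sqlite3 module (the table and values are made up):

      import sqlite3

      conn = sqlite3.connect("experiment.db")
      conn.execute("CREATE TABLE IF NOT EXISTS measurements (run TEXT, ph REAL, value REAL)")

      try:
          with conn:   # opens a transaction; commits on success, rolls back on any exception
              conn.execute("INSERT INTO measurements VALUES ('run042', 6.5, 0.173)")
              conn.execute("INSERT INTO measurements VALUES ('run042', 6.5, 0.181)")
              # If an exception is raised (or the power fails) in here, neither row is kept:
              # the update is atomic -- all or nothing.
      except sqlite3.Error as exc:
          print("transaction rolled back:", exc)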

  • What about a wiki? (Score:3, Insightful)

    by gotfork ( 1395155 ) on Sunday August 15, 2010 @01:12PM (#33257416) Homepage
    In my previous lab group we used a MediaWiki install to keep track of microelectronic devices that several people were working on at the same time. These devices were still under development so most of the data was qualitative -- images, profilometry data, and IV/CV curves were all stored on the wiki page for each sample, and each page included a recipe for exactly how it was made, which made it easy to troubleshoot later. It worked pretty well for what we used it for, but once we had a working device all the in-depth data for that sample was kept separately. This seemed like a half-decent way of cataloging samples, although one would need something a bit more robust for complex data sets that don't integrate well with a wiki.
  • You need to start using a database. You don't have to actually put the data in a database, but all of the metadata needs to go into one. Store your data files in one file system using whatever naming scheme you want and never move the files again. At the same time, record the file system location along with all other metadata that is relevant. Then some simple database queries, e.g. embedded in some web pages, can retrieve the location and even the data. You can of course also store the data in a databa

  • by wealthychef ( 584778 ) on Sunday August 15, 2010 @01:23PM (#33257478)
    If you are using Mac OS X, you can tag the files by using the Finder's Get Info window and putting keywords in the "Spotlight Comments" field. Then you can easily find them based on keywords and Spotlight in constant time. The good thing about keywords is that they give you a multidimensional database effect. The bad thing I've found is I tend to forget the keywords I'm storing stuff with, so I don't really know what to search for. OS X Spotlight is promising and might work very well for you.
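
    Those comments can also be queried from scripts; a hedged sketch using the stock mdfind tool from Python (the keyword is illustrative, and kMDItemFinderComment is my assumption about the metadata attribute backing the Get Info comment field):

      import subprocess

      # Ask Spotlight for every file whose Finder comment mentions a keyword (macOS only).
      keyword = "titration"
      result = subprocess.run(
          ["mdfind", f'kMDItemFinderComment == "*{keyword}*"'],
          capture_output=True, text=True, check=True)
      for path in result.stdout.splitlines():
          print(path)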
  • How CMS sorts data (Score:2, Informative)

    by toruonu ( 1696670 )

    Well, CMS is one of the large experiments at the LHC. The data produced should reach petabytes per year, and adding the simulated data on top of that, we have a helluva lot of data to store and address. What we use is a logical filename (LFN) format. We have a global "filesystem" where different storage elements have files in a filesystem organized in a hierarchical subdirectory structure. As an example: /store/mc/Summer10/Wenu/GEN-SIM-RECO/START37_V5_S09-v1/0136/0400DDE2-F681-DF11-BA13-00215E21DC1E.root

    the /store is a b

  • I have seen these kinds of situations happen a lot (I'm a statistician who works on computationally-intensive physical science applications), and the best solution I have seen was a BerkeleyDB setup. One group I work with had a very, very large number of ASCII data files (order of 10-100 million) in a directory tree. One of their researchers consolidated them to a BerkeleyDB, which greatly improved data management and access. CouchDB or the like could also work, but I think the general idea of a key-value s
    • by jgrahn ( 181062 )

      I have seen these kinds of situations happen a lot (I'm a statistician who works on computationally-intensive physical science applications), and the best solution I have seen was a BerkeleyDB setup. One group I work with had a very, very large number of ASCII data files (order of 10-100 million) in a directory tree. One of their researchers consolidated them to a BerkeleyDB, which greatly improved data management and access [...] I think the general idea of a key-value store that lets you keep your data in

  • I haven't had to store experimental results like that. My work produces prototypes, some data, demos and support documentation. There are tons of KM tools out there to manage heterogeneous data in a recoverable way. We've used document repositories like Hummingbird (acceptable) and of course SharePoint. The key (literally) is including the right metadata and tags when you check in the element. When a data set goes dormant (static) you can tarball the CVS tree or whatever and drop it in the repo. Then t
  • by RandCraw ( 1047302 ) on Sunday August 15, 2010 @01:28PM (#33257514)

    First I would lay out your data using meaningful labels, like a directory named for the acquisition date + machine + username. Never change this. It will always remain valid and allow you to later recover the data if other indexes are lost. Then back up this data.

    Next build indexes atop the data that semantically couple the components in the ways that are meaningful or accessible. This may manifest as indexed tables in a relational database, duplicate flat files linked by a compound naming convention, unix directory soft links, etc.

    If you're processing a lot of data, your choice of indexes may have to optimize your data access pattern rather than the data's underlying semantics. Optimize your data organization for whatever is your weakest link: analysis runtime, memory footprint, index complexity, frequent data additions or revisions, etc.

    In a second repository, maintain a precise record of your indexing scheme, and ideally, the code that automatically re-generates it. This way you (or someone else) can rebuild lost databases/indexes without repeating all your design and data cleansing decisions, and domain expertise. This info is often stored in a lab notebook (or nowadays in an 'electronic lab notebook').

    I'd emphasize that if you can't remember how your data is laid out or pre-conditioned, your analysis of it may be invalid or unrepeatable. Be systematic, simple, obvious, and keep records.
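
    As a sketch of the 'stable layout plus rebuildable index' idea (the date_machine_user directory convention and the CSV columns below are just an example, not a prescription):

      import csv
      import os

      # Walk a raw-data tree laid out as raw/<date>_<machine>_<user>/... and regenerate a
      # flat index; the raw directories themselves are never renamed or touched.
      with open("index.csv", "w", newline="") as out:
          writer = csv.writer(out)
          writer.writerow(["date", "machine", "user", "relative_path", "size_bytes"])
          for entry in sorted(os.listdir("raw")):
              parts = entry.split("_", 2)            # e.g. "20100815_nmr600_digitalderbs"
              if len(parts) != 3:
                  continue
              date, machine, user = parts
              for root, _, files in os.walk(os.path.join("raw", entry)):
                  for name in files:
                      path = os.path.join(root, name)
                      writer.writerow([date, machine, user,
                                       os.path.relpath(path, "raw"), os.path.getsize(path)])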

  • You need a LIMS (Score:2, Insightful)

    by pigreco314 ( 194375 )

    A Laboratory Information Management System [wikipedia.org] will help you store, organize, analyze and data-mine your data.

  • It seems you have never heard of LIMS (Laboratory Information Management Systems), which is unfortunate.

    This is a thriving software sector, and you are actually expected to be at least vaguely familiar with these kind of systems should you ever transfer to industry and work in data-generating or data-processing positions.

    Nobody in industry keeps experimental data as individual, handcrafted datasets. The risk of losing important data, or not being able to make cross-references (patents!), is much too high if you let pe

  • I used to have a two word answer for this question: Use BeOS

    But now it's a six word answer (*sigh*): Invent time machine, then use BeOS

    --MarkusQ

    • by yyxx ( 1812612 )

      Wrong two word answer. I seriously doubt BeOS was ever used much for scientific research.

      The correct two word answer is: "lab notebook".

    • Seven words: Invent backwards time machine, then use BeOS.
  • It's already been said, but it bears saying again. Directories and symlinks.... oh my!

  • Here's what I do:

    Directory for each data set, labeled by date (20100815).
    Short README file inside each directory with description of the run.
    Big spreadsheet (or database, if you're fancy) with experimental parameters and core results, that can be sorted, reorganized, and graphed.
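
    One way to keep that spreadsheet honest is to regenerate it from the per-run README files; a sketch in Python (the YYYYMMDD layout and README name follow the scheme above, the CSV columns are illustrative):

      import csv
      import os

      # Collect the first line of every README into one sortable summary table.
      with open("run_summary.csv", "w", newline="") as out:
          writer = csv.writer(out)
          writer.writerow(["run_date", "description"])
          for entry in sorted(os.listdir(".")):
              readme = os.path.join(entry, "README")
              if entry.isdigit() and len(entry) == 8 and os.path.isfile(readme):
                  with open(readme) as f:
                      writer.writerow([entry, f.readline().strip()])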

  • by rsborg ( 111459 ) on Sunday August 15, 2010 @03:22PM (#33258104) Homepage
    Document Management Systems [wikipedia.org] are great - they combine (some of) the benefits of source control, file systems, and email (collaboration).

    I would recommend just downloading a VM or cloud image of something like Knowledge Tree or Alfresco (I personally prefer Alfresco), and running it on the free VMware Player or a real VM solution if you have one.

    I recently set up a demo showing the benefits of such a system. I was able to, in about one day, download and set up Alfresco, expose a CIFS interface (i.e., \\192.168.x.x\documents) and just dump a portion of my entire document base into the system. After digestion, the system had all the documents full-text-indexed (yes, even Word docs and Excel files, thanks to OpenOffice libraries), and I could go about changing the directory structure, moving around and renaming files, etc., and the source control would show me the changes. In fact, I could go into the backend and write SQL queries if I wanted, with detailed reports of how things were on date X or Y revisions ago. It was quite sweet. All the while, the users still saw the same Windows directory structure, and modifications they made there would be versioned and tracked in Alfresco's database.

    Here is a Bitnami VM image [bitnami.org]; it will save you days of configuration. If the solution works for you but is slow, just DL the native stack and migrate or re-import.

  • A personal wiki that runs from one file: I link my files from there and I can add documentation and references at will. http://www.tiddlywiki.com/ [tiddlywiki.com]

  • Just give all your datasets a number and put them in a database so you can search on all criteria you want.
    You can also use Excel to keep track of your metadata if you want...

  • Great comments (Score:3, Insightful)

    by gringer ( 252588 ) on Sunday August 15, 2010 @07:15PM (#33259314)

    Reading these comments has changed my thoughts on data storage a little bit, but has reinforced my idea that databases are a bad idea for this sort of thing.

    The main issues I have with using databases are file size (I store and convert text files that are 10-100 MB zipped) and mutability (generated data doesn't typically change; I just add new experiments on top of other data). A secondary issue is that for plain-text data files (or plain-text-convertible data files), writing code is easier when you don't have to bother with a database middleman.

    So, if I were to do [another] large research project in the future, here's my thoughts on what I would consider an appropriate approach:

    • Use a file system, rather than a database
    • New data gets put in a consistent place, e.g. /data/<date>/<source>/
    • Back up this data. Assuming immutable files, incremental backups don't make much sense.
    • Symlink categories to the original data
    • When a file is referenced (or attached) in an email, keep a link between email date and file, e.g. /categories/<email>/<date>/<sink>/
    • Maintain (preferably autogenerate; see the sketch below) and back up a plain-text file linking categories to files. This will help when data gets lost (i.e. accidentally deleted).

    My most common uses for old data are re-running analyses (generating new data as results), and sending data to someone else. It helps to be able to make those things as quick as possible.
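
    For the autogenerated category listing mentioned above, a minimal sketch in Python (paths follow the /categories layout suggested in the list; adjust to taste):

      import os

      # Walk /categories, resolve every symlink, and write a flat backup listing so the
      # category-to-file mapping survives even if symlinks are accidentally deleted.
      with open("category_index.txt", "w") as out:
          for root, dirs, files in os.walk("/categories"):
              for name in dirs + files:
                  link = os.path.join(root, name)
                  if os.path.islink(link):
                      out.write(os.path.relpath(link, "/categories") + "\t"
                                + os.path.realpath(link) + "\n")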

  • by floydman ( 179924 ) <floydman@gmail.com> on Sunday August 15, 2010 @08:00PM (#33259522)

    I am a programmer who works closely with scientists on scientific computing in the fields of fluid mechanics and aerodynamics simulation.
    Your question is really not clear. In both of the fields I work in, the requirements vary vastly, and they also vary across the users I support (over 100 scientists). Some of them have huge data sets, spanning up to 600 GB per file; a single simulation run can give a geologist a 1 TB file.
    Others have a few hundred MB of data. Each is handled differently.
    The data itself can be parsed and stored in a DB for analysis in some cases; in others, that is very impractical and will slow down your work.
    Each scientist has a different way of doing things.

    So, bottom line: if you want any useful answers, be more specific. What field of science (I can tell you are a chemist?), what simulations/tests do you use, how fine are your models, what are your data sets and what is their format, and what kind of data are you interested in? You should also seriously consider an archiving solution, because I guarantee you will run out of space.
