How Do You Organize Your Experimental Data? 235
digitalderbs writes "As a researcher in the physical sciences, I have generated thousands of experimental datasets that need to be sorted and organized — a problem which many of you have had to deal with as well, no doubt. I've sorted my data with an elaborate system of directories and symbolic links to directories that sort by sample, pH, experimental type, and other qualifiers, but I've found that through the years, I've needed to move, rename, and reorganize these directories and links, which has left me with thousands of dangling links and a heterogeneous naming scheme. What have you done to organize, tag and add metadata to your data, and how have you dealt with redirecting thousands of symbolic links at a time?"
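Before the thread weighs in: the dangling-link audit itself is easy to script. A minimal Python sketch (the function name and scratch-directory demo are my own, not from the submission):

```python
import os
import tempfile
from pathlib import Path

def find_dangling_links(root):
    """Walk a directory tree and collect symlinks whose targets no longer exist."""
    dangling = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            p = Path(dirpath) / name
            # is_symlink() looks at the link itself; exists() follows it.
            if p.is_symlink() and not p.exists():
                dangling.append(p)
    return dangling

# Tiny demonstration in a scratch directory: one live link, one dangling one.
root = Path(tempfile.mkdtemp())
(root / "real.dat").write_text("data")
(root / "good").symlink_to(root / "real.dat")
(root / "broken").symlink_to(root / "missing.dat")
dangling = find_dangling_links(root)
```

From the resulting list you can delete the broken links or re-point them with `Path.symlink_to` after an `unlink`.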
Here (Score:2, Funny)
Use databases! (Score:4, Insightful)
Subj.
If you have something more complex than a flat file, then use relational databases. Even Access databases are better than a collection of text files.
Re: (Score:3, Insightful)
If you have something more complex than a flat file, then use relational databases. Even Access databases are better than a collection of text files.
That really depends on what your intended use for them is. I mean I don't know this particular fellow's situation for data collection or what tools he uses for reporting and visualization but perhaps, for him, it's a much better idea to store them in flat files. Me? I have been using flat files for all my data collection about local crime (see here [lazylightning.org], here [lazylightning.org], here [lazylightning.org],
Re:Use databases! (maybe, maybe not) (Score:2)
I agree. It depends.
Yes, relational databases store and retrieve well-defined data very, very well. Do you have referential integrity needs? If that's your situation, use SQLite (small data and very simple types but little referential integrity), MySQL (medium to large data), or PostgreSQL (medium to very large data or more complex data types) and don't look back. SQL queries, relationships, and referential integrity are very powerful.
If not, then I'd look at MongoDB with GridFS. I'd even go further a
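To make the referential-integrity point concrete, a minimal sketch with Python's built-in sqlite3 (the `sample`/`run` schema is invented for illustration; note that SQLite only enforces foreign keys once the pragma is switched on):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when asked
conn.execute("CREATE TABLE sample (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE)")
conn.execute("""CREATE TABLE run (
    id        INTEGER PRIMARY KEY,
    sample_id INTEGER NOT NULL REFERENCES sample(id),
    ph        REAL,
    path      TEXT)""")
conn.execute("INSERT INTO sample (id, name) VALUES (1, 'A1')")
conn.execute("INSERT INTO run (sample_id, ph, path) VALUES (1, 7.4, 'runs/a1.dat')")

# A run pointing at a sample that does not exist is rejected outright:
try:
    conn.execute("INSERT INTO run (sample_id, ph, path) VALUES (99, 5.0, 'runs/x.dat')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

The same schema moves to MySQL or PostgreSQL essentially unchanged, which is part of the appeal.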
Re: (Score:2)
One can mix-n-match: use flat files to store raw data, post-processed stuff, etc., but use a database to keep track of everything. The latter can even be handled by your OS from the get-go. On OS X you could write a Spotlight plugin or two for your data files, and as long as the file format allows storing metadata within the files, Spotlight will index it. Same goes for Windows Search. You could also use native mechanisms for adding metadata to files - those exist IIRC on both Windows 7 and OS X.
Re: (Score:3, Informative)
Yes, agreed, a combination is good (SQL + NoSQL + filesystem).
There is no one-size-fits-all scenario, here.
However, there is utility in a NoSQL database over a raw filesystem. One feature is indexed search. Another is versioning. Another is the fact that it is extremely multiuser (proper record locking, even if there are multiple writes to the same record). Also, many NoSQL databases (especially MongoDB) have built-in replication, sharding, Map-Reduce, and horizontal scaling.
MongoDB's GridFS (especially
Re:Use databases! (Score:4, Interesting)
Certainly it depends, YMMV, and all that. Still, I think that some of the points that you bring up are not actually arguing against a relational database, perhaps just for a slight reorganization of your processes.
I agree with your final point though, we really have no idea what would be best for the submitter within the limitations of skills, budget, time, etc. Perhaps flat files are really the best solution, or maybe stone tablets are.
Re: (Score:3, Insightful)
The question he is really asking is probably more like: 'Is there anything like Windows live photo gallery that will allow me to tag and sort all of my data in a variety of ways like wlpg lets me sort my pictures?'
Re: (Score:2)
The question he is really asking is probably more like: 'Is there anything like Windows live photo gallery that will allow me to tag and sort all of my data in a variety of ways like wlpg lets me sort my pictures?'
That is what Nepomuk is supposed to do: allow you to build semantic meaning into your data. The problem I have with it as it currently stands is that the Nepomuk processes suck the life out of my computers, so I disable Nepomuk entirely.
Re: (Score:2, Funny)
Re: (Score:2, Informative)
I'm not quite sure why your post is rated as funny; scientists are not necessarily the best people to be left in charge of setting up databases. I've seen all sorts of atrocities constructed in the name of science, from vast flat files to cross-linked ISAM/VSAM files, and I remember many late nights (with complaints from wife) spent sorting them out when a subscript went out of
Re:Use databases! (Score:4, Insightful)
Not all IT is the same -- you want 'Informatics' (Score:4, Interesting)
The problem is, most IT people have no idea what to do with science data -- it'd be like going to a dentist because you're having a heart attack. They might be able to give general advice, but have no clue about what specifically needs to be done. Likewise, IT might be staffed with people who are really good at diagnosing hardware but suck at writing code. Not all IT specialists are cross-trained in enough topics to deal with this issue effectively (data modeling, UIs, database admin, programming, and the science discipline itself).
There's a field out there called 'Science Informatics'. It's not a very large group, but there's a number of us who specialize in helping scientists organize, find, and generally manage data. Think of us as librarians for science data.
Most of us would even be willing to give advice to people outside our place of work, as the better organized science data is in general, the more usable and re-usable it is. There's even a number of efforts to have people publish data, so it can be shared, verified, etc. And most of us have a programming background, so we might be able to share code with you, as we try to make it open source where we can, so we don't all have to re-solve the same problems.
Because each discipline varies so much, both in how they think about their data, and what their ultimate needs are, we tend to be specialists, but there's a number of different groups out there, for example:
There's also Bioinformatics, Health/medical informatics, chemical informatics, etc. Plug your science discipline + 'informatics' into your favorite search engine, and odds are you'll find a group or person you can write to to try to get more info and advice.
NSF recently funded a few more groups to try to build out systems and communities: DataOne [dataone.org] and the Data Conservancy [dataconservancy.org], and I believe there's some more money still to be awarded.
Re: (Score:2)
Funny indeed. Having worked at both the university and National Lab levels, the IT departments I've encountered have more of the function of pushing so many restrictive security policies and pieces of corporate spyware to everyone's systems than of actually enabling researchers to be more productive...
Really, this is exactly what WinFS would've been perfect for, if MS had ever gotten it working and released it. As it is, I use hierarchical directory structure - top level is the research project, then date/
Re: (Score:2)
I'm amused by all the /.ers suggesting the physics nerd set up a SQL database. It ain't super easy
one SQL database? As in, one gigantic SQL database? Yeah, that's a headache.
OTOH, one SQL database per "type" of data (i.e., "last summer's research.sql")? Hell yes. If it's not in a database, you need to spend the ten minutes to learn how to make one. MS Access exists for a reason, and this is it. It won't be a very pretty db, but it doesn't need to be -- it'll be indexed, searchable, and more or less protected.
The only reason NOT to put it into a relational database is if you have some system that's
Re: (Score:3, Insightful)
MySQL: click the install button (use default options, they're all fine)
phpMyAdmin: put in document root, configure ("click the install button")
And you're set. How was that hard...?
Some software is, in fact, difficult to set up and maintain. As a scientist with an unusually large sample collection, learnin
Re: (Score:2)
Any physics nerd should be able to set up a MySQL database pretty easily-- it's not quite as easy as falling out of a tree, but it's not anywhere near as difficult as a lot of other things in physics. A great deal of data acquisition and analysis for many (if not most) physical scientists involves a bunch of custom programming, and many of the theoretical sorts do a lot of computer modeling. MySQL is pretty easy to install on just about anything, and if you have a reasonable idea of what your data will lo
Re: (Score:2)
Umm, what? Any decent scientist these days should be able to program. Setting up an SQL database takes next to no time, as long as you use the right tools to do it. pgAdmin comes to mind, or oo.ORG database frontend. Now the problem may be that many scientists and engineers still don't know how to program, but that's another story.
Databases are not as convenient as files (Score:3, Interesting)
I agree that this is a candidate for a database. One problem with databases for researchers is that generally one does not know the right schema beforehand and one is dealing with ad hoc contingencies a lot. Another is portability to machines you don't control or that are not easily networked. A final problem is archival persistence. I can't think of a single database scheme that has lasted five, let alone ten, years and still functions without being maintained. Files can do that.
So if you want some banda
Re: (Score:2)
I agree that this is a candidate for a database. One problem with databases for researchers is that generally one does not know the right schema beforehand and one is dealing with ad hoc contingencies a lot. Another is portability to machines you don't control or that are not easily networked. A final problem is archival persistence. I can't think of a single database scheme that has lasted five, let alone ten, years and still functions without being maintained. Files can do that.
The article didn't actually ask for a way to organize data, he asked for a way to organize files. He could easily create a database that points to whatever files he wants. Populate the relevant columns, and if he wants to add another type of data to search on, add a column. It's not like you need data in every cell, you simply need data in the cells that you want to find again.
Re: (Score:2)
The article didn't actually ask for a way to organize data, he asked for a way to organize files. He could easily create a database that points to whatever files he wants. Populate the relevant columns, and if he wants to add another type of data to search on, add a column. It's not like you need data in every cell, you simply need data in the cells that you want to find again.
And right now he's using a complicated mess of symlinks that amounts to a db schema that's probably a huge pain to maintain. Pick one straightforward way to organize files (e.g. date, with directories by month or something) and use the db for sifting through them to pick files by pH, lunar phase, and hair color (or whatever).
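A minimal sketch of that "one simple layout on disk, database for the sifting" approach using Python's built-in sqlite3 (the schema, paths, and values are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE dataset (
    path     TEXT PRIMARY KEY,   -- one straightforward on-disk location per file
    sample   TEXT,
    ph       REAL,
    run_date TEXT)""")
rows = [("2010/01/a1.dat", "A1", 7.4, "2010-01-12"),
        ("2010/01/a2.dat", "A2", 5.5, "2010-01-13")]
conn.executemany("INSERT INTO dataset VALUES (?, ?, ?, ?)", rows)

# A new qualifier later is one schema change, not a re-link of thousands of files:
conn.execute("ALTER TABLE dataset ADD COLUMN temperature REAL")

# Sift by any criterion; the files themselves never move.
acidic = [p for (p,) in conn.execute("SELECT path FROM dataset WHERE ph < 7")]
```

The query replaces a whole symlink hierarchy: cells you never filled simply come back as NULL.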
Re:Use databases! (Score:5, Interesting)
I've recently made a comparison of MySQL 5.0, Oracle 10g and HDF5 file-based data storage for our space data. The results [google.com] are amusing (the linked page contains charts and explanations; pay attention to the concluding chart, as it is the most informative one). In short (for those who don't want to look at the charts): using relational databases for pretty common scientific tasks sucks badly performance-wise.
Disclaimer: I'm positively not a guru DBA, and I admit that both of the databases tested could be configured and optimized better, but the thing is that I am not supposed to be one. Neither is the OP. While we may do lots of programming work to accomplish our scientific tasks, being a qualified DBA is a completely separate challenge - an unwanted one, as well.
So far, PyTables/HDF5 FTW. Which brings us back to the OP's question about organizing these files...
Re: (Score:3, Insightful)
Translation: I am not a DB guru, but I deal with massive amounts of complex data and need a DB guru, but I have no intent on hiring one.
Seriously, hire a DB wizard in the DB software of your choice for a couple of days. Have him set up the data and optimize it. You'll save yourself a lot of headaches, AND put yourself in a good position for future data maintenance. Imagine that your project gets a lot of attention in the future, and you suddenly get a lot of funding and the money to hire more people. Or imagi
Re:Use databases! (Score:4, Interesting)
I've built LIMS systems that manage petabytes of data for companies and small-scale data management solutions for my own research. Regardless of the scale, the same basic pattern applies: databases + files + programming languages. Put your metadata and experimental conditions in the database. This makes it easy to manage and find your data. Keep the raw data in files. Keep the data in a simple directory structure (I like instrument_name/project_name/date type hierarchies, but it doesn't really matter what you do as long as you're consistent) and keep pointers to the files in the database. Use Python/Perl/Erlang/R/Haskell/C/whatever-floats-your-boat for analysis.
Databases are great tools when used properly. They're terrible tools when you try to shoehorn all your analysis into them. It's unfortunate that so few scientists (computer and other) understand how to use them. Also, for most scientific projects, SQLite is all you need for managing metadata. Anything else and you'll be tempted to do your analysis in the database. Basic database design is not difficult to learn - it's about the same as learning to use a scripting language for basic analysis tasks.
The main points:
1) Use databases for what they're good at: managing meta-data and relations.
2) Use programming languages for what they're good at: analysis.
3) Use file systems for what they're good at: storing raw data.
-Chris
Re: (Score:2)
Depending on the size and stability of the GPs research budget, that may not be practical. I worked on a fairly large academic research team (by EE/CS standards) that had the budget to hire a few full-time staff members for certain things. After the main implementation push the project wound down a bit, and those staff moved on to other jobs, leaving the grad students to maintain the infrastructure. That was fine as it was, but could have been massively not-fine if the staff had used complex tools that r
Re: (Score:2)
That's where documentation comes in handy. They did document the system enough so that a less-than-specialist could still keep it running, right?
Re: (Score:2)
But I don't think documentation is a panacea if the tool used is particularly rarified. Perhaps the DBA in question (this is purely hypothetical now) set something up using Oracle, and then left. Now, maybe it's easy enough to use as the interface for SQL queries and the like, but what happens if there are major reorganizations that really do require specialized know
Re: (Score:2, Interesting)
Consistently (Score:4, Insightful)
Word up. I'd say the first goal is to store your raw, bulk data consistently. Then you can have several sets of post processing scripts that all draw from the same raw data set.
You want this data format to be well-documented, but I wouldn't bother meticulously marking it up with XML tags and other metadata or whatever. You just want to be able to read it fast, and have other scripts be able to convert it into other formats that would be useful for analysis, be it matlab, octave, csv, or some tediously marked-up XML. You do want to be able to grep and filter the data pretty easily, so keep that in mind when you're designing the format. It will likely end up being pretty repetitive, but that's OK, since you'll likely store it compressed. That can improve performance when reading it, since the storage medium you're pulling the data from is often slower than the processor doing the decompression... and it also provides some data integrity / consistency checking. Oh, and of course, you can store more raw data if it's compressed.
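The "repetitive but compressed" idea can be sketched with nothing but Python's standard library (the file name and columns here are illustrative, not from the post):

```python
import csv
import gzip
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "run.csv.gz")

# Repetitive, greppable text (zgrep still works on it); gzip makes the
# redundancy cheap, and its built-in CRC gives a basic integrity check for free.
with gzip.open(path, "wt", newline="") as f:
    w = csv.writer(f)
    w.writerow(["sample", "ph", "value"])
    w.writerow(["A1", "7.4", "0.123"])
    w.writerow(["A1", "7.4", "0.125"])

# Reading back: decompression is usually cheaper than the disk reads it saves.
with gzip.open(path, "rt", newline="") as f:
    rows = list(csv.reader(f))
```

The same file loads into a spreadsheet, R, or gnuplot after a one-line decompress, which is the compatibility argument in a nutshell.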
Re: (Score:3, Interesting)
...using relational databases for pretty common scientific tasks sucks badly performance-wise.
Well, it has never been a secret that relational databases do not perform as well as e.g. a bespoke ISAM or hash-indexed data file, mostly due to the fact that it involves interpretation of SQL. But then the main purpose of SQL databases has never been to optimise raw performance - rather the idea is to provide maximum logical flexibility. The beauty of relational databases is that you can change both data and metadata with a wave of the hand, where you in the high-performing, bespoke database have to
Re: (Score:2)
"If you have something more complex than a flat file, then use relational databases. Even Access databases are better than a collection of text files."
Ever tried to access (pun intended) big blob fields from Access?
The guy has a lot of data. Therefore he needs a database. He already has one with a hierarchical storage (the filesystem) that probably conveys the way data is generated (hierarchically, if only by date).
Then he needs to access the data by different criteria. Are they relational? Then a rela
Re: (Score:2)
http://sourceforge.net/projects/vym/ [sourceforge.net]
I think using a flat file and this would be helpful. Maybe. I don't know exactly what the data is, however, so this may not be very helpful.
Re: (Score:2)
Nice idea
Nice service you have advertised in your sig
Even Nicer referral link
Re: (Score:2)
Can I ask a dumb question as a completely database-naive person? I'm an occasional scientist, and I have a fair bit of experimental data lying around, stored as files. I've had it suggested to me before that I clearly need to get the data into a database (and that our center in fact should force people to store their raw data this way). At the same time, my daily workflow revolves around files. I use software that opens and deals with files. Some of the software expects to find the files organized in d
Separate data from presentation (Score:5, Informative)
In my experience, the best thing is to let the structure stand as it was the first time you stored the data.
Later, when you discover more and more relationships around that data, you may create, change, and destroy those symbolic links as you wish.
I usually refrain from moving the data itself. Raw data should stand untouched, or you may delete it by mistake. Organize the links, not the data.
Re: (Score:2)
Absolutely agreed.
And indeed my biggest 'data' collection of over a decade old is all directly exposed to the Web, and other than taking care of significant errors in the original naming, I've followed the "Cool URIs Don't Change" mantra and had the presentation app tie itself in knots if need be to leave the data as-is and present it in any new ways required.
It's like having an ancient monument: you don't shuffle it around to suit the latest whim, else you'll most likely mess it up beyond repair.
Rgds
Damon
sqlite (Score:4, Insightful)
http://www.sqlite.org/ [sqlite.org] a "replacement for fopen()" -- http://www.sqlite.org/about.html [sqlite.org]
Databases (Score:2)
SQL comes really handy. I can imagine several simple scripts + SQLite indexing table. Or anything else.
Matlab Structures (Score:4, Interesting)
I have to organize and analyze 100 GB of data from a single day's experiment. Raw data goes on numbered HDs that are cloned and stored in separate locations. This data is then processed and extracted into about 1 GB of info. From there the relevant bits are pulled into hierarchical MATLAB structures. Abstract everything that you can into a MATLAB structure array. Have all your analysis methods embedded in the classes that compose the array. Use SVN to maintain your analysis code base.
Re:Matlab Structures (Score:5, Interesting)
I have to organize and analyze 100 GB of data from a single day's experiment. Raw data goes on numbered HDs that are cloned and stored in separate locations. This data is then processed and extracted into about 1 GB of info. From there the relevant bits are pulled into hierarchical MATLAB structures. Abstract everything that you can into a MATLAB structure array. Have all your analysis methods embedded in the classes that compose the array. Use SVN to maintain your analysis code base.
Yes, yes, yes.
I have very similar data collection requirements and strategy with one exception: the data that can be made human-readable in original format are made so. Always. Every original file that gets written has the read-only bit turned on (or writeable bit turned off, whichever floats your boat) as soon as it is closed. Original files are NEVER EVER DELETED and NEVER EVER MODIFIED. If a mistake is discovered requiring a modification to a file, a specially tagged version is created, but the original is never deleted or modified.
Also, every single data file, log file, and whatever else that needs to be associated with it is named with a YYMMDD-HHMMSS- prefix and, since experiments in my world are day-based, put into a single directory called YYMMDD. I've used this system now for nearly 20 years and not screwed up by using the wrong file, yet. Files are always named in a way that (a) doing a directory listing with alpha sort produces an ordering that makes sense and is useful, and (b) there is no doubt as to what experiment was done.
In addition, every variable that is created in the original data files has a clear, descriptive, and somewhat verbose name that is replicated through in the MATLAB structures.
Finally, and very importantly, the code that ran on the data collection machines is archived with each day's data set so that when bugs are discovered we can know EXACTLY which data sets were affected. As a scientist, your data files are your most valuable possessions, and need to be accorded the appropriate care. If you're doing anything ad-hoc after more than one experiment, then you aren't putting enough time into devising a proper system.
(I once described my data collection strategy to a scientific instrument vendor and he offered me a job on the spot.)
I also make sure that when figures are created for my papers I've got a clear and absolutely reproducible path from the raw data to the final figures that include ZERO manual intervention. If I connect to the appropriate directory and type "make clean ; make", it may take a few hours or days to complete, but the figures will be regenerated, down to every single arrow and label. For the aspiring scientist (and all of the people working in my lab who might be reading this), this is perhaps the most important piece of advice I can give. Six months, two years, five years from now when someone asks you about a figure and you need to understand how it was created, the *only* way of knowing that these days is having a fully scripted path from raw data to final figure. Anything that required manual intervention generally cannot be proven to have been done correctly.
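The write-once, timestamp-prefixed convention described above can be sketched in a few lines of Python (the function name and arguments are mine, not the poster's):

```python
import os
import stat
import tempfile
import time

def write_original(data: bytes, directory: str, label: str) -> str:
    """Write a raw data file with a YYMMDD-HHMMSS- prefix, then drop all write bits
    so the original can never be casually modified or overwritten."""
    stamp = time.strftime("%y%m%d-%H%M%S")
    path = os.path.join(directory, f"{stamp}-{label}")
    with open(path, "wb") as f:
        f.write(data)
    mode = os.stat(path).st_mode
    os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
    return path

# Demonstration in a scratch directory.
p = write_original(b"raw counts...", tempfile.mkdtemp(), "detector1.dat")
```

Alphabetical sort of the resulting names is chronological order, which is the point of the prefix.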
Go for NoSQL! (Score:4, Funny)
OK, the subject line is the short answer; here's the long answer.
Since experimental data usually doesn't have the same structure for all experiments, you may try something like this:
At the deepest, most basic level, organize it using JSON or XML (I don't know what kind of experiments you do, but you would put in lists of data, etc.)
Then you store this in a NoSQL db (like CouchDB or Redis) and index it the way you like; even if you don't index, you can still search it manually (slower, but it works).
Don't bother with hierarchies (Score:5, Interesting)
Instead of trying to organize your data into a directory structure, use tagging instead. There's a lot of theory on this -- originally from library science, and more recently from user interface studies. The basic idea is that you often want your data to be in more than one category. In the old days, you couldn't do this, because in a library a book had to be on one and only one shelf. In this digital world you can put a book on more than one "shelf" by assigning multiple tags to it.
Then, to find what you want, get a search engine that supports faceted navigation. [wikipedia.org]
Four "facets" of ten nodes each have the same discriminatory power as a single hierarchy of 10,000 nodes. It's simpler, cleaner, faster, and you don't have to reorganize anything. Just be careful about how you select the facets/tags. Use some kind of controlled vocabulary, which you may already have.
There are a bunch of companies that sell such search engines, including Dieselpoint, Endeca, Fast, etc.
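For a collection of thousands of files you don't even need a commercial engine to see how this works; a faceted lookup is just set intersection over a tag index. A minimal Python sketch (tag and file names invented):

```python
from collections import defaultdict

# A file can sit on many "shelves" at once: map each tag to the set of files carrying it.
index = defaultdict(set)

def tag(path, *tags):
    for t in tags:
        index[t].add(path)

def search(*tags):
    """Faceted navigation: intersect the file sets for every requested facet value."""
    sets = [index[t] for t in tags]
    return set.intersection(*sets) if sets else set()

tag("run1.dat", "ph7", "sampleA", "2010")
tag("run2.dat", "ph5", "sampleA", "2010")
```

Each added facet narrows the result multiplicatively, which is exactly the ten-nodes-per-facet arithmetic above.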
Re: (Score:2)
I would also add that if your datasets are like mine (enormous), it may be a good idea to md5 them. Have a log file where you enter information about each file: its creation date, its md5, nature of data, source, etc...
Do not use name or path hierarchies to keep track of metadata, it is doomed to fail. If you feel this is worth the effort you can set up a database for this info but in my opinion if you have hundreds to thousands files, a simple flat file can be good enough.
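A hedged sketch of such a checksum log using only Python's standard library (the JSON-lines format and field names are my choice, not the poster's):

```python
import hashlib
import json
import os
import tempfile
import time

def log_dataset(path, log_path, **info):
    """Append one JSON line per dataset: an md5 checksum plus free-form metadata."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # stream; files may be huge
            h.update(chunk)
    entry = {"path": path, "md5": h.hexdigest(),
             "logged": time.strftime("%Y-%m-%d"), **info}
    with open(log_path, "a") as log:
        log.write(json.dumps(entry) + "\n")
    return entry

# Demonstration with a scratch file.
tmp = tempfile.mkdtemp()
data = os.path.join(tmp, "run1.dat")
with open(data, "wb") as f:
    f.write(b"hello")
entry = log_dataset(data, os.path.join(tmp, "datasets.log"), source="NMR", sample="A1")
```

One line per file keeps the log greppable, and re-hashing later detects silent corruption.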
Use hard links (Score:2)
Instead of symlinking to directories,
create directories of hard links to the files.
Then you can move files around whenever you like,
and you never have any dangling links.
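A minimal Python sketch of the hard-link scheme (the directory layout is invented for illustration):

```python
import os
import tempfile

def tag_with_hardlink(src, tag_dir):
    """Place a hard link to src inside tag_dir; unlike a symlink, it keeps working
    if src is later renamed or moved within the same filesystem."""
    os.makedirs(tag_dir, exist_ok=True)
    dst = os.path.join(tag_dir, os.path.basename(src))
    os.link(src, dst)
    return dst

# Demonstration in a scratch tree.
root = tempfile.mkdtemp()
raw = os.path.join(root, "run1.dat")
with open(raw, "w") as f:
    f.write("data")
link = tag_with_hardlink(raw, os.path.join(root, "by_ph", "7.4"))

# Renaming the original would leave a symlink dangling; the hard link still resolves.
os.rename(raw, os.path.join(root, "run1_renamed.dat"))
```

Caveats: hard links cannot span filesystems or point at directories, and a tool that rewrites a file by saving a temporary copy and renaming it over the original will quietly detach the other links.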
Careful... (Score:2)
It depends how you update the files. Many systems, when updating a file, will write the entire new file to a temporary location, then atomically rename it on top of the old location, which would kill any hardlinks, but symlinks would still work.
I have to agree with the database suggestions, though something NoSQL-ish may work better.
Re: (Score:2)
Instead of symlinking to directories,
create directories of hard links to the files.
Then you can move files around whenever you like,
and you never have any dangling links.
I second this. I have a big collection of photos that I've downloaded over the years, and I "tag" them via hard-links into directories. The same photo may be found under "party/jane/nyc", "party/nyc/jane", "nyc/party/jane", "nyc/jane/party", "jane/party/nyc" and "jane/nyc/party". If two people are in the photo, that's twenty-four links, but I have Perl scripts that take care of the grunt work; a picture with N tags will have "just" N! links. I don't link photos to intermediate directories, but all pictu
I used to be anal about organization... (Score:2, Insightful)
...but then google came along and taught me that it's not about knowing where things are, but rather about being able to find them. My email, for instance, is "organized" by the year in which it arrives, and I use the search function of my email client to find things. No big folder structure, no moving messages around, and I haven't had problems finding any email I need. Oh yes, I keep them all... good fodder for "on x/x/xxx you said..." retorts.
For files, then, the key is to have descriptive file names that p
Re: (Score:2)
Interns. (Score:2)
Life is too short. Get someone else to do it, under the guise of valuable field experience.
Re:Interns. (Score:4, Informative)
Linked Data, of course (Score:2, Informative)
Re: (Score:2)
Yes, this! I work in bioinformatics, and while relational DBMSes are used by a large number of projects, the problem you face with experimental data is that its organisation is non-obvious and relationships between different bits of data are not apparent. This means that RDBMSes aren't a natural fit. One of my colleagues has been experimenting with graph oriented databases (specifically neo4j (http://neo4j.org/)), an approach which has interesting intersections with declarative programming. In the near futu
How can you not? (Score:2)
I never understood how you can have something organized or not.
I organize my stuff at the planetary level.
Universe > Solarsystem > Earth > Continent > United States > Florida > County > City > Street > House > Room > Desk > Computer > Hard Drive > Folder > File Type > Location
I think I'm pretty well organized even though I misplace stuff all the time.
Re: (Score:2)
Re: (Score:2)
I skip the little things... Plus I do not recognize U.S. soccer teams as a tool for organization. I had to remove Galaxy from my organizational charts when they formed the L.A. Galaxy. That little naming convention set me back 10 years' worth of organization. I became confused about which galaxy I lived in and had to seek extensive therapy. I feel a relapse coming on.
But honestly thanks for the clarification, I actually forgot a little thing like galaxies.
Try using a scientific workflow system (Score:3, Insightful)
Be like Google... (Score:2)
"Search, don't sort".
The size and complexity of your data management should match the size and complexity of your data set. If you have thousands of datasets, give serious consideration to a relational database. Store all of your metadata (pH, date, etc) in the database so you can query it easily. If your raw data lives in a text-based format, put it in the database too, otherwise just store the path to your file in the database and keep your files in some sort of simple date-based archive or whatever.
No
four directories (Score:5, Funny)
$PRJ_ROOT/data/fits
$PRJ_ROOT/data/doesnt_fit
$PRJ_ROOT/data/doesnt_fit/fixed
$PRJ_ROOT/data/made_up
Re: (Score:3, Funny)
Oh, come on! Who let the climatologists in here?
Re: (Score:2)
(I posted AC above by mistake):
$ ls -lrt $PRJ_ROOT/data
total 5
-rw-r--r-- 1 geek phd 4096 1998-04-15 15:16 theoretical
-rw-r--r-- 1 geek phd 4096 2000-10-01 17:20 doesnt_fit
-rw-r--r-- 1 geek phd 4096 2006-06-29 22:17 doesnt_fit/fixed
lrwxrwxrwx 1 geek phd 11 2007-03-12 23:03 fits -> theoretical
-rw-r--r-- 1 geek phd 65536 2009-02-17 23:33 made_up
The Obvious Solution (Score:2)
is to use CVS (comma/tab seperated value) files to store the data. This makes it easy to import into a spreadsheet or database in the future as your needs grow.
Re: (Score:2)
I think you mean CSV
Re: (Score:2)
Relational Databases won't do! (Score:4, Informative)
To everybody here suggesting relational databases: you are on the wrong track here, I'm afraid to tell you. Relational databases handle large sets of completely homogeneous data well, if you can be bothered to write software for all the I/O around them. This is where it all falls apart:
1. Many lab experiments don't give you the exact same data every time. You often don't do the same experiment over and over. You vary it and the data handling tools have to be flexible enough to cope with that. Relational databases aren't the answer to that.
2. Storing and fetching data through database interfaces is vastly more difficult than just using standard input/output or plain text files. I've written hundreds of programs for processing experimental data and I can tell you: nothing beats plain old ASCII text files! The biggest boon is that they are compatible with almost any scientific software you can get your hands on. Your custom database likely is not. Or how would you load the contents of your database table into gnuplot, Xmgrace or Origin, just to name a few tools that come to my mind right now?
I wish I had a good answer to the problem. At times I wished for one myself, but I fear the best reply might still be "shut up and cope with it".
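One partial answer: the two worlds are not mutually exclusive, since a database can emit exactly the plain whitespace-delimited text that gnuplot and friends expect. A minimal sketch with Python's built-in sqlite3 (the table, values, and file name are invented):

```python
import os
import sqlite3
import tempfile

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurement (x REAL, y REAL)")
conn.executemany("INSERT INTO measurement VALUES (?, ?)",
                 [(0.0, 1.0), (1.0, 2.7), (2.0, 7.4)])

# Dump to the whitespace-delimited columns gnuplot reads with: plot "m.dat" using 1:2
out = os.path.join(tempfile.mkdtemp(), "m.dat")
with open(out, "w") as f:
    for x, y in conn.execute("SELECT x, y FROM measurement ORDER BY x"):
        f.write(f"{x} {y}\n")
```

So the metadata and selection logic can live in the database while every downstream tool still sees plain ASCII.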
Re: (Score:2)
Storing and fetching data through database interfaces is vastly more difficult than just using standard input/output or plain text files. I've written hundreds of programs for processing experimental data and I can tell you: nothing beats plain old ASCII text files!
Well, I believe utf-8 encoded text files beat them hands down. Especially for scientific research. //snarky
Re: (Score:2)
Hey, no need to be snarky about it!
I've run into a number of tasks where we "standardized" on ASCII plain-text files, and then ran across problems like people and place names in French, Russian, Greek, and/or Chinese. It was a real relief when we officially replaced the "ASCII" with "UTF-8" throughout the docs. The only problem then was finding
Re:RDB won't do! (Score:2)
What I am looking for is multidimensional addressing: a file or database system with something like multiple B-trees over the content, and the possibility to add another B-tree index if required later.
Maybe the Google people have an answer?
Re: (Score:2)
To everybody here suggesting relational databases: you are on the wrong track here, I'm afraid to tell you. Relational databases handle large sets of completely homogeneous data
Wrong. We're not on the wrong track. Databases don't only handle "homogeneous data" sets. You just don't know how to use them flexibly.
if you can be bothered to write software for all the I/O around them
Wrong. Databases abstract away I/O primitives and file formats, making creating/accessing your data much simpler than using (e.g.) flat text files.
nothing beats plain old ASCII text files!
Wrong. A great many things beat flat text files, under a great many use cases. The capabilities of (e.g.) a sqlite database are a strict (and much larger) superset of those o
Re:Relational Databases won't do! (Score:4, Insightful)
I hereby humbly suggest that you are, instead, wrong. Here is why:
Scientists are not software developers and never want to be. They want to run their experiments and analyse their data. The latter requires recording and processing of numerical data. This is where computers enter their workflow - as number crunching tools that have to be easy to use and utterly flexible.
At times, my work consisted of writing lots of one-off C and Python programs to process data in ever new ways in order to get an idea of what I was actually looking at. And I had to write them myself because these weren't your run-of-the-mill analysis steps. Many of these programs were not run again once I had their results. During all this time, I as a scientist was looking to get the data in and out of the programs in ways that are easy to code, without getting distracted from what I wanted to achieve scientifically. My head was full of theory and formulas, not data structures and good software design.
In that particular state of mind, writing SQL isn't one of the things that I would have wanted to spend any time on. The inherent complexities are a distraction and a big one at that. And, hell, I'm one of the guys who actually *know* SQL. Most scientists actually don't. Hell, many of them barely know how to use their favorite language's core libs to their advantage. They don't care and - may I say - rightly so.
Besides, the code would get more bloated. If I want to output three values that belong together I write a print statement that places them on the same line of text in the output file and I'm done. That's a one-liner that takes me about 20 seconds to type in. In the worst case, I need to open a file beforehand and close it afterwards instead of piping it into stdout. That's maybe 3 lines of code. Now tell me: how many lines of code do I need to write to place these values in a database? That is, provided that a table already exists to hold that data.
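For scale, here is roughly what that same step looks like with Python's bundled sqlite3 (the table, column names, and values are made up for illustration): still only a handful of lines, but noticeably more ceremony than a print statement.

```python
import sqlite3

# The flat-file version of the step is one line:
#   print(ph, temp_k, signal)
# The database version, assuming a table already exists to hold the data:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (ph REAL, temp_k REAL, signal REAL)")

ph, temp_k, signal = 7.4, 298.0, 0.482
conn.execute("INSERT INTO results VALUES (?, ?, ?)", (ph, temp_k, signal))
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM results").fetchone()[0])  # → 1
```

Whether the extra connect/insert/commit ceremony is worth it depends entirely on how often you later need to query across runs.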
My point is: relational databases don't do the job for scientists. Instead, they get in the way. And you and anyone else here who is arguing in favor of them probably lack the related experience to understand that - no offense intended. The points you make are derived from pure theory. Respect the needs of the users as well, please.
Maybe there is a middle ground here: hire a software developer who builds and maintains the DB and a nice, convenient to use wrapper library around it for you. That'll take a while and someone will have to foot the bill for it.
Two approaches... (Score:2)
A lot depends on the type of data. If it is truly experimental results, then results could be easily organized in tables, and tables can be logically accessed, arranged and manipulated using standard rules of set theory. Relational databases work this way, but there are other approaches.
If your data is derived or crunched, you may have a massive logic problem. See this: http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html [blogspot.com] , and take heed.
The previous suggestions about leaving you
Re:Two approaches...OOPS! (Score:2)
Sorry, I forgot to include the fact that a network DBMS does not require you to rename or re-link your directory scheme. It simply creates pointers to relevant links and then maintains the pointer logic.
tags and search (find/grep) (Score:2)
SQLite + Scripting language (Score:3, Informative)
Good luck and enjoy!
What about a wiki? (Score:3, Insightful)
Re: (Score:2, Informative)
A database ... (Score:2)
You need to start using a database. You don't have to actually put the data in a database but all of the meta data needs to go into one. Store your data files in one file system using whatever naming scheme you want and never move the files again. At the same time record the file system location along with all other meta data that is relevant. Then some simple database queries, e.g. embedded in some web pages can retrieve the location and even the data. You can of course also store the data in a databa
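A minimal sketch of that metadata-only approach with Python's bundled sqlite3: the data files stay exactly where they are, and the database records each file's location plus searchable metadata. The schema and field names here are assumptions for illustration.

```python
import sqlite3

# Files live on the filesystem and never move; the DB holds the metadata.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE datasets (
    path TEXT PRIMARY KEY,   -- recorded once, never changed
    sample TEXT, ph REAL, experiment TEXT, acquired TEXT)""")

conn.execute("INSERT INTO datasets VALUES (?, ?, ?, ?, ?)",
             ("/data/2010/08/run_0042.dat", "A12", 7.4,
              "titration", "2010-08-15"))

# A query replaces the tangle of symlink directories: find every pH-7.4
# titration without touching the filesystem layout at all.
hits = conn.execute(
    "SELECT path FROM datasets WHERE experiment = ? AND ph = ?",
    ("titration", 7.4)).fetchall()
print(hits)  # → [('/data/2010/08/run_0042.dat',)]
```

Renaming a qualifier then means one UPDATE statement instead of redirecting thousands of symlinks.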
Use tags in Apple OS X (Score:3, Insightful)
Re: (Score:2)
Extremely unlikely that a scientist would be using OSX
Apparently you don't know many scientists. I work at a national laboratory for our computing center, and there are hundreds of scientists using Macs at our laboratory. I've noticed lots of the Linux heads switching to Mac. It's nice to have a sweet GUI along with a BSD Unix underpinning,
Apple does not make good cluster machines, but they make excellent desktops for doing science.
Get your head out of your bias.
I didn't know that Windows and
Re: (Score:2)
How CMS sorts data (Score:2, Informative)
Well, CMS is one of the large experiments at the LHC. The data produced should reach petabytes per year, and adding the simulated data on top, we have a helluva lot of data to store and address. What we use is a logical filename (LFN) format. We have a global "filesystem" where different storage elements have files organized in a hierarchical subdirectory structure. As an example: /store/mc/Summer10/Wenu/GEN-SIM-RECO/START37_V5_S09-v1/0136/0400DDE2-F681-DF11-BA13-00215E21DC1E.root
the /store is a b
Another vote for NoSQL and some experience (Score:2, Informative)
Re: (Score:2)
Knowledge Management Tools (Score:2)
First devise a meaningful stable primary key (Score:3, Informative)
First I would lay out your data using meaningful labels, like a directory named for the acquisition date + machine + username. Never change this. It will always remain valid and allow you to later recover the data if other indexes are lost. Then back up this data.
Next build indexes atop the data that semantically couple the components in the ways that are meaningful or accessible. This may manifest as indexed tables in a relational database, duplicate flat files linked by a compound naming convention, unix directory soft links, etc.
If you're processing a lot of data, your choice of indexes may have to optimize your data access pattern rather than the data's underlying semantics. Optimize your data organization for whatever is your weakest link: analysis runtime, memory footprint, index complexity, frequent data additions or revisions, etc.
In a second repository, maintain a precise record of your indexing scheme, and ideally, the code that automatically re-generates it. This way you (or someone else) can rebuild lost databases/indexes without repeating all your design and data cleansing decisions, and domain expertise. This info is often stored in a lab notebook (or nowadays in an 'electronic lab notebook').
I'd emphasize that if you can't remember how your data is laid out or pre-conditioned, your analysis of it may be invalid or unrepeatable. Be systematic, simple, obvious, and keep records.
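A minimal sketch of the rebuildable-index idea: the canonical tree uses date+machine+user directory names that never change, and a script regenerates the view directories (by sample, by pH, and so on) from scratch, so dangling links are cured by deleting the index and rerunning. The paths and metadata format here are assumptions for illustration.

```python
import os
import json
import tempfile

root = tempfile.mkdtemp()

# Canonical layout: acquisition date + machine + user, never renamed.
primary = os.path.join(root, "20100815_specA_geek")
os.makedirs(primary)
with open(os.path.join(primary, "meta.json"), "w") as f:
    json.dump({"sample": "A12", "ph": 7.4}, f)

# Regenerate the by-sample index from scratch; safe to delete and rerun.
index = os.path.join(root, "by_sample")
os.makedirs(index, exist_ok=True)
for name in os.listdir(root):
    meta_path = os.path.join(root, name, "meta.json")
    if not os.path.isfile(meta_path):
        continue  # skip the index directory itself and anything unlabeled
    with open(meta_path) as f:
        sample = json.load(f)["sample"]
    sample_dir = os.path.join(index, sample)
    os.makedirs(sample_dir, exist_ok=True)
    link = os.path.join(sample_dir, name)
    if not os.path.lexists(link):
        os.symlink(os.path.join(root, name), link)

print(os.listdir(os.path.join(index, "A12")))  # → ['20100815_specA_geek']
```

Keeping that script under version control is the "second repository" part: anyone can rebuild the whole index layer from the primary data alone.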
Re: (Score:2)
You need a LIMS (Score:2, Insightful)
A Laboratory Information Management System [wikipedia.org] will help you store, organize, analyze and data-mine your data.
LIMS! This is a no-brainer! (Score:2)
It seems you have never heard of LIMS (Laboratory Information Management Systems), which is unfortunate.
This is a thriving software sector, and you are actually expected to be at least vaguely familiar with these kind of systems should you ever transfer to industry and work in data-generating or data-processing positions.
Nobody in industry keeps experimental data as individual, handcrafted datasets. The risk of losing important data, or of not being able to make cross-references (patents!), is much too high if you let pe
Used to be two-word answer (Score:2)
I used to have a two word answer for this question: Use BeOS
But now it's a six word answer (*sigh*): Invent time machine, then use BeOS
--MarkusQ
Re: (Score:2)
Wrong two word answer. I seriously doubt BeOS was ever used much for scientific research.
The correct two word answer is: "lab notebook".
Re: (Score:2)
Learn a Relational Database! (Score:2)
It's already been said, but it bears saying again. Directories and symlinks.... oh my!
By date, external file for metadata. (Score:2)
Here's what I do:
Directory for each data set, labeled by date (20100815).
Short README file inside each directory with description of the run.
Big spreadsheet (or database, if you're fancy) with experimental parameters and core results, that can be sorted, reorganized, and graphed.
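That scheme is simple enough to script. A sketch in Python (the run names, README text, and CSV columns are invented for illustration):

```python
import os
import csv
import tempfile

base = tempfile.mkdtemp()

# One directory per data set, named by date; a short README describes it.
run = os.path.join(base, "20100815")
os.makedirs(run)
with open(os.path.join(run, "README"), "w") as f:
    f.write("Titration of sample A12 at pH 7.4; laser realigned first.\n")

# One row per run in a master CSV: sortable, graphable, and greppable.
master = os.path.join(base, "experiments.csv")
with open(master, "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["date", "sample", "ph", "result"])
    w.writerow(["20100815", "A12", 7.4, 0.482])

with open(master) as f:
    print(sum(1 for _ in f))  # → 2 (header + one run)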
SciDB, Open Source DB for Science (Score:2, Interesting)
http://www.scidb.org/ [scidb.org]
Use a document management system (Score:3, Informative)
I would recommend just downloading a VM or cloud image of something like Knowledge Tree or Alfresco (I personally prefer Alfresco), and running it on the free VMware Player, or a real VM solution if you have one.
I recently set up a demo showing the benefits of such a system. I was able, in about one day, to download and set up Alfresco, expose a CIFS interface (i.e., \\192.168.x.x\documents) and just dump a portion of my entire document base into the system. After digestion, the system had all the documents full-text-indexed (yes, even Word docs and Excel files, thanks to OpenOffice libraries), and I could go about changing the directory structure, moving around and renaming files, etc., and the source control would show me the changes. In fact, I could go into the backend and write SQL queries if I wanted, with detailed reports of how things were on date X or Y revisions ago. Was quite sweet. All the while, the users still saw the same Windows directory structure, and modifications they made there would be versioned and recorded in Alfresco's database.
Here is a bitnami VM image [bitnami.org], will save you days of configuration. If the solution works for you, but is slow, just DL the native stack and migrate or re-import.
TiddlyWiki (Score:2)
A personal wiki that runs from one file, I link my files from there and I can add documentation and references at will. http://www.tiddlywiki.com/ [tiddlywiki.com]
Directory names : experiment number (Score:2)
Just give all your datasets a number and put them in a database so you can search on all criteria you want.
You can also use Excel to keep track of your metadata if you want...
Great comments (Score:3, Insightful)
Reading these comments has changed my thoughts on data storage a little bit, but has reinforced my idea that databases are a bad idea for this sort of thing.
The main issues I have with using databases are file size (I store and convert text files that are 10-100MB zipped), and mutability (generated data doesn't typically change, I just add new experiments on top of other data). A secondary issue is that for plain-text data files (or plain-text convertible data files), writing code is easier when you don't have to bother about a database middleman.
So, if I were to do [another] large research project in the future, here are my thoughts on what I would consider an appropriate approach:
My most common uses for old data are re-running analyses (generating new data as results), and sending data to someone else. It helps to be able to make those things as quick as possible.
It really depends, be more specific (Score:3, Interesting)
I am a programmer, who works closely with scientists in scientific computing in the fields of fluid mechanics simulation, and aerodynamics simulation.
Your question is really not clear. In both of the fields I work in, the requirements vary vastly, and they also vary across the users I support (over 100 scientists). Some of them have huge data sets of up to 600 GB per file; a single simulation run can give a geologist a 1 TB file.
Others, have a few hundred MB of data. Each is handled differently.
The data itself can be parsed and stored in a DB for analysis in some cases, while in others that is very impractical and will slow down your work.
Each scientist has a different way of doing things.
So the bottom line: if you want any useful answers, be more specific. What field of science (I can tell you are a chemist?), what simulations/tests do you use, how fine are your models and your data sets and what is their format, and what kind of data are you interested in? You should also seriously consider an archiving solution, because I guarantee you will run out of space.
Re: (Score:2)
Re: (Score:2)