Building and Maintaining Large, Collaborative Databases? 32
hherb asks: "We are in the process of building and maintaining a free pharmaceutical reference database, in order to liberate medical decision support systems from vendor driven databases. For that purpose, we need some way of allowing multiple authors to contribute to a large number of data records - most of them small...too small for CVS. We need version tracking as well as authentication of authors. We need to tag every bit of information enterrred with information about source reference, author, peer review result, and so forth. I had a look at existing version tracking software. like CVS and Subversion, and I did not have the impression that any of them would suit our needs. Does anyone have ideas for *free* software solutions that we can use?"
Free as in Beer ? (Score:2)
I guess I don't understand why CVS won't work for you, I assume you create a record for each drug/whatever and then track them... So you end up with a flat directory structure that holds all of 1,000,000,000 records. If CVS won't work for you maybe you should look i
Re:Free as in Beer ? (Score:1)
Hence, we need a web interface and cannot use a more suitable tool like a XML editor which would enforce compliance of entered data with a DTD.
For medicolegal and other reasons, we need to be able to backtrack which junk of information has been modified by whom and when.
Currently, as you can
Re:Free as in Beer ? (Score:1)
wikipedia is a good example (Score:1)
Re:wikipedia is a good example (Score:2)
The Wiki Way (Score:2)
(Warning! A lot of remainders vendors have cheap copies of this book, even though it's still in print. The CDs on these copies are screwed up, but are usable if you know to convert the Perl scripts from Macintosh format!)
Re:wikipedia is a good example (Score:1)
Sounds like simple data entry (Score:2)
Re:Sounds like simple data entry (Score:1)
The problem is the version tracking - assigning authorship to the differences between records - since no matter how normalized a record (table row) is, it might always have a number of authors. OTOH, this tracking/auditing must not obfuscate the record text (inline tagging not feasi
Re:Sounds like simple data entry (Score:1)
The problem is the version tracking - assigning authorship to the differences between records - since no matter how normalized a record (table row) is, it might always have a number of authors. OTOH, this tracking/auditing must not obfuscate the record text (inline tagging not feasible)
Alas, not as simple as it looks.
Unless I'm missing something, it should be simple. I know there are other solutions out there but one solution I would suggest is like so:
DIY (Score:2)
This isn't very difficult, and anyone with data modeling experience can pull it off. As far as free software goes, just use PostgreSQL or MySQL and a PERL or PHP or JSP front end. Problem solved.
Re:DIY (Score:1)
Just copying the whole record for autit purposes does not help, since we need to assign authorship to the changes, not to the whole record.
Re:DIY (Score:1)
You could store a whole new copy of the record along with the set of people responsible for that version. Each author entry can have an associated revision comment. That would probably be enough of a "paper trail" that the group of responsible people is small enough for an investigation to flush out the remaining details.
It seems that if the record is small, then th
PostGreSQL (Score:3, Informative)
You might want to try an open-source relational database engine like PostGreSQL [postgresql.org]
This means, of course, you'll have to create your own schemata for logging version control and records added, but versioning additions to the database -- and user/connect information -- can be tracked with a combination of triggers and timestamps in a separate table. These are fairly standard techniques [goucher.edu] in generating an audit trail [cosyn.co.nz] for tracking relational database changes and enforcing data integrity constraints. While the syntax varies from database engine to database engine--PostGreSQL is a good (free) place to start, and if the spirit moves you (or if the DB becomes too large) the syntax differences do not preclude your moving to, say, a development edition of IBM's DB2 UDB [ibm.com] or even (aack!) Oracle [oracle.com]. All three run rather nicely under SuSE Linux, and are said to run quite nicely under RedHat, as well.
It probably wouldn't harm you at all to develop this database with a relational database, and test it under different engines-- in fact, it would likely make you highly employable in the very near future.
Re:PostGreSQL (Score:1)
However, the granularity of versioning and audit trailing typically used in relational daabases does not suit our needs, see my coments to other replies.
As far as "being employable" goes: I don't. I do employ people, including doctors, since I am
Re:PostGreSQL (Score:2)
Zope (Score:2)
It sounds like you're really talking about content management more than version control. Maybe Zope [zope.org] and Plone [plone.org] will do it for you? Out of the box it gives you versioning, authentication and a decent database. The database isn't the most scalable, but you can extend it to use any number of database and database-like backends, like MySQL, PostgreSQL, Berkeley, and various pay-for dbs.
A FOIA project (Score:2)
The VA has the VistA project, which covers pretty much every aspect of medicine. The application is based on 30 year old technology known as M (aka MUMPS, which evolved into "Caché".) The source code is not GPL, it is available because of the Freedom Of Information Act (FOIA). And IIRC I believe RMS said a long time ago that the VA FileMan code was GPL-compatible.
The normal audit trail for record addition/modidifcation sounds like it would cover your needs. But more likely, th
Re:A FOIA project (Score:1)
There are parts of the VistA system which are not included in the FOIA release. The poster's primary interest (medication instructions) would appear to be one of those bits. The bits which aren't covered are primarily information copyrighted by other organizations (like the AMA).
If you aren't already an experienced M programmer, I would recommend having a large bottle of your favorite painkiller handy for when you review the code, it is pret
Re:A FOIA project (Score:1)
Content Management system, not Version Control (Score:2)
Try one of the following for size:
Tiki [sourceforge.net]
phpWebSite [sourceforge.net]
OpenDocMan [sourceforge.net]
Your workflow requirements seem to be a lot more sophisticated compared to mo
Re:Content Management system, not Version Control (Score:2)
eZ Publish [ez.no] from ez.no, and a bunch of others listed at OSCOM [oscom.org]
But I would stay well away from some of the new stuff. There are plenty of CMS's out there without a project needing to use something that is new and is a duplicate of an existing product.
Use File Loading Processes (Score:3, Interesting)
The way we cope with this problem is that each data source is given a code (it's usually just the filename). We have a Perl program parse these files (they're comma delimited ascii, tab delimited, DBase IV, etc., and some even human-readable reports), and load the database with the contents. Each record includes the source ID, for easy attribution / tracking.
We keep each file version for a while. Each file has a business date so if they want to clobber a previous version of that data, they get to do so.
This could keep your troubles to a miminum. Write a parser and have a file-upload site that lets people upload data. Define a group of people if you want. They should only be able to add/replace/delete their own data by the nature of the file. Each group can only create a certain filename or the group id is in the file.
This way, many can share database updates via a batch run where updates are tracked and possibly even approved before committing them.
CMS (Score:1)
Few questions: (Score:2)
How are you going to select your contributors? How you are going check for the correctness of their data input?
How do you plan to solve the problem with synonymous keywords in the search? [Good old- fashion Chemical Abstract database published realy extensive indexes on general subject keywords - apart from indexes on the substances, chemical core structures etc. You will need to do this and it is a lot of awful work]
The main problem is you will need a
Re:Few questions: (Score:1)
Correctness of data is checked by peer review.
We have literally hundreds of volunteering qualified physicians and pharmacologists to perform data entry and peer review - any hands make any work small.
The medicolegal aspect of it is exactly the reason why we need a finer granularity of database auditing than most projects do.
Our intended "customers" (our work is free in every aspect!) are health professionals wanting to use independent medical software and decision sup
Try TWiki (Score:1)
TWiki is a Wiki variant that adds RCS for every page that is editable, the ability to have access control lists for different pages, yet like Wiki has an "edit this page" button, and is very accessible to those you are not familiar with HTML. TWiki also has a large base of "plugins" that allow you to so
What other functionality will you need? (Score:2, Informative)
Bugzilla (Score:3, Informative)
I have a modified version of Bugzilla running on Windows (W2KAS/IIS) that is being tested for something vaguely similar. It works great. The hard part was getting it to work on IIS with ActivePerl - if you're running Apache/Linux or BSD it won't be nearly as big a chore. I didn't even have to touch the MySQL schemas.
Wiki? (Score:1)
Also has version control features, peer review, tags etc etc.
HTH.
Oz
Use a relational DB as a delta-system (Score:1)
drugs:
drugID (unique, autoincrement, long/int)
drugName (text)
users:
userid (unique, autoincrement, long/int)
name (text)
securityLevel (if you want multiple levels, if not everyone's the same level)
data:
dID (same as above)
uID (same as above)
date (timestamp)
added (text)
deleted (text)
rating (float)
Then each addition is run through the UNIX command
Content management systems (CMS) not CVS... (Score:3, Informative)
Of all of these, I like OpenACS the best, mostly because of its developer community. There are a lot of great people involved, and there's a high signal to noise ratio on the developer forums. Even though OpenACS probably has the least of what you're looking for, it might be the easiest to develop. OpenACS runs on top of Postgres or Oracle, and is written in Tcl.
Redhat CCM is basically a Java rewrite of the original OpenACS. Its CMS modules are supposedly more mature. It runs on a Redhat version of Postgres, and I think Oracle too.
Zope is a whole lotta product, and probably has most of what you're looking for. However, I find it kind of murky, difficult to figure out. YMMV.
These three are the most promising in terms of developer community. This is a bigger undertaking than it might seem at the outset. You'll need all the help you can get, and getting involved with these communities will spare you from trying to reinvent the wheel.
Of course, I'd love to have you guys use and extend the OpenACS toolkit, and share your efforts with the community!