Building a Searchable Literature Archive With Keywords?
Sooner Boomer writes "I'm trying to help drag a professor I work with into the 20th century. Although he is involved in cutting-edge research (nanotechnology), his method of literature search is to begin with digging through the hundreds of 3-ring binders that contain articles (usually from PDFs) that he has printed out. Even though the binders are labeled, the articles can only go under one 'heading' and there's no way to do a keyword search on subject, methods, materials, etc. Yeah, Google is pretty good for finding stuff, as are other on-line literature services, but they only work for articles that are already on-line. His literature also includes articles copied from books, professional correspondence, and other sources. Is there a FOSS database or archive method (preferably with a web interface) where he could archive the PDFs and scanned documents and be able to search by keywords? It would also be nice to categorize them under multiple subject headings if possible. I know this has been covered ad nauseam with things like photos and the like, but I'm not looking at storage as such: instead I'm trying to find what's stored."
Document Management Software and OCR (Score:5, Informative)
From the sound of it, you want to verify that your product supports document tagging (not unlike Slashdot's tagging system I guess) so that he can attach his categories to documents as he puts them in (or more likely as you do the manual labor, right?).
... where he could archive the PDFs and scanned documents and be able to search by keywords?
So, my big concern is the part where you said he scans things from books and articles and so some of the PDFs might just be massive images, right? I don't think you're going to find systems with OCR built in so you might have quite the chore on your hands. If you don't have it electronically or if it's just an image electronically, you may have to implement some sort of process for getting a doc into this system so it can be searched, right? Look into GOCR [sourceforge.net] or Tesseract [linuxjournal.com] if this is the case.
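If you do end up with image-only scans, the "process for getting a doc into this system" could be as small as a script wrapped around the Tesseract command line. Here is a minimal Python sketch. It assumes a reasonably recent Tesseract build that ships the `pdf` output config (older releases only write plain text) and that each scan is already a single image file. The function names are my own, not part of any tool.

```python
import shutil
import subprocess
from pathlib import Path


def tesseract_command(image: Path, out_base: Path, lang: str = "eng") -> list[str]:
    """Build the command line that asks Tesseract for a searchable PDF."""
    # Form: tesseract IMAGE OUTPUTBASE -l LANG pdf
    # Tesseract appends ".pdf" to OUTPUTBASE itself.
    return ["tesseract", str(image), str(out_base), "-l", lang, "pdf"]


def ocr_to_pdf(image: Path, out_dir: Path, lang: str = "eng") -> Path:
    """Run Tesseract on one image and return the path of the PDF it wrote."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not on PATH")
    out_dir.mkdir(parents=True, exist_ok=True)
    out_base = out_dir / image.stem
    subprocess.run(tesseract_command(image, out_base, lang), check=True)
    # Append rather than replace the suffix, so "scan.v2" becomes "scan.v2.pdf".
    return out_base.parent / (out_base.name + ".pdf")
```

Calling `ocr_to_pdf(Path("binder3/page01.png"), Path("archive"))` (made-up paths) would leave `archive/page01.pdf` with a text layer that desktop search tools can index. Recognition quality depends heavily on the scan, so spot-check the output.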
Also, judging by your nickname ("Sooner Boomer"), you're at the University of Oklahoma. Why in the world would you name yourself after a group of people [wikipedia.org] who not only disobeyed the Indian Appropriation Act but also moved out onto Native American territory before it was officially declared property of the United States? And then you also chose "Boomer" which refers to "white settlers who believed the Unassigned Lands were public property and open to anyone for settlement, not just Indian tribes. Their reasoning came from a clause in the Homestead Act of 1862, which said that any settler could claim 160 acres of public land. Some boomers entered and were removed more than once by the United States Army." If you are a descendant of either a Sooner or a Boomer, I respectfully do not agree with their actions.
Try Papers (Score:4, Informative)
Zotero (Score:3, Informative)
Zotero [zotero.org] might be useful.
Quick and dirty solution (Score:3, Informative)
So, what I think you're asking for is... (Score:5, Informative)
...something like this:
1. You want to be able to store documents that currently exist electronically, and also handle documents you're going to scan. The latter may, or may not, be OCR'd.
2. You want to attach keywords to the articles, and be able to bring up a list of articles that match some arbitrary combination of these keywords.
3. Full-text search isn't as important (but would be useful if available).
If that's the case, I'm thinking Alfresco [alfresco.com] might be what you're looking for. Multi-platform, open source, java-based content repository. Supports document tagging (and loads, loads more). Relatively easy to use right out of the box, and has a CIFS interface so you can just create a project and simply tree-copy your current documents into the project. Don't let the "enterprise" designation on the software scare you away.
I've actually considered going that route for my own personal document library, but while Alfresco might be one of the only good solutions, it's like killing a fly with a cannon.
I'm frankly amazed that with the "paperless living" meme currently going through the productivity circles that someone hasn't come up with a simple tool to do something just like what you're looking for: point it at a root folder, let it suck in all the files, then start tagging away. Search with keywords or filenames or both, and provide a clickable list of hits. Full-text search isn't needed, as there's already a ton of tools out there that'll happily index your hard drive for you.
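For what it's worth, the "point it at a root folder, tag, then search" tool described above is small enough to sketch with Python's standard-library sqlite3 module. This is only a rough sketch, not an existing product: the table layout and the function names (open_index, scan_folder, add_tags, search) are all made up for illustration, and it only picks up .pdf files unless told otherwise.

```python
import sqlite3
from pathlib import Path

SCHEMA = """
CREATE TABLE IF NOT EXISTS docs (id INTEGER PRIMARY KEY, path TEXT UNIQUE NOT NULL);
CREATE TABLE IF NOT EXISTS tags (doc_id INTEGER NOT NULL REFERENCES docs(id),
                                 tag TEXT NOT NULL,
                                 UNIQUE (doc_id, tag));
"""


def open_index(db_path=":memory:"):
    """Open (or create) the index database."""
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    return conn


def scan_folder(conn, root, suffixes=(".pdf",)):
    """Register every matching file under root; return how many were seen."""
    count = 0
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in suffixes:
            conn.execute("INSERT OR IGNORE INTO docs (path) VALUES (?)", (str(path),))
            count += 1
    conn.commit()
    return count


def add_tags(conn, path, *tags):
    """Attach any number of tags to an already-indexed file."""
    row = conn.execute("SELECT id FROM docs WHERE path = ?", (str(path),)).fetchone()
    if row is None:
        raise KeyError(f"not indexed: {path}")
    conn.executemany(
        "INSERT OR IGNORE INTO tags (doc_id, tag) VALUES (?, ?)",
        [(row[0], t.strip().lower()) for t in tags],
    )
    conn.commit()


def search(conn, tags=(), name_contains=""):
    """Return paths that carry ALL given tags and whose filename contains the text."""
    wanted = sorted({t.strip().lower() for t in tags})
    sql = "SELECT d.path FROM docs d"
    params = []
    if wanted:
        marks = ",".join("?" * len(wanted))
        sql += (f" JOIN tags t ON t.doc_id = d.id AND t.tag IN ({marks})"
                " GROUP BY d.id HAVING COUNT(DISTINCT t.tag) = ?")
        params += wanted + [len(wanted)]
    sql += " ORDER BY d.path"
    rows = conn.execute(sql, params).fetchall()
    needle = name_contains.lower()
    return [p for (p,) in rows if needle in Path(p).name.lower()]
```

Because tags live in their own table, one article can sit under as many headings as you like, which is exactly what the binders can't do. Something like `search(conn, ["nanotubes", "afm"])` returns only the files tagged with both, and full-text search is left to whatever desktop indexer you already run.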
And, if a tool like that exists, could someone point me to it, please?
Re:Document Management Software and OCR (Score:3, Informative)
For an Open Source DMS that generates searchable PDF Files, try ArchivistaBox: http://sourceforge.net/projects/archivista/ [sourceforge.net]
Tesseract (including Fraktur / black-letter recognition) and the Linux port of Cuneiform (BSD licence) OCR engines are used for text recognition. The hocr2pdf module (see http://www.exactcode.de/ [exactcode.de]) is used to generate the searchable PDF files.
(http://sourceforge.net/forum/forum.php?forum_id=868471)
Re:Zotero (Score:2, Informative)
I'll second this one. I'm a doctoral student, and have been using it to handle my research. A nice, simple firefox-based interface. It'll snarf references right off pages from search engines. You can attach things, including links to or copies of pdfs to those references, summaries, etc. You can apply keyword tags to citations, and you can organize the citations into a nice directory tree.
To get them out, there's a sweet interface available for open office. I think it's also available for Word, but I use M$ as little as possible.
The only real downsides are copying between computers and citation formats.
Copying is actually easier than it is with the other reference managers I've tried (yeah, I'm talking about you, refworks (bleaargh!) and end note (urrrp!)). You may have to do it more than you would with others, but it's easy to do when you need to. You can export some or all your references to a file, sneakernet the file to the new computer, import it into zotero, and you're done.
There are a lot of citation formats currently available, they just don't happen to be the ones I need. However, there's one close, and the system is designed to be extensible; it's not *that* hard to add your own styles. As soon as I get a round tuit I'll be adding styles for the journals I'll be submitting to, and contribute those back to the project.
Like I said, really sweet, and free.
Zotero, Mendeley (Score:2, Informative)
Zotero is a Firefox extension that can grab research papers directly from the journal or library web sites. It organizes the papers in collections, has keywords (they call them tags), and can automatically index the PDFs. The metadata is also stored on a remote server and you can browse through it using a web interface. You also get Word and OpenOffice plugins to insert citations in the papers you write. The plugins are a little rough around the edges, but are usable. The reference formatting is very robust and comes with styles for a lot of journals.
Mendeley is a stand-alone application. I haven't tried it yet, but it seems to have very similar functionality.
BibDesk (Score:1, Informative)
Re:Zotero (Score:2, Informative)
Re:Document Management Software and OCR (Score:2, Informative)
Archivista seems to be a solution worth looking into for him. I guess he has to try and install the software and test it himself, because the explanation on the (English) website is almost incomprehensible -- I understand spoken German, but mixing German and English, ugh!
I work with Alfresco, which is a nice DMS, but without the integrated OCR that Archivista seems to provide. Alfresco can integrate with various OCR solutions though, has a very active community -- and a comprehensible website!
The bigger problem is not OCR, it's which scanner. (Score:3, Informative)
The big attraction of the fi-6130 is its speed: 40 pages per minute.
If you are interested, I suggest you download the manual [fujitsu.com]. (PDF)
The manual talks about a connection for an "imprinter", which sounds as though it is a printer that works only with that particular Fujitsu scanner. That causes me to doubt whether buying from Fujitsu would be a good idea; we don't want to get involved with corporate marketing drone foolishness. Everything else, however, looks quite good.
The scanner comes with OCR software. I suppose and hope that Fujitsu did a lot of work and found the best OCR software.
The scanner software makes a PDF. The OCR software tries to recognize the words, so that the software can make a searchable PDF. Even if the OCR recognition isn't perfect, it can be very useful.
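To illustrate why imperfect OCR is still useful: a search tool can tolerate a misread letter or two by matching approximately instead of exactly. A small Python sketch using the standard-library difflib module follows. The function name, the sample text and the 0.8 cutoff are my own choices for illustration, not anything the Fujitsu software is known to do.

```python
import difflib
import re


def fuzzy_find(keyword, ocr_text, cutoff=0.8):
    """Return words in the OCR text that closely resemble the keyword."""
    # Break the text into distinct lowercase words.
    words = sorted(set(re.findall(r"[a-z0-9]+", ocr_text.lower())))
    # get_close_matches ranks candidates by similarity ratio (0..1).
    return difflib.get_close_matches(keyword.lower(), words, n=5, cutoff=cutoff)


# "nanotubcs" is a simulated OCR misread of "nanotubes".
page = "Carbon nanotubcs grown by chemical vapour deposition"
hits = fuzzy_find("nanotubes", page)
```

An exact search for "nanotubes" would miss this page, while the approximate match still finds the misread word. Lowering the cutoff catches more errors at the price of more false hits.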
It seems to me that the Fujitsu fi-6230, suggested in the parent comment, is a poor design. It combines an automatic sheetfed scanner with a flatbed scanner for a lot more money. That doesn't make sense, since the attractive feature of the sheetfed scanner is its speed. Speed is important with a flatbed scanner, but not as important, since the operation will always be manual. It seems to me that it would be better to have a flatbed scanner that is a separate piece of equipment, rather than two pieces seemingly glued together, without any logical connection, since apparently the fi-6230 has two imaging elements.
Be careful about using Windows Search, as suggested in the parent comment. The Windows XP version is buggy, and sometimes won't look into files that are there. We use VCOM's PowerDesk pdfind.exe program, an older version of which is free. We also use Funduc Software's Search and Replace program.
Most scanners are quite slow, don't have automatic document feeders that allow scanning of papers of widely different sizes, and don't build OCR'd indexes inside the PDF files.
Re:The bigger problem is not OCR, it's which scann (Score:2, Informative)
Re:Reference management software (Score:2, Informative)
You can also set your preferences in Google Scholar to provide a link to the paper's citation in BibTeX formatting. It isn't always 100% accurate, but it does save a lot of data entry time.
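Once you have a citation as BibTeX text, pulling the fields out so they can go into whatever database you settle on takes only a few lines. Below is a minimal Python sketch, assuming the simple one-field-per-line layout that such exports typically use. It is not a full BibTeX parser (no @string macros, no values spanning several lines), and the sample entry is a made-up placeholder, not a real paper.

```python
import re

# Matches the entry header, e.g. "@article{somekey,"
ENTRY_RE = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")
# Matches one "name = {value}," or "name = "value"," line
FIELD_RE = re.compile(r'(\w+)\s*=\s*[{"]?(.*?)[}"]?\s*,?$')


def parse_bibtex_entry(text):
    """Turn one simple BibTeX entry into a dict of lowercase field names."""
    head = ENTRY_RE.search(text)
    if head is None:
        raise ValueError("no BibTeX entry found")
    record = {"type": head.group(1).lower(), "key": head.group(2)}
    for line in text[head.end():].splitlines():
        m = FIELD_RE.match(line.strip())
        if m:
            record[m.group(1).lower()] = m.group(2)
    return record


# Placeholder entry for illustration only.
sample = """@article{doe2009example,
  title={An Example Title},
  author={Doe, Jane and Roe, Richard},
  journal={Journal of Placeholder Results},
  year={2009}
}"""
record = parse_bibtex_entry(sample)
```

The resulting dict (title, author, journal, year and so on) still deserves a quick look by a human before it is stored, for the accuracy reason given above.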
Re:Document Management Software and OCR (Score:1, Informative)
And as a current reference librarian at a major research institution I'll add my two cents.
The journal articles that probably make up the bulk of his printed PDFs are already online, and professionally indexed, in the databases that his institution subscribes to or in Open Access repositories. He could use Zotero to store a link to the original in that database and keep a local, full-text-indexed copy on his hard drive. And you won't have to worry about OCR. Depending on the field, even many of the book chapters may be available as ebooks that could be handled the same way.
He could of course use repository software like DSpace or Fedora to create his own repository, but why bother when the end result is ultimately redundant? Not to mention the fact that he won't be the copyright owner for much of what he archives, so sharing the fruit of all this work could have legal complications.
And besides, if he's like most faculty, the contents of that binder are probably out-of-date, no matter how beloved they are.