Building a Searchable Literature Archive With Keywords?

Posted by timothy on Wednesday April 08, 2009 @04:20PM from the must-be-in-here-somewhere dept.

Sooner Boomer writes "I'm trying to help drag a professor I work with into the 20th century. Although he is involved in cutting-edge research (nanotechnology), his method of literature search is to begin with digging through the hundreds of 3-ring binders that contain articles (usually from PDFs) that he has printed out. Even though the binders are labeled, the articles can only go under one 'heading' and there's no way to do a keyword search on subject, methods, materials, etc. Yeah, Google is pretty good for finding stuff, as are other on-line literature services, but they only work for articles that are already on-line. His literature also includes articles copied from books, professional correspondence, and other sources. Is there a FOSS database or archive method (preferably with a web interface) where he could archive the PDFs and scanned documents and be able to search by keywords? It would also be nice to categorize them under multiple subject headings if possible. I know this has been covered ad nauseam with things like photos and the like, but I'm not looking at storage as such: instead I'm trying to find what's stored."

Document Management Software and OCR (Score:5, Informative)

by eldavojohn ( 898314 ) * writes: <eldavojohn@noSpAM.gmail.com> on Wednesday April 08, 2009 @04:21PM (#27508859) Journal

I think what you are looking for is something called "document management [google.com]" software. As far as FOSS goes, KnowledgeTree [knowledgetree.com] offers a community version that might be down your alley. They have an online demo if you're interested. There's also Alfresco [alfresco.com] but I haven't tried either of these.

From the sound of it, you want to verify that your product supports document tagging (not unlike Slashdot's tagging system I guess) so that he can attach his categories to documents as he puts them in (or more likely as you do the manual labor, right?).

... where he could archive the PDFs and scanned documents and be able to search by keywords?
So, my big concern is the part where you said he scans things from books and articles and so some of the PDFs might just be massive images, right? I don't think you're going to find systems with OCR built in so you might have quite the chore on your hands. If you don't have it electronically or if it's just an image electronically, you may have to implement some sort of process for getting a doc into this system so it can be searched, right? Look into GOCR [sourceforge.net] or Tesseract [linuxjournal.com] if this is the case.
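A rough sketch of such an intake step, assuming the scans are saved as PNG or TIFF pages and that the `tesseract` command-line tool is installed and on the PATH. The function names and the folder layout are made up for illustration; only the basic `tesseract <image> <outputbase> -l <lang>` invocation and the optional `pdf` config come from Tesseract itself, and the `pdf` output needs a reasonably recent build.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image, out_base, lang="eng", searchable_pdf=False):
    """Return the argv list for one Tesseract run (nothing is executed here)."""
    cmd = ["tesseract", str(image), str(out_base), "-l", lang]
    if searchable_pdf:
        # The "pdf" config asks for a PDF with a hidden text layer
        # instead of the default plain-text output.
        cmd.append("pdf")
    return cmd


def ocr_scans(scan_dir, out_dir, lang="eng"):
    """OCR every PNG/TIFF page in scan_dir; return the text files written."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not on PATH")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for image in sorted(Path(scan_dir).iterdir()):
        if image.suffix.lower() not in {".png", ".tif", ".tiff"}:
            continue
        out_base = out_dir / image.stem
        subprocess.run(build_tesseract_command(image, out_base, lang), check=True)
        # Tesseract appends ".txt" to the output base name itself.
        written.append(Path(str(out_base) + ".txt"))
    return written
```

The resulting text files can then be fed to whatever search or document management system ends up holding the archive; OCR errors will still need spot checks.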

Also, judging by your nickname ("Sooner Boomer"), you're at the University of Oklahoma. Why in the world would you name yourself after a group of people [wikipedia.org] who not only disobeyed the Indian Appropriation Act but also moved out onto Native American territory before it was officially declared property of the United States? And then you also chose "Boomer" which refers to "white settlers who believed the Unassigned Lands were public property and open to anyone for settlement, not just Indian tribes. Their reasoning came from a clause in the Homestead Act of 1862, which said that any settler could claim 160 acres of public land. Some boomers entered and were removed more than once by the United States Army." If you are a descendant of either a Sooner or a Boomer, I respectfully do not agree with their actions.

Try Papers (Score:4, Informative)

by matt4077 ( 581118 ) writes: on Wednesday April 08, 2009 @04:26PM (#27508955) Homepage

Papers [mekentosj.com] is Mac software that does exactly what you need, and does it very well. Unfortunately it's not web-based and it's Mac-only, but looking at it will probably tell you the right terms to google for.

Zotero (Score:3, Informative)

by k2enemy ( 555744 ) writes: on Wednesday April 08, 2009 @04:35PM (#27509107)

Zotero [zotero.org] might be useful.

Quick and dirty solution (Score:3, Informative)

by oldhack ( 1037484 ) writes: on Wednesday April 08, 2009 @04:40PM (#27509205)

Assuming you have electronic versions of the documents in one format or another, stick them all in a file system and use desktop search (MS or Google). More than that and you're looking at a good bit of time and money.
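For anyone who would rather not depend on a desktop search product, here is a minimal sketch of the same idea in Python. It assumes the documents already exist as plain-text files (for example OCR output) under one folder; `build_index` and `search` are illustrative names, and PDFs would need a text-extraction step first.

```python
import os
import re
from collections import defaultdict


def build_index(root, extensions=(".txt",)):
    """Map each lowercase word to the set of files under root containing it."""
    index = defaultdict(set)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                for word in re.findall(r"[a-z0-9]+", handle.read().lower()):
                    index[word].add(path)
    return index


def search(index, query):
    """Return sorted paths of files that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)
```

The index lives in memory and is rebuilt on every run, which is fine for a few thousand files but not a substitute for a real indexer.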

So, what I think you're asking for is... (Score:5, Informative)

by Basilius ( 184226 ) writes: on Wednesday April 08, 2009 @05:00PM (#27509515)

...something like this:
1. You want to be able to store documents that currently exist electronically, and also handle documents you're going to scan. The latter may, or may not, be OCR'd.
2. You want to attach keywords to the articles, and be able to bring up a list of articles that match some arbitrary combination of these keywords.
3. Full-text search isn't as important (but would be useful if available).
If that's the case, I'm thinking Alfresco [alfresco.com] might be what you're looking for. Multi-platform, open source, java-based content repository. Supports document tagging (and loads, loads more). Relatively easy to use right out of the box, and has a CIFS interface so you can just create a project and simply tree-copy your current documents into the project. Don't let the "enterprise" designation on the software scare you away.
I've actually considered going that route for my own personal document library, but while Alfresco might be one of the only good solutions, it's like killing a fly with a cannon.
I'm frankly amazed that with the "paperless living" meme currently going through the productivity circles that someone hasn't come up with a simple tool to do something just like what you're looking for: point it at a root folder, let it suck in all the files, then start tagging away. Search with keywords or filenames or both, and provide a clickable list of hits. Full-text search isn't needed, as there's already a ton of tools out there that'll happily index your hard drive for you.
And, if a tool like that exists, could someone point me to it, please?
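The tagging half of such a tool is small enough to sketch. The following assumes one SQLite table with a (file, tag) row per keyword, which lets an article sit under any number of headings; the function names and the schema are invented for illustration, and folder import, a user interface, and full-text search are left out.

```python
import sqlite3


def open_store(path=":memory:"):
    """Create (or open) the tag database and return the connection."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS tags ("
               "file TEXT NOT NULL, tag TEXT NOT NULL, "
               "PRIMARY KEY (file, tag))")
    return db


def tag_file(db, file, *tags):
    """Attach any number of tags to one file; repeats are ignored."""
    db.executemany("INSERT OR IGNORE INTO tags (file, tag) VALUES (?, ?)",
                   [(file, tag.lower()) for tag in tags])
    db.commit()


def find(db, *tags):
    """Return files carrying ALL the given tags, sorted by path."""
    wanted = sorted({tag.lower() for tag in tags})
    if not wanted:
        return []
    marks = ",".join("?" * len(wanted))
    rows = db.execute(
        f"SELECT file FROM tags WHERE tag IN ({marks}) "
        "GROUP BY file HAVING COUNT(DISTINCT tag) = ? ORDER BY file",
        (*wanted, len(wanted)))
    return [row[0] for row in rows]
```

Usage would look like `tag_file(db, "binder3/cnt-growth.pdf", "nanotubes", "CVD")` followed by `find(db, "nanotubes", "cvd")`. Tags are lowercased on the way in and on lookup, so matching is case-insensitive.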

Re:Document Management Software and OCR (Score:3, Informative)

by burki ( 32245 ) writes: on Wednesday April 08, 2009 @05:06PM (#27509595)

For an Open Source DMS that generates searchable PDF Files, try ArchivistaBox: http://sourceforge.net/projects/archivista/ [sourceforge.net]
Tesseract (including Fraktur / black-letter recognition) and the Linux port of Cuneiform (BSD licence) OCR engines are used for text recognition. The hocr2pdf module (see http://www.exactcode.de/ [exactcode.de]) is used to generate the searchable PDF files.
(http://sourceforge.net/forum/forum.php?forum_id=868471)

Re:Zotero (Score:2, Informative)

by hnwombat ( 172691 ) writes: on Wednesday April 08, 2009 @05:36PM (#27510047)

I'll second this one. I'm a doctoral student, and have been using it to handle my research. A nice, simple firefox-based interface. It'll snarf references right off pages from search engines. You can attach things, including links to or copies of pdfs to those references, summaries, etc. You can apply keyword tags to citations, and you can organize the citations into a nice directory tree.
To get them out, there's a sweet interface available for open office. I think it's also available for Word, but I use M$ as little as possible.
The only real downsides are copying between computers and citation formats.
Copying is actually easier than it is with the other reference managers I've tried (yeah, I'm talking about you, refworks (bleaargh!) and end note (urrrp!)). You may have to do it more than you would with others, but it's easy to do when you need to. You can export some or all your references to a file, sneakernet the file to the new computer, import it into zotero, and you're done.
There are a lot of citation formats currently available, they just don't happen to be the ones I need. However, there's one close, and the system is designed to be extensible; it's not *that* hard to add your own styles. As soon as I get a round tuit I'll be adding styles for the journals I'll be submitting to, and contribute those back to the project.
Like I said, really sweet, and free.

Zotero, Mendeley (Score:2, Informative)

by pesho ( 843750 ) writes: on Wednesday April 08, 2009 @05:56PM (#27510357)

You should try Zotero [zotero.org] or Mendeley [mendeley.com].
Zotero is a Firefox extension that can grab research papers directly from journal or library web sites. It organizes the papers in collections, has keywords (they call them tags), and can automatically index the PDFs. The metadata is also stored on a remote server, and you can browse through it using a web interface. You also get Word and OpenOffice plugins to insert citations in the papers you write. The plugins are a little rough around the edges, but are usable. The reference formatting is very robust and comes with styles for a lot of journals.
Mendeley is a standalone application. I haven't tried it yet, but it seems to have very similar functionality.

BibDesk (Score:1, Informative)

by Anonymous Coward writes: on Wednesday April 08, 2009 @06:09PM (#27510559)

I use BibDesk to organize my research. It's not perfect for that as its basic use is to cite and all that but you can actually import pdfs and tag them, i.e. put them into smart folders. It does have the collaborative approach to organizing data in a flat structure similar to that of delicious.

Re:Zotero (Score:2, Informative)

by yes it is ( 1137335 ) writes: on Wednesday April 08, 2009 @07:48PM (#27511805)

Thirded. I've also built my own phrase index on top of zotero using Perl and OpenCalais.

Re:Document Management Software and OCR (Score:2, Informative)

by RicRoc ( 41406 ) writes: on Wednesday April 08, 2009 @08:32PM (#27512229) Homepage

Archivista seems to be a solution worth looking into for him. I guess he has to try and install the software and test it himself, because the explanation on the (English) website is almost incomprehensible -- I understand spoken German, but mixing German and English, ugh!
I work with Alfresco, which is a nice DMS, but without the integrated OCR that Archivista seems to provide. Alfresco can integrate with various OCR solutions though, has a very active community -- and a comprehensible website!

The bigger problem is not OCR, it's which scanner. (Score:3, Informative)

by Futurepower(R) ( 558542 ) writes: on Wednesday April 08, 2009 @08:49PM (#27512357) Homepage

We've done considerable research on the problem of scanning documents, also, and came to the same conclusion: The new Fujitsu fi-6130 [newegg.com] seems excellent, although we haven't tried it. That model and the 6230 are new, and there is some evidence that waiting for the second version of those models would be a good idea.

The big attraction of the fi-6130 is its speed: 40 pages per minute.

If you are interested, I suggest you download the manual [fujitsu.com]. (PDF)

The manual talks about a connection for an "imprinter", which sounds as though it is a printer that works only with that particular Fujitsu scanner. That causes me to doubt whether buying from Fujitsu would be a good idea; we don't want to get involved with corporate marketing drone foolishness. Everything else, however, looks quite good.

The scanner comes with OCR software. I suppose and hope that Fujitsu did a lot of work and found the best OCR software.

The scanner software makes a PDF. The OCR software tries to recognize the words, so that the software can make a searchable PDF. Even if the OCR recognition isn't perfect, it can be very useful.

It seems to me that the Fujitsu fi-6230, suggested in the parent comment, is a poor design. It combines an automatic sheetfed scanner with a flatbed scanner for a lot more money. That doesn't make sense, since the attractive feature of the sheetfed scanner is its speed. Speed is important with a flatbed scanner, but not as important, since the operation will always be manual. It seems to me that it would be better to have a flatbed scanner that is a separate piece of equipment, rather than two pieces seemingly glued together, without any logical connection, since apparently the 6230 has two imaging elements.

Be careful about using Windows Search, as suggested in the parent comment. The Windows XP version is buggy, and sometimes won't look into files that are there. We use VCOM's PowerDesk pdfind.exe program, an older version of which is free. We also use Funduc Software's Search and Replace program.

Most scanners are quite slow, don't have automatic document feeders that allow scanning of papers of widely different sizes, and don't build OCR'd indexes inside the PDF files.

Re:The bigger problem is not OCR, it's which scann (Score:2, Informative)

by Angstroman ( 747480 ) writes: on Wednesday April 08, 2009 @10:50PM (#27513179)

I have been using an fi-6130 for several months now. It is quite simply the best scanner I have used. It is fast, highly reliable and very seldom misfeeds (1 per 500-800 pages in my experience). I use it for scanning archival financial records and also for technical papers. It includes a copy of Kofax Virtual ReScan, which does a great job of creating readable 1-bit monotone scans of originals with colored backgrounds. There are a number of possible target formats, and it has several automated ways of handling group separator sheets. I highly recommend it. I have seen no evidence of "marketing drone foolishness."

Re:Reference management software (Score:2, Informative)

by cahkaylahlee ( 1298367 ) writes: on Wednesday April 08, 2009 @11:03PM (#27513269)

You can also set your preferences in Google Scholar to provide a link to the paper's citation in BibTex formatting. It isn't always 100% accurate, but it does save a lot of data entry time.
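Once a citation has been copied out in BibTeX form, pulling the fields into a program is straightforward for simple entries. The sketch below assumes one entry whose values are written as `{...}`, `"..."`, or a bare number, with no nested braces; `parse_bibtex_entry` is an illustrative name, and a real BibTeX library is the better choice for anything messier.

```python
import re

# "@type{key," at the start of an entry.
ENTRY_HEAD = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")
# name = {value} | name = "value" | name = 1234
FIELD = re.compile(r"(\w+)\s*=\s*(?:\{([^{}]*)\}|\"([^\"]*)\"|(\d+))")


def parse_bibtex_entry(text):
    """Parse one simple BibTeX entry into a dict (no nested braces)."""
    head = ENTRY_HEAD.search(text)
    if head is None:
        raise ValueError("no BibTeX entry found")
    record = {"type": head.group(1).lower(), "key": head.group(2)}
    for name, braced, quoted, number in FIELD.findall(text[head.end():]):
        # Exactly one of the three value groups is non-empty per match.
        record[name.lower()] = braced or quoted or number
    return record
```

The resulting dict (entry type, citation key, then one item per field) can be stored next to the PDF so that author, title, and year become searchable.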

Re:Document Management Software and OCR (Score:1, Informative)

by Anonymous Coward writes: on Thursday April 09, 2009 @09:51AM (#27517553)

And as a current reference librarian at a major research institution I'll add my two cents.
The journal articles that probably make up the bulk of his printed PDFs are already online, and professionally indexed, in the databases that his institution subscribes to or in Open Access repositories. He could use Zotero to store a link to the original in that database and keep a local, full-text-indexed copy on his hard drive. And you won't have to worry about OCR. Depending on the field, even many of the book chapters may be available as ebooks that could be handled the same way.
He could of course use repository software like DSpace or Fedora to create his own repository, but why bother when the end result is ultimately redundant? Not to mention the fact that he won't be the copyright owner for much of what he archives, so sharing the fruit of all this work could have legal complications.
And besides, if he's like most faculty, the contents of that binder are probably out-of-date, no matter how beloved they are.
