Building a Searchable Literature Archive With Keywords?


Posted by timothy on Wednesday April 08, 2009 @04:20PM from the must-be-in-here-somewhere dept.

Sooner Boomer writes "I'm trying to help drag a professor I work with into the 20th century. Although he is involved in cutting-edge research (nanotechnology), his method of literature search is to begin by digging through the hundreds of 3-ring binders that contain articles (usually from PDFs) that he has printed out. Even though the binders are labeled, each article can only go under one 'heading' and there's no way to do a keyword search on subject, methods, materials, etc. Yeah, Google is pretty good for finding stuff, as are other online literature services, but they only work for articles that are already online. His literature also includes articles copied from books, professional correspondence, and other sources. Is there a FOSS database or archive method (preferably with a web interface) where he could archive the PDFs and scanned documents and be able to search by keywords? It would also be nice to categorize them under multiple subject headings if possible. I know this has been covered ad nauseam for things like photos, but I'm not asking about storage as such: I'm trying to find what's stored."



Re:Document Management Software and OCR (Score:3, Interesting)

by Red Flayer ( 890720 ) writes: on Wednesday April 08, 2009 @04:33PM (#27509073) Journal

I think what you are looking for is something called "document management" software.
Ugh... for a second there, I thought Clippy had started posting to Slashdot. Would you like help with that?

I don't think he's looking for a DMS, which includes lots of things like workflows, audit trails, etc. DMSs are typically used to make an office go paperless, but he's not looking for a processing and tracking mechanism. He's looking for an easy way to create a searchable archive index.

My suggestion? Since he's a professor, get a bunch of students to "help with research" by scanning the docs and OCRing them. If he's willing to shell out a few hundred bucks from his research grant, there are services that will do this... Most of the best OCR tools are proprietary, not open source, but even a crappy one should get enough text that the OCRed files could be indexed usefully.

For an indexer, I've heard good things about MPS. A friend did a similar project to yours with Yaz/Zebra, but he was working with a library, so there may have been a special reason for that choice.

Personal Document Management (Score:4, Interesting)

by steveha ( 103154 ) writes: on Wednesday April 08, 2009 @04:38PM (#27509157) Homepage
I am hoping that someone will make a nice personal document management package as free software.
If you use Windows, you can buy this:
http://www.nuance.com/paperport/ [nuance.com]
The basic features would be:
- Scan in a document (group multiple pages into a single PDF)
- Easily scan a page and insert it into a pre-existing PDF (if you missed a page yesterday, today go back and put it in)
- OCR the documents and provide an index to allow searching
- Provide a really convenient photocopier feature (scan+print)
- Fast and easy. Scan in color, but detect black-and-white and auto-convert to greyscale. Do not pop up any dialogs; when the user clicks on the "Scan!" button, start scanning.
- Also allow dropping in saved HTML pages, OpenOffice.org documents, etc. Manage the user's saved documents, no matter what kind of documents they are.
In a perfect world, the GNOME guys and the KDE guys would both start competing over who can make the slickest product and we all would win.
steveha

Summation (Score:2, Interesting)

by Anonymous Coward writes: on Wednesday April 08, 2009 @04:40PM (#27509193)

Law firms use a program called Summation to do this all the time. They take all the paper docs and electronic docs in a case (sometimes tens of thousands of pages) and load them into this program as TIFFs or PDFs. The documents are then OCRed and searchable. The search is not nearly as good as something like Google, since it is purely Boolean, but it gets the job done. Not sure about cost, but your university may have a license. An alternative is a program called Concordance, which does the same thing. One last option would be to scan everything to OCR-searchable PDFs, throw them into a folder, and set up Google Desktop to search only that folder. You could then essentially "google" the contents of all those PDFs.
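A rough sketch of what "purely Boolean" search over a folder of OCRed text amounts to, for anyone curious. This is not how Summation, Concordance, or Google Desktop work internally; it is a minimal inverted index in Python, assuming the OCR step has already produced one plain-text file per article. The file names, sample text, and function names are all made up for illustration.

```python
import re
from collections import defaultdict
from pathlib import Path


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each word to the set of document names containing it.

    `docs` is a dict of {document name: extracted text}.
    """
    index = defaultdict(set)
    for name, text in docs.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def search_all(index, query):
    """Boolean AND: return the documents containing every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


def load_folder(folder):
    """Read every .txt file in a folder (assumed layout: one OCR text file per article)."""
    return {p.name: p.read_text(errors="ignore") for p in Path(folder).glob("*.txt")}


# Tiny in-memory example; the file names and text are invented.
docs = {
    "smith2007.txt": "Carbon nanotube synthesis by chemical vapor deposition",
    "lee2008.txt": "Gold nanoparticle synthesis and optical characterization",
}
index = build_index(docs)
print(search_all(index, "nanotube synthesis"))
```

To run it against real files, pass the result of `load_folder("some/folder")` to `build_index` instead of the hand-written `docs` dict. Even noisy OCR output tends to be usable here, since a query only needs the keywords themselves to have been recognized.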


Re:Document Management Software and OCR (Score:3, Interesting)

by digitalunity ( 19107 ) writes: <digitalunityNO@SPAMyahoo.com> on Wednesday April 08, 2009 @05:06PM (#27509599) Homepage

Another obvious, more practical but potentially less powerful method is simply to give each printout a serial number and manually build a database that maps tags to those serial numbers.
You then have a library card index system, but electronic. Sure, it won't help with documents that aren't entered properly, but it's dramatically more efficient than thumbing through PDFs until you find what you need.
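The electronic card index described above could look something like this. It is only a sketch, assuming Python's built-in sqlite3 module and a two-table layout of my own choosing (the table names, column names, and sample entries are hypothetical). The point is that one serial number can carry any number of tags, which also covers the "multiple subject headings" wish from the original question.

```python
import sqlite3

# In-memory database for the example; pass a file path to keep the catalog between sessions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents (serial INTEGER PRIMARY KEY, title TEXT NOT NULL, binder TEXT);
    CREATE TABLE tags (serial INTEGER NOT NULL REFERENCES documents(serial),
                       tag TEXT NOT NULL,
                       UNIQUE (serial, tag));
""")


def add_document(serial, title, binder, tags):
    """Record one printout and any number of tags for it."""
    conn.execute("INSERT INTO documents VALUES (?, ?, ?)", (serial, title, binder))
    conn.executemany("INSERT OR IGNORE INTO tags VALUES (?, ?)",
                     [(serial, t.lower()) for t in tags])
    conn.commit()


def find_by_tag(tag):
    """Return (serial, title, binder) for every document carrying the tag."""
    rows = conn.execute(
        "SELECT d.serial, d.title, d.binder FROM documents d "
        "JOIN tags t ON t.serial = d.serial WHERE t.tag = ? ORDER BY d.serial",
        (tag.lower(),))
    return rows.fetchall()


# Invented sample entries.
add_document(1, "CVD growth of carbon nanotubes", "Binder 12", ["nanotubes", "CVD", "methods"])
add_document(2, "Gold nanoparticle plasmonics", "Binder 3", ["nanoparticles", "optics", "methods"])
print(find_by_tag("methods"))
```

Tags are stored lowercased so that "CVD" and "cvd" find the same printouts, and the binder column tells you where to walk to once the search has found the serial number.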

Re:Reference management software (Score:2, Interesting)

by eggy78 ( 1227698 ) writes: on Wednesday April 08, 2009 @05:06PM (#27509605)

I never really enjoyed using JabRef, but have had pretty good luck with Aigaion [aigaion.nl]... a little more setup, but it's great for our lab, where everyone works from a common database of papers. It allows export to RIS, BibTeX, etc., although we do occasionally run into some errors with the LaTeX special characters and such. At least as far as our advisor is concerned, this absolutely revolutionized the way we handled our references. It's searchable, you can add keywords, your own annotations, include abstracts, and upload one or more attachments (the original paper) in whatever format you want. Technically it's an annotated bibliography that supports attachments, but it is pretty solid. One thing to note: We are still using a 1.3.x version of it; we haven't been brave enough or had the time to try the 2.x releases.

Re:Reference management software (Score:3, Interesting)

by joe 155 ( 937621 ) writes: on Wednesday April 08, 2009 @05:34PM (#27510007) Journal

I agree. I'm in the first year of my PhD and I've been making an effort to build an extensive BibTeX database because it provides everything I need in terms of references and notes. What I do is read a paper, make pretty extensive notes on it and then put them in the abstract field in JabRef, so that a search for a term covers all the text in my notes that is relevant to what I work on. I've also tried to put down some related keywords just to make sure that they're linked with the article. Then if I want to know everything I've ever read on, say, political corruption it's just a search away.

If you want to add papers you haven't digitized your notes for, you could put in the references and just a few quick keywords. Papers you don't have can be found through Google Scholar. It works OK.

I've also been impressed with Papers for OS X, but JabRef can move between systems really easily and is GPL.
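To illustrate the notes-in-the-abstract approach, here is a small Python sketch that searches the title, keywords, and abstract fields of BibTeX-style records. It is not JabRef's search and not a real BibTeX parser: the regular expressions only handle flat `field = {value}` pairs with no nested braces, and the two entries are invented.

```python
import re

# A made-up two-entry BibTeX file; the keys, titles, and notes are illustrative only.
BIBTEX = """
@article{doe2005,
  title = {Measuring Political Corruption},
  keywords = {corruption, measurement},
  abstract = {My notes: compares survey-based indices across countries.}
}
@article{roe2007,
  title = {Party Finance Reform},
  keywords = {campaign finance},
  abstract = {My notes: argues reform reduced political corruption in two cases.}
}
"""

ENTRY_RE = re.compile(r"@\w+\{([^,]+),(.*?)\n\}", re.DOTALL)
FIELD_RE = re.compile(r"(\w+)\s*=\s*\{([^{}]*)\}")


def parse_entries(text):
    """Very small parser: flat `field = {value}` pairs only, no nested braces."""
    entries = {}
    for key, body in ENTRY_RE.findall(text):
        entries[key.strip()] = {name.lower(): value for name, value in FIELD_RE.findall(body)}
    return entries


def search(entries, term, fields=("title", "keywords", "abstract")):
    """Return the citation keys whose chosen fields contain the term (case-insensitive)."""
    term = term.lower()
    return sorted(key for key, entry in entries.items()
                  if any(term in entry.get(f, "").lower() for f in fields))


entries = parse_entries(BIBTEX)
print(search(entries, "corruption"))
```

Note that the second entry is found only because the notes in its abstract field mention corruption; its keywords alone would have missed it. Restricting the search with `fields=("keywords",)` shows the difference.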

Re:Is the material copyrighted? (Score:3, Interesting)

by shaitand ( 626655 ) writes: on Wednesday April 08, 2009 @07:40PM (#27511713) Journal

According to Cornell, the limitations on copyright holders are as follows. Note that research and teaching are both explicitly stated cases of the fair use exemption.
Section 107
Permits the "fair use" of an owner's work without permission, for the purpose of "criticism, comment, news reporting, teaching, scholarship, or research." This exemption outlines four factors that must be met in order to argue a fair use.
Section 108
Permits a library or archives to reproduce works for archiving purposes, to make copies for patrons and to participate in interlibrary loan, all without permission.
Section 109
Permits individuals to lend, give or sell copies of works they own without seeking permission of the copyright holder. This is also referred to as the First Sale Doctrine.
Section 110
Permits displays of work and educational performances in face-to-face teaching and distance education. The TEACH Act expands upon the limitations in section 110.
Section 121
Permits reproduction of works without permission of the copyright holder for the blind and other people with disabilities.
http://www.copyright.gov/title17/92chap1.html#107 [copyright.gov]
That link goes to section 107 of the Copyright Act. The section lists many cases of fair use but gives 4 primary criteria for courts to consider. The first is the purpose of the use, and it makes clear that non-profit educational use is protected. I am unable to find any reference to a classroom in section 107 (not that there is reason to think the professor doesn't teach his students by having them perform or assist with research in the classroom).
