Building a Searchable Literature Archive With Keywords?
Sooner Boomer writes "I'm trying to help drag a professor I work with into the 20th century. Although he is involved in cutting-edge research (nanotechnology), his method of literature search is to begin with digging through the hundreds of 3-ring binders that contain articles (usually from PDFs) that he has printed out. Even though the binders are labeled, the articles can only go under one 'heading' and there's no way to do a keyword search on subject, methods, materials, etc. Yeah, Google is pretty good for finding stuff, as are other on-line literature services, but they only work for articles that are already on-line. His literature also includes articles copied from books, professional correspondence, and other sources. Is there a FOSS database or archive method (preferably with a web interface) where he could archive the PDFs and scanned documents and be able to search by keywords? It would also be nice to categorize them under multiple subject headings if possible. I know this has been covered ad nauseam with things like photos and the like, but I'm not looking at storage as such: instead I'm trying to find what's stored."
Re:Document Management Software and OCR (Score:3, Interesting)
Ugh... for a second there, I thought Clippy started posting to Slashdot. Would you like help with that?
I don't think he's looking for a DMS, which includes lots of things like workflows, audit trails, etc. DMSs are typically used to make an office go paperless, but he's not looking for a processing and tracking mechanism. He's looking for an easy way to create a searchable archive index.
My suggestion? Since he's a professor, get a bunch of students to "help with research" by scanning the docs and OCRing them. If he's willing to shell out a few hundred bucks from his research grant, there are services that will do this... Most of the best OCR tools are proprietary, not open source, but even a crappy one should get enough text that the OCRed files could be indexed usefully.
For an indexer, I've heard good things about MPS, and a friend did a similar project to yours with Yaz/Zebra, but he was working with a library, so there may have been a special reason for that.
Personal Document Management (Score:4, Interesting)
I am hoping that someone will make a nice personal document management package as free software.
If you use Windows, you can buy this:
http://www.nuance.com/paperport/ [nuance.com]
The basic features would be:
In a perfect world, the GNOME guys and the KDE guys would both start competing over who can make the slickest product and we all would win.
steveha
Summation (Score:2, Interesting)
Law firms use a program called Summation to do this all the time. They take all the paper docs and electronic docs in a case (sometimes tens of thousands of pages) and load them into this program as TIFFs or PDFs. They are then OCR searchable. The search isn't nearly as good as something like Google, as it is purely Boolean... but it gets the job done. Not sure about cost, but your university may have a license. An alternative is a program called Concordance, which does the same thing. One last option would be to scan everything to OCR-searchable PDFs, throw them into a folder, and set up Google Desktop to search only that folder... you could then essentially "google" the contents of all those PDFs.
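For what it's worth, the "purely Boolean" folder search described above can be sketched in a few lines of Python. This is only an illustration, not how Summation, Concordance, or Google Desktop work internally. It assumes the OCR text of each document has already been saved as a .txt file in one folder; the function names build_index and search are made up for this example.

```python
import re
from collections import defaultdict
from pathlib import Path

# Words are runs of lowercase letters and digits; everything else is a separator.
WORD = re.compile(r"[a-z0-9]+")


def build_index(folder):
    """Map each word to the set of .txt file names that contain it."""
    index = defaultdict(set)
    for path in sorted(Path(folder).glob("*.txt")):
        # Assumption: OCR output was saved as UTF-8 text; bad bytes are skipped.
        text = path.read_text(encoding="utf-8", errors="ignore").lower()
        for word in WORD.findall(text):
            index[word].add(path.name)
    return index


def search(index, query):
    """Boolean AND search: return files containing every word in the query."""
    words = WORD.findall(query.lower())
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)
```

Calling search(build_index("ocr_text"), "carbon nanotubes") would list every file in the (hypothetical) ocr_text folder that mentions both words. There is no ranking, which is the weakness of a purely Boolean search.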
Re:Document Management Software and OCR (Score:3, Interesting)
Another obvious, more practical but potentially less powerful method is simply to index all of the printouts with serial numbers and manually create a database of tags with serial numbers.
You then have a library card index system, but electronic. Sure, it won't help with documents that aren't entered properly, but it's dramatically more efficient than thumbing through PDFs until you find what you need.
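A rough sketch of that card-index idea, using Python's built-in sqlite3 module. The table layout and the function names (open_index, add_document, find_by_tags) are assumptions made up for illustration. Each printout gets a serial number and any number of tags, so one article can sit under several headings at once.

```python
import sqlite3


def open_index(path=":memory:"):
    """Open (or create) the tag database; ':memory:' keeps it in RAM."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "serial INTEGER PRIMARY KEY, title TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tags ("
        "serial INTEGER NOT NULL REFERENCES documents(serial), "
        "tag TEXT NOT NULL, UNIQUE(serial, tag))"
    )
    return conn


def add_document(conn, serial, title, tags):
    """Record one printout under its serial number with any number of tags."""
    conn.execute(
        "INSERT INTO documents (serial, title) VALUES (?, ?)", (serial, title)
    )
    conn.executemany(
        "INSERT OR IGNORE INTO tags (serial, tag) VALUES (?, ?)",
        [(serial, tag.strip().lower()) for tag in tags],
    )
    conn.commit()


def find_by_tags(conn, tags):
    """Return (serial, title) for documents carrying every requested tag."""
    wanted = sorted({tag.strip().lower() for tag in tags})
    if not wanted:
        return []
    placeholders = ",".join("?" * len(wanted))
    query = (
        "SELECT d.serial, d.title FROM documents d "
        "JOIN tags t ON t.serial = d.serial "
        f"WHERE t.tag IN ({placeholders}) "
        "GROUP BY d.serial, d.title "
        "HAVING COUNT(DISTINCT t.tag) = ? "
        "ORDER BY d.serial"
    )
    return conn.execute(query, (*wanted, len(wanted))).fetchall()
```

Passing a file name instead of ":memory:" keeps the index on disk. The serial number written on each printout is then the only link needed between the database and the binders.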
Re:Reference management software (Score:2, Interesting)
Re:Reference management software (Score:3, Interesting)
If you want to add papers whose notes you haven't digitized yet, you can put in the references and just a few quick keywords. Papers you don't have, you can find by searching Google Scholar. It works OK.
I've also been impressed with Papers for OS X, but JabRef can move between systems really easily and is GPL.
Re:Is the material copyrighted? (Score:3, Interesting)
According to Cornell, limitations on copyright holders are as follows. Note that research and teaching are both explicitly stated cases of fair use exemption.
107
Permits the "fair use" of an owner's work without permission, for the purpose of "criticism, comment, news reporting, teaching, scholarship, or research." This exemption outlines four factors that must be met in order to argue a fair use.
108
Permits a library or archives to reproduce works for archiving purposes, to make copies for patrons and to participate in interlibrary loan, all without permission.
109
Permits individuals to lend, give or sell copies of works they own without seeking permission of the copyright holder. This is also referred to as the First Sale Doctrine.
110
Permits displays of work and educational performances in face-to-face teaching and distance education. The TEACH Act expands upon the limitations in section 110.
121
Permits reproduction of works without permission of the copyright holder for the blind and other people with disabilities.
http://www.copyright.gov/title17/92chap1.html#107 [copyright.gov]
Section 107 of the Copyright Act. This section lists many cases of fair use but gives four primary criteria for courts to consider. The first is the purpose and character of the use, and it makes clear that nonprofit educational use is protected. I am unable to find any reference to a classroom in section 107 (not that there is reason to think the professor doesn't teach his students by having them perform or assist with research in the classroom).