
Building a Searchable Literature Archive With Keywords?

Sooner Boomer writes "I'm trying to help drag a professor I work with into the 20th century. Although he is involved in cutting-edge research (nanotechnology), his method of literature search is to begin by digging through the hundreds of 3-ring binders that contain articles (usually from PDFs) that he has printed out. Even though the binders are labeled, the articles can only go under one 'heading' and there's no way to do a keyword search on subject, methods, materials, etc. Yeah, Google is pretty good for finding stuff, as are other on-line literature services, but they only work for articles that are already on-line. His literature also includes articles copied from books, professional correspondence, and other sources. Is there a FOSS database or archive method (preferably with a web interface) where he could archive the PDFs and scanned documents and be able to search by keywords? It would also be nice to categorize them under multiple subject headings if possible. I know this has been covered ad nauseam with things like photos and the like, but I'm not looking at storage as such: instead I'm trying to find what's stored."
  • by Red Flayer ( 890720 ) on Wednesday April 08, 2009 @04:33PM (#27509073) Journal

    I think what you are looking for is something called "document management" software.

    Ugh... for a second there, I thought Clippy had started posting to Slashdot. Would you like help with that?

    I don't think he's looking for a DMS, which includes lots of things like workflows, audit trails, etc. DMSs are typically used to make an office go paperless, but he's not looking for a processing and tracking mechanism. He's looking for an easy way to create a searchable archive index.

    My suggestion? Since he's a professor, get a bunch of students to "help with research" by scanning the docs and OCRing them. If he's willing to shell out a few hundred bucks from his research grant, there are services that will do this... Most of the best OCR tools are proprietary, not open source, but even a crappy one should get enough text that the OCRed files could be indexed usefully.

    For an indexer, I've heard good things about MPS, and a friend did a project similar to yours with Yaz/Zebra. He was working with a library, though, so there may have been a special reason for that choice.
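
    To make the "OCR it, then index it" idea concrete, here is a minimal sketch of a keyword index over OCRed text. It is not how MPS or Zebra work internally; it only assumes the OCR step has already produced plain text per file. The file names and sample text are made up for illustration.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Map each word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in tokenize(text):
            index[word].add(name)
    return index

def search(index, query):
    """Return the documents containing every word in the query (Boolean AND)."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Hypothetical OCR output, keyed by scanned file name.
ocr_text = {
    "binder03_p12.pdf": "Synthesis of gold nanoparticles by citrate reduction",
    "binder07_p40.pdf": "Carbon nanotube growth by chemical vapor deposition",
    "letter_1998.pdf": "Regarding your question on nanoparticles and citrate",
}
index = build_index(ocr_text)
print(sorted(search(index, "citrate nanoparticles")))
```

    Even with sloppy OCR, enough words usually survive for an index like this to find the right binder page, which is the point made above about a crappy OCR tool still being useful.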

  • by steveha ( 103154 ) on Wednesday April 08, 2009 @04:38PM (#27509157) Homepage

    I am hoping that someone will make a nice personal document management package as free software.

    If you use Windows, you can buy this:

    http://www.nuance.com/paperport/ [nuance.com]

    The basic features would be:

    • Scan in a document (group multiple pages into a single PDF)
    • Easily scan a page and insert it into a pre-existing PDF (if you missed a page yesterday, today go back and put it in)
    • OCR the documents and provide an index to allow searching
    • Provide a really convenient photocopier feature (scan+print)
    • Fast and easy. Scan in color, but detect black-and-white and auto-convert to greyscale. Do not pop up any dialogs; when the user clicks on the "Scan!" button, start scanning.
    • Also allow dropping in saved HTML pages, OpenOffice.org documents, etc. Manage the user's saved documents, no matter what kind of documents they are.

    In a perfect world, the GNOME guys and the KDE guys would both start competing over who can make the slickest product and we all would win.

    steveha

  • Summation (Score:2, Interesting)

    by Anonymous Coward on Wednesday April 08, 2009 @04:40PM (#27509193)

    Law firms use a program called Summation to do this all the time. They take all the paper docs and electronic docs in a case (sometimes tens of thousands of pages) and load them into this program as TIFFs or PDFs. They are then OCR searchable. Not nearly as good a search algo as something like Google, as it is purely Boolean... but it gets the job done. Not sure about cost, but your university may have a license. An alternative is a program called Concordance, which does the same thing. One last option would be to scan everything to OCR-searchable PDFs, throw them into a folder, and set up Google Desktop to search only that folder... you could then essentially "google" the contents of all those PDFs.

  • by digitalunity ( 19107 ) <digitalunityNO@SPAMyahoo.com> on Wednesday April 08, 2009 @05:06PM (#27509599) Homepage

    Another obvious, more practical but potentially less powerful method is simply to index all of the printouts with serial numbers and manually create a database of tags with serial numbers.

    You then have a library card index system, but electronic. Sure, it won't help with documents that aren't entered properly, but it's dramatically more efficient than thumbing through PDFs until you find what you need.
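
    A rough sketch of that serial-number-plus-tags database, using SQLite from Python. The table and function names are made up for illustration, and the sample rows are placeholders, not real documents. Because tags live in their own table, one printout can sit under as many headings as you like.

```python
import sqlite3

# In-memory database for illustration; a real archive would use a file path.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents (
        serial INTEGER PRIMARY KEY,   -- number written on the printout
        title  TEXT NOT NULL
    );
    CREATE TABLE tags (
        serial INTEGER NOT NULL REFERENCES documents(serial),
        tag    TEXT NOT NULL,
        PRIMARY KEY (serial, tag)     -- one document can carry many tags
    );
""")

def add_document(serial, title, tags):
    """Register a printout and attach any number of tags to it."""
    conn.execute(
        "INSERT INTO documents (serial, title) VALUES (?, ?)", (serial, title)
    )
    conn.executemany(
        "INSERT INTO tags (serial, tag) VALUES (?, ?)",
        [(serial, tag.lower()) for tag in tags],
    )

def find_by_tag(tag):
    """Return (serial, title) pairs for every document carrying the tag."""
    rows = conn.execute(
        "SELECT d.serial, d.title FROM documents d "
        "JOIN tags t ON t.serial = d.serial "
        "WHERE t.tag = ? ORDER BY d.serial",
        (tag.lower(),),
    )
    return rows.fetchall()

# Hypothetical entries; titles and tags are placeholders.
add_document(1, "Example article A", ["AFM", "thin films"])
add_document(2, "Example article B", ["synthesis"])
add_document(3, "Example letter C", ["AFM", "correspondence"])

print(find_by_tag("afm"))
```

    The lookup returns serial numbers, so the professor still walks to the binder, but he knows which one to pull.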

  • by eggy78 ( 1227698 ) on Wednesday April 08, 2009 @05:06PM (#27509605)
    I never really enjoyed using JabRef, but have had pretty good luck with Aigaion [aigaion.nl]... a little more setup, but it's great for our lab, where everyone works from a common database of papers. It allows export to RIS, BibTeX, etc., although we do occasionally run into some errors with the LaTeX special characters and such. At least as far as our advisor is concerned, this absolutely revolutionized the way we handled our references. It's searchable, you can add keywords, your own annotations, include abstracts, and upload one or more attachments (the original paper) in whatever format you want. Technically it's an annotated bibliography that supports attachments, but it is pretty solid. One thing to note: We are still using a 1.3.x version of it; we haven't been brave enough or had the time to try the 2.x releases.
  • by joe 155 ( 937621 ) on Wednesday April 08, 2009 @05:34PM (#27510007) Journal
    I agree. I'm in the first year of my PhD and I've been making an effort to build an extensive BibTeX database because it provides everything I need in terms of references and notes. What I do is read a paper, make pretty extensive notes on it, and then put them in the abstract section of JabRef, so that when you use the search function it searches through all the text from the article that is relevant to what you work on. I've also tried to put down some related keywords just to make sure that they're linked with the article. Then if I want to know everything I've ever read on, say, political corruption, it's just a search away.

    If you wanted to add papers you've not digitized your notes for, then you could put in the references and just a few quick keywords. For papers you don't have, you can search Google Scholar to find them. It works OK.

    I've also been impressed with Papers for OS X, but JabRef moves between systems really easily and is GPL.
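
    The notes-in-the-abstract trick boils down to a case-insensitive substring search across a few fields. Here is a minimal sketch of that idea; it is not JabRef's actual search code, and it assumes the BibTeX file has already been parsed into dictionaries (the parsing itself is left out). The entry keys and contents are placeholders, not real references.

```python
SEARCH_FIELDS = ("title", "abstract", "keywords")

def matches(entry, term):
    """True if the term appears in any searchable field, ignoring case."""
    haystack = " ".join(entry.get(field, "") for field in SEARCH_FIELDS)
    return term.lower() in haystack.lower()

def search_entries(entries, term):
    """Return the citation keys of all entries that mention the term."""
    return [entry["key"] for entry in entries if matches(entry, term)]

# Hypothetical, already-parsed entries; keys and text are placeholders.
entries = [
    {
        "key": "example_a",
        "title": "Example paper A",
        "abstract": "My notes: argues that weak courts enable political corruption.",
        "keywords": "courts, accountability",
    },
    {
        "key": "example_b",
        "title": "Example paper B",
        "abstract": "My notes: survey of electoral systems.",
        "keywords": "elections",
    },
    {
        # No notes digitized yet, only a few quick keywords.
        "key": "example_c",
        "title": "Example paper C",
        "keywords": "campaign finance, corruption",
    },
]

print(search_entries(entries, "corruption"))
```

    The third entry shows the fallback described above: with no notes typed in, a couple of keywords are still enough for the paper to turn up.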
  • by shaitand ( 626655 ) on Wednesday April 08, 2009 @07:40PM (#27511713) Journal

    According to Cornell, limitations on copyright holders are as follows. Note that research and teaching are both explicitly stated cases of fair use exemption.

    107
    Permits the "fair use" of an owner's work without permission - for the purpose of "criticism, comment, news reporting, teaching, scholarship, or research." This exemption outlines four factors that must be met in order to argue a fair use.

    108
    Permits a library or archives to reproduce works for archiving purposes, to make copies for patrons and to participate in interlibrary loan - all without permission.

    109
    Permits individuals to lend, give or sell copies of works they own without seeking permission of the copyright holder. This is also referred to as the First Sale Doctrine.

    110
    Permits displays of work and educational performances in face-to-face teaching and distance education. The TEACH Act expands upon the limitations in section 110.

    121
    Permits reproduction of works without permission of the copyright holder for the blind and other people with disabilities

    http://www.copyright.gov/title17/92chap1.html#107 [copyright.gov]

    The copyright act section 107. This section lists many cases of fair use but gives 4 primary criteria for courts to consider. The first is the purpose of the work and makes it clear that non-profit educational use is protected. I am unable to find any reference to a classroom in section 107 (not that there is reason to think the professor doesn't teach his students by having them perform or assist with research in the classroom).
