Building a Searchable Literature Archive With Keywords?
Sooner Boomer writes "I'm trying to help drag a professor I work with into the 20th century. Although he is involved in cutting-edge research (nanotechnology), his method of literature search is to begin with digging through the hundreds of 3-ring binders that contain articles (usually from PDFs) that he has printed out. Even though the binders are labeled, the articles can only go under one 'heading' and there's no way to do a keyword search on subject, methods, materials, etc. Yeah, Google is pretty good for finding stuff, as are other on-line literature services, but they only work for articles that are already on-line. His literature also includes articles copied from books, professional correspondence, and other sources. Is there a FOSS database or archive method (preferably with a web interface) where he could archive the PDFs and scanned documents and be able to search by keywords? It would also be nice to categorize them under multiple subject headings if possible. I know this has been covered ad nauseam with things like photos and the like, but I'm not looking at storage as such: instead I'm trying to find what's stored."
Reference management software (Score:3, Insightful)
Cheap scanner, expensive OCR software (Score:5, Insightful)
Most high-end consumer all-in-one printers come with an ADF (automatic document feeder) capable of handling most types of paper, as long as it's not crumpled, stapled or the like. Some of the more expensive ones can do two-sided scanning to a network repository. I work with consumer-level HP printers, and the Officejet Pro L7xxx series does this. The Pro L7680 is US$200 at Newegg.com.
Now, while that printer comes with some okay OCR software, it's basically thrown in for free. A lot of the material in the kind of documents you're talking about is going to be math-heavy, mixed in with images, graphs, tables and personal notes. I don't know of any OCR software that will transform that into exact replicas via LaTeX or the like, but I'm pretty sure the really expensive OCR software will translate the written text, reproduce the rest as images, and neatly turn it all into easily searchable PDF documents.
That brings you from paper to searchable PDF files. Categorizing those is probably not all that hard. I'd suspect you could do some text analysis and break each document down into a list of technical terms and the number of times they're used.
A document that uses the Casimir effect in a single example is probably not a document related to that specific field, whereas documents that talk about it repeatedly, referencing known articles on the subject etc., are. Sorting that out ... beyond my knowledge.
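A rough sketch of that term-counting idea, in Python. The term list and the threshold of three mentions are made up for illustration; a real vocabulary would have to come from the professor's own binder headings, and none of this handles citations.

```python
import re
from collections import Counter

# Hypothetical vocabulary, for illustration only.
TERMS = ["casimir effect", "nanotube", "quantum dot"]

def term_counts(text, terms=TERMS):
    """Count case-insensitive occurrences of each term (optional plural 's')."""
    lowered = text.lower()
    counts = Counter()
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"s?\b"
        hits = len(re.findall(pattern, lowered))
        if hits:
            counts[term] = hits
    return counts

def suggest_headings(text, min_hits=3, terms=TERMS):
    """Return terms mentioned at least min_hits times, most frequent first."""
    counts = term_counts(text, terms)
    return [term for term, n in counts.most_common() if n >= min_hits]
```

A document that mentions a term once in passing stays below the threshold, so it would not be filed under that heading.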
I'd suggest you start out with an experiment. Take a "typical" page from the binders and scan it to an uncompressed image at a decent resolution (e.g. TIFF). We usually recommend around 300 dpi for OCR - beyond that you start picking up things that we don't really look for when we're reading.
Test that page against various OCR software and see what each one reproduces as output. Pick the one that gives the best result.
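One way to make "best result" less subjective is to type the test page in by hand once and score each program's output against it. This is only a sketch: the engine names are placeholders, and the similarity ratio from Python's difflib is a stand-in for a proper character error rate.

```python
import difflib

def similarity(reference, ocr_output):
    """Return a 0..1 similarity ratio between hand-typed text and OCR output."""
    # Collapse whitespace so line-break differences are not counted as errors.
    ref = " ".join(reference.split())
    out = " ".join(ocr_output.split())
    # autojunk=False keeps frequent characters from being ignored on long pages.
    return difflib.SequenceMatcher(None, ref, out, autojunk=False).ratio()

def rank_engines(reference, outputs):
    """outputs maps an engine name to its OCR text; the best engine comes first."""
    scored = [(similarity(reference, text), name) for name, text in outputs.items()]
    return sorted(scored, reverse=True)
```

Feed it the text each candidate program produced for the same scanned page and look at the top of the list.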
And don't worry - the OCR software is going to be the single most expensive purchase in this equation. I am however more than ready to be proven wrong in that regard.
Re:Document Management Software and OCR (Score:4, Insightful)
Re:Is the material copyrighted? (Score:3, Insightful)
Copying excerpts for educational use is actually an explicitly protected fair use case. The copyright act actually uses it as an example if I remember correctly.
The parent said he is copying parts of texts, not entire books.
Papers for Mac OS X and iPhone (Score:3, Insightful)
For Mac OS X, try Papers [mekentosj.com]. There's also an iPhone/iPod Touch version. Mac OS is great at handling PDFs in general.
DevonThink on a Mac (Score:3, Insightful)
If your professor uses a Mac, consider DevonThink by DevonTechnologies.
http://www.devon-technologies.com/products/devonthink/index.html [devon-technologies.com]
For searching, the software has an artificial intelligence system, keywords, and metadata. It can store PDFs, Word docs, emails, notes. It can be integrated with a scanner so you can scan and store documents in the database. It's got OCR built in...
I have DevonThink (personal edition, not Pro/Office) and I don't even use 1/10 of the power built into this system. You should check out some of the reviews online and videos of people using DevonThink.
Re:Document Management Software and OCR (Score:3, Insightful)
That depends. If all he needs the OCR for is to build a searchable keyword index, then the error rate can be quite high and he'll still get good results. The results of a query should point to the original PDF, not the result of the OCR.
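A minimal sketch of such an index, assuming the OCR text for each PDF has already been extracted into a dictionary keyed by file path (the paths here are invented). The noisy text is only used to build the word list; every hit is a path to the original PDF.

```python
import re
from collections import defaultdict

def build_index(ocr_text_by_pdf):
    """Map each word to the set of PDF paths whose OCR text contains it."""
    index = defaultdict(set)
    for pdf_path, text in ocr_text_by_pdf.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(pdf_path)
    return index

def search(index, query):
    """Return paths of PDFs containing every query word, sorted by path."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)
```

Because an important term usually appears many times in a paper, one misread occurrence does not stop the document from being found.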
wow.. (Score:4, Insightful)
About 2 years ago I bought a Fujitsu fi-5120C duplex scanner for my small business. It came with Adobe Acrobat.
I scan every bill, correspondence, notice, and everything else to PDF, then I throw it the hell away.
The version of Acrobat included does OCR. I open Acrobat, choose create PDF from scanner, and scan away.
I can mix a scan job up between B&W and color, or duplex and simplex, within one job.
I can open an existing PDF and append to it.
I save everything to an Infrant NAS box.
I can go to Windows search, type in 1179.21 (actually did this one once),
set it to look INSIDE the files of that directory, and get results that include
a soda delivery notice, a soda invoice, and my bank statement where I paid it off.
They have other model scanners that combine sheetfed+flatbed...
Here is a beauty:
http://www.fujitsu.com/us/services/computing/peripherals/scanners/workgroup/fi-6230.html [fujitsu.com]
Suggestion (Score:4, Insightful)
I wrote and maintain a project to do this:
http://sourceforge.net/projects/docdb-v/ [sourceforge.net]
"DocDB is a powerful and flexible collaborative web based document server which maintains a versioned list of documents. Information maintained in the database includes, author(s), title, topic(s), abstract, access restriction information, etc."
It's intended for collaborations, but groups from 5 to 500 use it.
Digital Archive software (Score:2, Insightful)
Why Not To (Score:5, Insightful)
There are at least two reasons the professor's method is beneficial:
1. By having to search by hand and scan by eye, he becomes more familiar with more of what's actually in the papers. His familiarity with the material gets better.
2. Repetitive scanning/searching of the papers leads to the mind partially wandering while doing so. This can result in inspiration and intuitive leaps.
Both methods together are preferable. But good luck on getting the professor to use them. You may have better luck getting him to create his own indices or tables of contents on paper to put in the binders. With his familiarity it shouldn't be too difficult.
I wrote a few articles about this (Score:3, Insightful)
I wrote a few articles about this for Law Office Computing magazine
http://www.nasw.org/users/nbauman/txtsrch.htm [nasw.org]
http://www.nasw.org/users/nbauman/lawdb.htm [nasw.org]
http://www.nasw.org/users/nbauman/discover.htm [nasw.org]
It was a long time ago, and the software and hardware have changed, but the concepts are still the same, and the costs are a lot less.
Free text search works reasonably well with small databases, but it doesn't work with big databases. If you want precision, you have to develop a set of tags (we called them keywords). A good model is PubMed http://www.ncbi.nlm.nih.gov/sites/entrez?db=pubmed [nih.gov]. The New York Times used to have a great text search, but they changed it (eliminated tags) and now it's awfully difficult to get through.
Basically, the researcher has a body of knowledge, and he already has a filing and organizing system (in this case, a looseleaf binder system, which is a pretty good start). You should usually try to replicate his filing and organizing system in the database, for example one field for the looseleaf, another field for the tab, and then some goodies that he couldn't search the looseleafs for, like date, author, journal citation, etc. It would probably be useful to have a controlled vocabulary of a few good keywords, but keywords should be selected carefully so they're unique and don't duplicate.
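To make that concrete, here is one possible layout using Python's built-in sqlite3 module. The table and column names are my own invention, not taken from any particular product: one row per document with the binder and tab it lives in, plus a separate keyword table so a document can sit under several headings and each keyword is stored only once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for a real archive
conn.executescript("""
CREATE TABLE document (
    id INTEGER PRIMARY KEY,
    binder TEXT NOT NULL,      -- which looseleaf binder
    tab TEXT NOT NULL,         -- which tab inside it
    author TEXT,
    year INTEGER,
    citation TEXT
);
CREATE TABLE keyword (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE  -- controlled vocabulary, no duplicates
);
CREATE TABLE document_keyword (
    document_id INTEGER NOT NULL REFERENCES document(id),
    keyword_id INTEGER NOT NULL REFERENCES keyword(id),
    PRIMARY KEY (document_id, keyword_id)
);
""")

def add_document(binder, tab, author, year, citation, keywords):
    """Insert one document and attach any number of keywords to it."""
    cur = conn.execute(
        "INSERT INTO document (binder, tab, author, year, citation) "
        "VALUES (?, ?, ?, ?, ?)",
        (binder, tab, author, year, citation),
    )
    doc_id = cur.lastrowid
    for name in keywords:
        name = name.strip().lower()  # normalize so 'AFM' and 'afm' are one keyword
        conn.execute("INSERT OR IGNORE INTO keyword (name) VALUES (?)", (name,))
        kw_id = conn.execute(
            "SELECT id FROM keyword WHERE name = ?", (name,)
        ).fetchone()[0]
        conn.execute(
            "INSERT OR IGNORE INTO document_keyword VALUES (?, ?)", (doc_id, kw_id)
        )
    return doc_id

def find_by_keyword(name):
    """Return (binder, tab, citation) for every document tagged with name."""
    return conn.execute(
        "SELECT d.binder, d.tab, d.citation FROM document d "
        "JOIN document_keyword dk ON dk.document_id = d.id "
        "JOIN keyword k ON k.id = dk.keyword_id "
        "WHERE k.name = ? ORDER BY d.binder, d.tab",
        (name.strip().lower(),),
    ).fetchall()
```

A search then tells him which binder and tab to pull off the shelf, which is all the lookup he needs if the paper copy is staying put.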
I assume he doesn't have the PDFs any more. That would have made it a lot easier.
It would be handy to scan every word of (most) every document into full text, but it may not be necessary. Why do you need everything in full digital text? Scanning of unconventional text takes a human proofreading step, and probably isn't worth it.
He'll probably want to keep complete images of the original documents anyway along with the text. You should do a few tests to see how much resolution you need. 600 dpi works for ordinary text like they use in the printed newspaper. But if you want journal articles to come out, with footnotes and superscripts, you might need higher resolution.
Somebody is going to have to enter the fields manually, which isn't too bad if you've got a thousand records (looseleaf tabs) to enter (about 20 hours), but can get difficult if you've got an order of magnitude more.
Scanning should be straightforward, if everything is neatly filed away in looseleaf books already. There are many cheap consumer-grade scanners on the market that can get 600-2400 dpi (the bundled software is probably more important than the hardware specs) but they can take up to 1 minute a page; there are more expensive scanners in the $1,000-and-up range that can go a lot faster. If you're at a university, look around for somebody who already has one. Law firms and libraries do a lot of this.
You might start by estimating the number of pages and documents you have.
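Once you have those counts, the arithmetic is simple. The sketch below just turns my rough figures into functions: 20 hours per thousand records works out to 72 seconds a record, and a slow consumer scanner is taken at its worst case of 60 seconds a page. Both rates are estimates, so plug in your own after timing a small batch.

```python
def hours_for_entry(records, seconds_per_record=72):
    """Manual data entry time; 72 s/record is 20 hours per 1,000 records."""
    return records * seconds_per_record / 3600

def hours_for_scanning(pages, seconds_per_page=60):
    """Scanning time at a worst-case consumer-scanner rate of 1 page per minute."""
    return pages * seconds_per_page / 3600

print(hours_for_entry(1000))     # 20.0
print(hours_for_entry(10000))    # 200.0
print(hours_for_scanning(6000))  # 100.0
```

The 6,000 pages in the last line is an arbitrary example, not a guess at the size of his collection.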
But let me suggest an alternative: Instead of scanning everything, just enter everything into a database without scanning it. Does he really need full text search? Or would it be enough to search his looseleaf books by a dozen fields? He doesn't have to print the document out from an image file, it's right there in his looseleaf books.
If anybody knows of up-to-date articles on this subject, I'd love to know the citation.
Re:Document Management Software and OCR (Score:3, Insightful)
It sounds like "going paperless" is exactly what he wants. It so happens that his office deals in research materials, but that doesn't really change the objective: stop using unwieldy and resource-intensive paper documents in favor of highly indexed digital ones.
Out of curiosity, have you considered a wikiesque system... autolinking titles and keywords between articles and some sort of glossary (but not the "let's have everybody able to edit it because everything is an opinion and all opinions are equally invalid" communal happy horseshit)? It might not be any more useful to the professor as a research tool, but it could certainly be useful as an educational tool, particularly for undergrads who might not remember every term and principle right off the top of their heads or know off-hand what else they should be looking at to help grok what they're reading.
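For what it's worth, the autolinking part is not much code. This is a sketch only: the glossary terms are invented, and the double-bracket output assumes a MediaWiki-style link syntax that whatever wiki you pick may or may not share.

```python
import re

def autolink(text, glossary):
    """Wrap each glossary term found in text in [[...]] wiki-link brackets."""
    if not glossary:
        return text
    # Longest terms first so "carbon nanotube" wins over plain "nanotube".
    terms = sorted(glossary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: "[[" + m.group(1) + "]]", text)
```

Run over each article's text or abstract, it would give undergrads a click-through to the glossary entry for any term they don't recognize.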