Building a Searchable Literature Archive With Keywords?



Posted by timothy on Wednesday April 08, 2009 @04:20PM from the must-be-in-here-somewhere dept.

Sooner Boomer writes "I'm trying to help drag a professor I work with into the 20th century. Although he is involved in cutting-edge research (nanotechnology), his method of literature search is to begin with digging through the hundreds of 3-ring binders that contain articles (usually from PDFs) that he has printed out. Even though the binders are labeled, the articles can only go under one 'heading' and there's no way to do a keyword search on subject, methods, materials, etc. Yeah, Google is pretty good for finding stuff, as are other on-line literature services, but they only work for articles that are already on-line. His literature also includes articles copied from books, professional correspondence, and other sources. Is there a FOSS database or archive method (preferably with a web interface) where he could archive the PDFs and scanned documents and be able to search by keywords? It would also be nice to categorize them under multiple subject headings if possible. I know this has been covered ad nauseam with things like photos and the like, but I'm not looking at storage as such: instead I'm trying to find what's stored."

This discussion has been archived. No new comments can be posted.


Document Management Software and OCR (Score:5, Informative)

by eldavojohn ( 898314 ) * writes: <eldavojohn@noSpAM.gmail.com> on Wednesday April 08, 2009 @04:21PM (#27508859) Journal

I think what you are looking for is something called "document management [google.com]" software. As far as FOSS goes, KnowledgeTree [knowledgetree.com] offers a community version that might be down your alley. They have an online demo if you're interested. There's also Alfresco [alfresco.com] but I haven't tried either of these.

From the sound of it, you want to verify that your product supports document tagging (not unlike Slashdot's tagging system I guess) so that he can attach his categories to documents as he puts them in (or more likely as you do the manual labor, right?).

... where he could archive the PDFs and scanned documents and be able to search by keywords?
So, my big concern is the part where you said he scans things from books and articles and so some of the PDFs might just be massive images, right? I don't think you're going to find systems with OCR built in so you might have quite the chore on your hands. If you don't have it electronically or if it's just an image electronically, you may have to implement some sort of process for getting a doc into this system so it can be searched, right? Look into GOCR [sourceforge.net] or Tesseract [linuxjournal.com] if this is the case.
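A rough sketch of the "process for getting a doc into this system" mentioned above, assuming the scans are saved as TIFF files and the Tesseract command-line program is installed. The folder layout and the function names are made up for illustration; the only Tesseract usage relied on is the basic `tesseract <image> <output base> -l <lang>` form.

```python
import shutil
import subprocess
from pathlib import Path


def tesseract_command(image, out_dir, lang="eng"):
    """Build the command line: tesseract <image> <output base> -l <lang>."""
    out_base = Path(out_dir) / Path(image).stem  # Tesseract appends ".txt" itself
    return ["tesseract", str(image), str(out_base), "-l", lang]


def ocr_folder(scan_dir, out_dir):
    """OCR every .tif file in scan_dir; return the text files that were written."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for image in sorted(Path(scan_dir).glob("*.tif")):
        # check=True raises CalledProcessError if Tesseract fails on a page
        subprocess.run(tesseract_command(image, out_dir), check=True)
        written.append(out_dir / (image.stem + ".txt"))
    return written
```

The resulting text files are only raw material for a search index; the scanned images stay the copies of record.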

Also, judging by your nickname ("Sooner Boomer"), you're at the University of Oklahoma. Why in the world would you name yourself after a group of people [wikipedia.org] who not only disobeyed the Indian Appropriation Act but also moved out onto Native American territory before it was officially declared property of the United States? And then you also chose "Boomer" which refers to "white settlers who believed the Unassigned Lands were public property and open to anyone for settlement, not just Indian tribes. Their reasoning came from a clause in the Homestead Act of 1862, which said that any settler could claim 160 acres of public land. Some boomers entered and were removed more than once by the United States Army." If you are a descendant of either a Sooner or a Boomer, I respectfully do not agree with their actions.

- Re: (Score:2)
  
  by CyberLord Seven ( 525173 ) writes:
  
  I wish I had mod-points.
  Very well done and informative on all counts.
  - Re: (Score:3, Interesting)
    
    by digitalunity ( 19107 ) writes:
    
    Another obvious, more practical but potentially less powerful method is simply to index all of the printouts with serial numbers and manually create a database of tags with serial numbers.
    You then have a library card index system, but electronic. Sure, it won't help with documents that aren't entered properly, but it's dramatically more efficient than thumbing through PDFs until you find what you need.
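The serial-number-plus-tags idea above could look something like this with SQLite, which ships with Python. This is only a sketch: the table layout and the function names are assumptions, not something from the comment.

```python
import sqlite3


def open_index(path=":memory:"):
    """Open (or create) the card index; ':memory:' keeps it in RAM for testing."""
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS documents (
            serial INTEGER PRIMARY KEY,   -- number written on the printout
            title  TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS tags (
            serial INTEGER NOT NULL REFERENCES documents(serial),
            tag    TEXT NOT NULL,
            UNIQUE (serial, tag)
        );
    """)
    return db


def add_document(db, serial, title, tags):
    """Record one printout and any number of tags for it."""
    db.execute("INSERT INTO documents (serial, title) VALUES (?, ?)", (serial, title))
    db.executemany(
        "INSERT OR IGNORE INTO tags (serial, tag) VALUES (?, ?)",
        [(serial, t.strip().lower()) for t in tags],
    )
    db.commit()


def find_by_tags(db, tags):
    """Return (serial, title) for documents carrying ALL of the given tags."""
    wanted = sorted({t.strip().lower() for t in tags})
    if not wanted:
        return []
    placeholders = ",".join("?" for _ in wanted)
    rows = db.execute(
        f"""
        SELECT d.serial, d.title
        FROM documents d JOIN tags t ON t.serial = d.serial
        WHERE t.tag IN ({placeholders})
        GROUP BY d.serial, d.title
        HAVING COUNT(DISTINCT t.tag) = ?
        ORDER BY d.serial
        """,
        (*wanted, len(wanted)),
    )
    return rows.fetchall()
```

Because tags live in their own table, one article can sit under as many headings as needed, which is the thing the binders can't do.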
- Re:Document Management Software and OCR (Score:5, Funny)
  
  by qoncept ( 599709 ) writes: on Wednesday April 08, 2009 @04:26PM (#27508959) Homepage
  
  If you are a descendant of either a Sooner or a Boomer, I respectfully do not agree with their actions.
  Except he's not. He just prematurely ejaculates. And he'd gone all this time with no one drawing attention to it as you just have.
  
- Re: (Score:2)
  
  by WillKemp ( 1338605 ) writes:
  
  Look into GOCR [sourceforge.net] or Tesseract [linuxjournal.com] if this is the case.
  Unless you're really lucky with the images, OCR requires a lot of work correcting errors. It would probably be less work to just be able to add searchable tags to whatever system is used to store the PDFs and leave them as images.
  - Re:Document Management Software and OCR (Score:4, Insightful)
    
    by Shadow Wrought ( 586631 ) * writes: <shadow.wrought@g ... minus herbivore> on Wednesday April 08, 2009 @04:39PM (#27509173) Homepage Journal
    
    OCR certainly requires work if you need it to be completely accurate. In practice, speaking as a paralegal who's overseen the OCR'ing of millions of pages, it's just not a reasonable expectation. If you can supplement it with coding, in this case keyword tags, date, author, publication and title would build a pretty strong database. If he's looking to do that already, then whatever OCR you get is gravy. Some is better than none.
    
    - Re: (Score:2)
      
      by digitalunity ( 19107 ) writes:
      
      OCR, in my experience, is also crap with equations and technical literature in general. Its linguistic fuzzy matching changes technical words it doesn't recognize into similar words it does, on the basis that well shit it might have just not scanned very well.
      Not much way around this.
      - Re: (Score:3, Funny)
        
        by NoobixCube ( 1133473 ) writes:
        
        What he needs is a bunch of undergrads or interns to painstakingly transcribe and proofread every scrap and napkin of text!
        
        Re: (Score:2)
        
        by Sooner Boomer ( 96864 ) writes:
        
        You've made the same comments that a lot of other people have. In the university, and especially in the sciences and engineering fields, there are two "types" of professors: those that are teachers and do research, and those that only do research. The teachers are the ones that are on tenure track and will become a "line item" or continuously funded position. The professors that only do research are dependent on grant monies for their salaries and are "soft money" or non-line item funded. They may or ma
      - Re: (Score:2)
        
        by Shadow Wrought ( 586631 ) * writes:
        
        Very much so. And, if it's photocopies of photocopies, then you start running into other OCR issues as well. Like anything else it's just a tool which is sometimes helpful, but has its limitations. Depending on the volume though I'd think it would be easy enough even in Access or the OO equivalent to create a coding template with the Binder number and tab (assuming it's already so organized) and then the author, title, date, publication, and key words.
        
        I'm sure there's an undergrad somewhere who'd like a
  - Re: (Score:3, Insightful)
    
    by shaitand ( 626655 ) writes:
    
    That depends, if all he needs the OCR for is to build a searchable keyword index then the error rate can be quite high and still get good results. The results of a query should point to the original PDF, not the result of the OCR.
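To illustrate why a high OCR error rate can still give usable search results, here is a small sketch of an index that maps OCR'd words to the original PDF paths and tolerates near-miss spellings. It uses `difflib.get_close_matches` from the standard library; the 0.8 similarity cutoff and the sample file names are arbitrary assumptions.

```python
import re
from collections import defaultdict
from difflib import get_close_matches


def build_index(ocr_text_by_pdf):
    """Map each lowercased word to the set of PDF paths whose OCR text has it."""
    index = defaultdict(set)
    for pdf_path, text in ocr_text_by_pdf.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(pdf_path)
    return index


def search(index, keyword, cutoff=0.8):
    """Return original PDF paths for the keyword, allowing misread spellings."""
    candidates = get_close_matches(keyword.lower(), list(index), n=10, cutoff=cutoff)
    hits = set()
    for word in candidates:
        hits |= index[word]
    return sorted(hits)
```

A query for "nanotube" would still find a page where the OCR read "nanotuhe", and what comes back is the path to the scan, never the OCR text itself.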
  - Re: (Score:2)
    
    by Hognoxious ( 631665 ) writes:
    
    Wouldn't it be even less work to get somebody else (TAs or some other form of free labour that needs to be on the prof's good side) to do it?
- Re: (Score:3, Interesting)
  
  by Red Flayer ( 890720 ) writes:
  
  I think what you are looking for is something called "document management" software.
  Ugh... for a second there, I thought Clippy started posting to Slashdot. Would you like help with that?
  
  I don't think he's looking for a DMS, which includes lots of things like workflows, audit trails, etc. DMSs are typically used to make an office go paperless, but he's not looking for a processing and tracking mechanism. He's looking for an easy way to create a searchable archive index.
  
  My suggestion? Since he's a profe
  - Re: (Score:3, Insightful)
    
    by Miseph ( 979059 ) writes:
    
    It sounds like "going paperless" is exactly what he wants. It so happens that his office deals in research materials, but that doesn't really change the objective: stop using unwieldy and resource intensive paper documents in favor of highly indexed digital ones.
    Out of curiosity, have you considered a wikiesque system... autolinking titles and keywords between articles and some sort of glossary (but not the "let's have everybody able to edit it because everything is an opinion and all opinions are equally i
- Comment removed (Score:4, Interesting)
  
  by account_deleted ( 4530225 ) writes: on Wednesday April 08, 2009 @04:45PM (#27509283)
  
  Comment removed based on user account deletion
  
- OCR is awful (Score:2)
  
  by shaitand ( 626655 ) writes:
  
  OCR is pretty nasty stuff and it doesn't work very well at all. It's probably worth saying that the OCR results should probably only be used to generate your index and keywords.
  Actually accessing the document should show the original PDF, not the error riddled OCR scan of it.
- wow.. (Score:4, Insightful)
  
  by way2trivial ( 601132 ) writes: on Wednesday April 08, 2009 @05:01PM (#27509527) Homepage Journal
  
  2 years? ago I bought for my small business a Fujitsu fi-5120C duplex scanner--it came with Adobe Acrobat
  I scan every bill, correspondence, notice, and everything to PDF- then I throw it the hell away.
  the version of Acrobat included does OCR-I open Acrobat, choose create PDF from scanner, and scan away.
  I can mix a scan job up between B&W & color or duplex or simplex within one job
  I can open an existing PDF and append to it
  I save everything to an Infrant NAS box.
  I can go to Windows search, type in 1179.21 (actually did this one once)
  set to look INSIDE the files of that directory and get results that include
  a soda delivery notice, a soda invoice, and my bank statement where I paid it off
  they have other model scanners that combine sheetfed+flatbed...
  here is a beauty
  http://www.fujitsu.com/us/services/computing/peripherals/scanners/workgroup/fi-6230.html [fujitsu.com]
  
  - The bigger problem is not OCR, it's which scanner. (Score:3, Informative)
    
    by Futurepower(R) ( 558542 ) writes:
    
    We've done considerable research on the problem of scanning documents, also, and came to the same conclusion: The new Fujitsu fi-6130 [newegg.com] seems excellent, although we haven't tried it. That model and the 6230 are new, and there is some evidence that waiting for the second version of those models would be a good idea.
    
    The big attraction of the fi-6130 is its speed: 40 pages per minute.
    
    If you are interested, I suggest you download the manual [fujitsu.com]. (PDF)
    
    The manual talks about a connection for an "imprinter", wh
    - Re:The bigger problem is not OCR, it's which scann (Score:2, Informative)
      
      by Angstroman ( 747480 ) writes:
      
      I have been using an fi-6130 for several months now. It is quite simply the best scanner I have used. It is fast, highly reliable and very seldom misfeeds (1 per 500-800 pages in my experience). I use it for scanning archival financial records and also for technical papers. It includes a copy of Kofax Virtual ReScan, which does a great job of creating readable 1-bit monotone scans of originals with colored backgrounds. There are a number of possible target formats, and it has several automated ways of handl
      - Re: (Score:2)
        
        by rhsanborn ( 773855 ) writes:
        
        I work with electronic medical records and we have found Fujitsu scanners to be top-notch. Fast, reliable, and generally affordable. We've also used some of the larger production scanners from Kodak and Bowe Bell and Howell. They are solid scanners, but are more expensive and haven't taken the beating we tend to give the Fujitsus.
        
        We just replaced an old 3097 with the 6130 and are waiting to hear how it holds up. Note, most of these scanners are deployed to remote scan areas in the hospital where it is the
- Re: (Score:3, Informative)
  
  by burki ( 32245 ) writes:
  
  For an Open Source DMS that generates searchable PDF Files, try ArchivistaBox: http://sourceforge.net/projects/archivista/ [sourceforge.net]
  Tesseract (including Fraktur / black-letter recognition) and the Linux port of Cuneiform (BSD licence) OCR engines are used for text recognition. The hocr2pdf module (see http://www.exactcode.de/ [exactcode.de]) is used to generate the searchable PDF files.
  (http://sourceforge.net/forum/forum.php?forum_id=868471)
  - Re: (Score:2, Informative)
    
    by RicRoc ( 41406 ) writes:
    
    Archivista seems to be a solution worth looking into for him. I guess he has to try and install the software and test it himself, because the explanation on the (English) website is almost incomprehensible -- I understand spoken German, but mixing German and English, ugh!
    I work with Alfresco, which is a nice DMS, but without the integrated OCR that Archivista seems to provide. Alfresco can integrate with various OCR solutions though, has a very active community -- and a comprehensible website!
- Re: (Score:2, Insightful)
  
  by snspdaarf ( 1314399 ) writes:
  
  And, since what the Boomers and Sooners did was about 100 years ago, slamming someone for their Slashdot nickname helps with their current problem how? The Chickasaw side of my family has owned the same land since before statehood, and the German side was in on the Land Run, and we don't get all fired up about what the people were doing back then. The past is past. Learn to relax a little.
  - Re: (Score:2)
    
    by YourExperiment ( 1081089 ) writes:
    
    He wasn't criticising the guy for the actions of his ancestors, he was criticising his choice of username, which he believes glorifies the actions of those people. It's sorta like picking a username like "AdolfHitler666". Except that this guy might just be a Battlestar Galactica fan with a predilection for oriental chicks.
    - Re: (Score:2)
      
      by snspdaarf ( 1314399 ) writes:
      
      That's the point. It's a user name. One based (probably) on the words to a college fight song rather than the actions of two small groups of people in the past. Why beat him up about it?
      Nice Godwin, but I would venture to say there are very few college fight songs that reference Hitler, or Satan, so it's not really a good parallel.
      - Re: (Score:2)
        
        by YourExperiment ( 1081089 ) writes:
        
        Not much point me trying to Godwin a thread when you go and reply anyway though. :)
- Re:Document Management Software and OCR (Score:5, Insightful)
  
  by electrons_are_brave ( 1344423 ) writes: on Wednesday April 08, 2009 @08:25PM (#27512175)
  
  As an ex-librarian, I can give you a professional's answer. You need a professional. But if that's not possible, then what you are aiming for is a dream, and a huge data entry task to boot. And you will be creating a system that he will never be able to maintain. Aim lower. Ask him: does he want to keep the paper copies or move them all onto computer? Not both.
  
  If he wants to keep the paper, it's simple. Weed, weed, weed. 60% of what anyone holds is rubbish, and if it's available online (and I mean in a proper source, not a disappearing link) he'll find it when he needs it. (I'm thinking he can't be using much of it given the difficulty of finding it.) So that will leave you with about 20 three-rings out of the hundreds. Number each document, put them in a filing cabinet by MAIN SUBJECT. If you want to spend your life typing then, by all means, use incite, the word referencing system or some simple library freeware to create a db with author, title, journal etc and main subject (or maybe two).
  
  If he wants them all digital, same deal. Scan the ones that aren't there. Forget any sort of magic software that will catalogue for you, you crazy dreamer. The best you can do is use incite or some other referencing software to search for and make a record of the ones that have the record available online. And then type the rest in.
  
  Personally, he sounds like a hoarder, so he will probably resist both suggestions. If this is the case then sort the folders into main subject and type a list (bib reference) and stick it to the front of each. At least that will cut down on his search time, but again, it's a lot of typing.
  
  - Re: (Score:2, Insightful)
    
    by electrons_are_brave ( 1344423 ) writes:
    
    BTW - my above answer is based on the assumption that he has no money to spend on getting this done.
- OCR, yes, but does he need management (Score:2)
  
  by N Monkey ( 313423 ) writes:
  
  I think what you are looking for is something called "document management [google.com]" software.
  ... where he could archive the PDFs and scanned documents and be able to search by keywords?
  I agree with the OCR requirement, but if he just needs to search the resulting PDFs, wouldn't DocSearcher [sourceforge.net] do the job for him? I've found it trivial to set up and run and it's certainly helped me keep track of docs etc.
fox? (Score:5, Funny)

by SnarfQuest ( 469614 ) writes: on Wednesday April 08, 2009 @04:26PM (#27508939)

I'm trying to help drag a professor I work with into the 20th century
Maybe after that, you should try to bring him into the 21st century. You know, the one where PDFs exist?

- Re:fox? (Score:5, Funny)
  
  by fuzzyfuzzyfungus ( 1223518 ) writes: on Wednesday April 08, 2009 @04:34PM (#27509089) Journal
  
  PDF has been around since 1993. That's what, six months or so after we switched from coal-fired data furnaces to vacuum tubes, right?
  
- Re: (Score:3, Funny)
  
  by Hognoxious ( 631665 ) writes:
  
  Maybe wait for the 22nd. If we're lucky, by then it won't suck. But you may still have to wait for the Hurd port.
OWL (Score:2)

by LWATCDR ( 28044 ) writes:

http://owl.anytimecomm.com/ [anytimecomm.com]
we use this at my office. Works well for us.
Try Papers (Score:4, Informative)

by matt4077 ( 581118 ) writes: on Wednesday April 08, 2009 @04:26PM (#27508955) Homepage

Papers [mekentosj.com] is Mac software that does exactly what you need, and does it very well. Unfortunately it's Mac-only and not web-based, but you can probably find out there what the right terms to google for are.

Reference management software (Score:3, Insightful)

by ckthorp ( 1255134 ) writes: on Wednesday April 08, 2009 @04:26PM (#27508957)

I've had good luck with JabRef, which uses a BibTeX database on the backend so it integrates very well with LaTeX.

- Re: (Score:2, Interesting)
  
  by eggy78 ( 1227698 ) writes:
  
  I never really enjoyed using JabRef, but have had pretty good luck with Aigaion [aigaion.nl]... a little more setup but it's great for our lab, where everyone works from a common database of papers. It allows export to RIS, BibTex, etc. although we do occasionally run into some errors with the LaTeX special characters and such. At least as far as our advisor is concerned, this absolutely revolutionized the way we handled our references. It's searchable, you can add keywords, your own annotations, include abstracts, and
- Re: (Score:3, Interesting)
  
  by joe 155 ( 937621 ) writes:
  
  I agree. I'm in the first year of my PhD and I've been making an effort to build an extensive bibtex database because it provides everything I need in terms of references and notes. What I do is read a paper, make pretty extensive notes on it and then put them in the abstract section of Jabref so that when you use the search function for terms it searches through all the relevant text in the article for what you work on. I've also tried to put down some keywords which are related just to make sure that they
  - Re: (Score:2, Informative)
    
    by cahkaylahlee ( 1298367 ) writes:
    
    You can also set your preferences in Google Scholar to provide a link to the paper's citation in BibTeX formatting. It isn't always 100% accurate, but it does save a lot of data entry time.
  - Re: (Score:2)
    
    by rhsanborn ( 773855 ) writes:
    
    If you wouldn't mind satiating my curiosity. In what format do you take notes? Do you keep the original copy of the paper? Do you link, in some way, a specific point or note to a specific place in the paper?
    
    I'm just digging into my masters and haven't had the best text processing skills, so I'm interested in how others do it.
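For anyone curious what the BibTeX approach in this sub-thread amounts to, here is a toy model of it: each reference is a set of fields, reading notes go in the `abstract` field, subject headings go in `keywords`, and a search looks through all of them. This is not JabRef's code, just a sketch of the idea; the field choice and function names are assumptions.

```python
def format_entry(key, fields):
    """Render a dict of fields as an @article BibTeX entry."""
    lines = [f"@article{{{key},"]
    for name, value in fields.items():
        lines.append(f"  {name} = {{{value}}},")
    lines.append("}")
    return "\n".join(lines)


def search_entries(entries, term, fields=("title", "abstract", "keywords")):
    """Return the citation keys of entries whose chosen fields contain the term."""
    term = term.lower()
    return sorted(
        key
        for key, entry in entries.items()
        if any(term in entry.get(name, "").lower() for name in fields)
    )
```

Since the notes are stored inside the reference itself, one search covers both the bibliographic data and whatever was written down while reading.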
DSpace? (Score:2)

by leenks ( 906881 ) writes:

DSpace ? http://www.dspace.org/ [dspace.org]
Beagle (Score:2)

by WillKemp ( 1338605 ) writes:

It may be worth looking at Beagle: http://beagle-project.org/ [beagle-project.org] - it's Linux only though.
Zotero (Score:3, Informative)

by k2enemy ( 555744 ) writes: on Wednesday April 08, 2009 @04:35PM (#27509107)

Zotero [zotero.org] might be useful.

- Re: (Score:2, Informative)
  
  by hnwombat ( 172691 ) writes:
  
  I'll second this one. I'm a doctoral student, and have been using it to handle my research. A nice, simple firefox-based interface. It'll snarf references right off pages from search engines. You can attach things, including links to or copies of pdfs to those references, summaries, etc. You can apply keyword tags to citations, and you can organize the citations into a nice directory tree.
  To get them out, there's a sweet interface available for open office. I think it's also available for Word, but I use M$
  - Re: (Score:2)
    
    by langelgjm ( 860756 ) writes:
    
    You can export some or all your references to a file, sneakernet the file to the new computer, import it into zotero, and you're done.
    If you're brave, you can try the beta, which includes syncing functionality. I currently use it to sync between my XP desktop and my OS X laptop, and I haven't had any problems so far (though I make a point of it to back up the database regularly in case something goes wrong).
    The Word integration for Office 2007 is fantastic; for Office 2008 on the Mac, you need Leopard, and it's slightly clunkier, but works.
  - Re: (Score:2)
    
    by Trepidity ( 597 ) writes:
    
    The main thing that turns me off about Zotero is the poor browsing interface. Why can't I click on an author's name to get all papers in my database by that author? Why can't I click on a journal to get all papers in that journal? Why do I always have to go through a damn Advanced Search to do any of this?
    As a result, I use Aigaion [aigaion.nl] instead. It's web-based, though, so you'll need somewhere you can run it (I've got a small VPS that it's on).
- Re: (Score:2, Informative)
  
  by yes it is ( 1137335 ) writes:
  
  Thirded. I've also built my own phrase index on top of zotero using Perl and OpenCalais.
- Re: (Score:2)
  
  by edward2020 ( 985450 ) writes:
  
  I'll even fourth it.
Cheap scanner, expensive OCR software (Score:5, Insightful)

by MartinSchou ( 1360093 ) writes: on Wednesday April 08, 2009 @04:35PM (#27509109)

Most high-end consumer all-in-one printers come with an ADF capable of handling most types of paper as long as it's not crumpled up, stapled or the like. Some of the more expensive ones can do two-sided scanning to a network repository. I work with consumer-level HP printers, and the Office Jet Pro L7xxx series does this. The Pro L7680 is 200 US$ at Newegg.com.
Now, while that printer comes with some okay OCR software, it's basically thrown in for free. A lot of the stuff in the kind of documents you're talking about is going to be math heavy mixed in with images, graphs, tables and personal notes. I don't know any OCR software that'll transform that into exact replicas via LaTeX or the like, but I'm pretty sure the really expensive OCR software will translate the written text, reproduce the rest as images and neatly transform it into some easily searchable PDF documents.
That brings you from paper to searchable PDF files. Categorizing those is probably not all that hard. I'd suspect you could do some text analysis and break each document down into a list of technical terms and the number of times they're used.
A document that uses the Casimir effect in a single example is probably not a document related to that specific field, whereas documents that talk about it repeatedly, referencing known articles on the subject etc., are. Sorting that out ... beyond my knowledge.
I'd suggest you start out with an experiment. Take a "typical" page from the binders, scan it to a non-compressed image at a decent resolution (e.g. TIFF). We usually recommend around 300 dpi for OCR - beyond that you start picking up things that we don't really look for when we're reading.
Test that page against various OCR software, see what they reproduce as the output. Pick the one that's the best result.
And don't worry - the OCR software is going to be the single most expensive purchase in this equation. I am however more than ready to be proven wrong in that regard.
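The text-analysis step described above (break each document into terms and count them, so that a term used once in passing doesn't count as a topic) can be sketched in a few lines. The stopword list, the minimum word length and the threshold of three uses are arbitrary assumptions, and real technical terms are often multi-word phrases that this simple version would miss.

```python
import re
from collections import Counter

# A deliberately tiny stopword list; a real one would be much longer.
STOPWORDS = {"this", "that", "with", "from", "have", "which", "were", "been", "their"}


def term_counts(text, min_length=4):
    """Count lowercased words of at least min_length letters, minus stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if len(w) >= min_length and w not in STOPWORDS)


def main_topics(text, min_count=3, top_n=5):
    """Terms used at least min_count times, most frequent first."""
    counts = term_counts(text)
    return [term for term, n in counts.most_common(top_n) if n >= min_count]
```

Run over the OCR output of each scan, this would give a first guess at subject headings that a human can then correct.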

- Re: (Score:2)
  
  by serviscope_minor ( 664417 ) writes:
  
  scan it to a non-compressed image at a decent resolution (e.g. TIFF)
  Why noncompressed? Not all compression is lossy. The G4 lossless compression for binary images in TIFF is excellent, and artifact free since it's lossless.
Personal Document Management (Score:4, Interesting)

by steveha ( 103154 ) writes: on Wednesday April 08, 2009 @04:38PM (#27509157) Homepage
I am hoping that someone will make a nice personal document management package as free software.
If you use Windows, you can buy this:
http://www.nuance.com/paperport/ [nuance.com]
The basic features would be:
- Scan in a document (group multiple pages into a single PDF)
- Easily scan a page and insert it into a pre-existing PDF (if you missed a page yesterday, today go back and put it in)
- OCR the documents and provide an index to allow searching
- Provide a really convenient photocopier feature (scan+print)
- Fast and easy. Scan in color, but detect black-and-white and auto-convert to greyscale. Do not pop up any dialogs; when the user clicks on the "Scan!" button, start scanning.
- Also allow dropping in saved HTML pages, OpenOffice.org documents, etc. Manage the user's saved documents, no matter what kind of documents they are.
In a perfect world, the GNOME guys and the KDE guys would both start competing over who can make the slickest product and we all would win.
steveha
Summation (Score:2, Interesting)

by Anonymous Coward writes:

Law firms use a program called Summation to do this all the time. They take all the paper docs and electronic docs in a case (sometimes tens of thousands of pages) and load them into this program as TIFFs or PDFs. They are then OCR searchable. Not nearly as good of a search algo as something like google, as it is purely Boolean...but it gets the job done. Not sure about cost, but your university may have a license. An alternative is a program called Concordance, which does the same thing. One last opt
Quick and dirty solution (Score:3, Informative)

by oldhack ( 1037484 ) writes: on Wednesday April 08, 2009 @04:40PM (#27509205)

Assuming you have electronic versions of the documents in one format or another, stick them all in a file system and use desktop search (MS or Google). More than that and you're looking at a good bit of time and money.

- Re: (Score:2, Funny)
  
  by pete-classic ( 75983 ) writes:
  
  Maybe you're unfamiliar with three ring binders [keyproductsinc.com].
  They're archaic devices used to store non-electronic paper-based documents. You can ask your granddad about them.
  I'm beginning to think these kids today don't realize that the desktop metaphor is . . . a metaphor!
  -Peter
  - Re:Quick and dirty solution (Score:4, Funny)
    
    by oldhack ( 1037484 ) writes: on Wednesday April 08, 2009 @05:22PM (#27509855)
    
    I am a granddad, you insensitive clod.
    
Beagle or Google Desktop (Score:2)

by janoc ( 699997 ) writes:

I am using Beagle and/or Google Desktop for exactly this task. Both are able to index PDFs and search them. Unfortunately, they will not deal with PDFs directly from scanner (large images), you need to process those with OCR first. I believe that both Beagle and Google Desktop are able to search the metadata too, so even for image documents you can still search authors and titles if you are diligent and fill them in when the document is scanned. This needs a bit of discipline and insight into how the data a
(Let me google that for you)^2 (Score:2, Offtopic)

by clinko ( 232501 ) writes:

If only GOOGLE had a way to search your DESKTOP [google.com], that would be perfect.
Papers for Mac OS X and iPhone (Score:3, Insightful)

by 200_success ( 623160 ) writes: on Wednesday April 08, 2009 @04:48PM (#27509321)

For Mac OS X, try Papers [mekentosj.com]. There's also an iPhone/iPod Touch version. Mac OS is great at handling PDFs in general.

DevonThink on a Mac (Score:3, Insightful)

by Autumnmist ( 80543 ) writes: on Wednesday April 08, 2009 @04:51PM (#27509385)

If your professor uses a Mac, consider Devonthink by DevonTechnologies.
http://www.devon-technologies.com/products/devonthink/index.html [devon-technologies.com]
For searching, the software has an artificial intelligence system, keywords, meta data. It can store PDFs, word docs, emails, notes. It can be integrated with a scanner so you can scan and store documents in the database. It's got OCR built in...
I have DevonThink (personal edition, not Pro/Office) and I don't even use 1/10 of the power built into this system. You should check out some of the reviews online and videos of people using DevonThink.

Almost recent diss on the subject (Score:2)

by foobsr ( 693224 ) writes:

Bibliography Tools in the Context of WWW and LaTeX [citeulike.org]

Looks like that covers your needs.

CC.
So, what I think you're asking for is... (Score:5, Informative)

by Basilius ( 184226 ) writes: on Wednesday April 08, 2009 @05:00PM (#27509515)

...something like this:
1. You want to be able to store documents that currently exist electronically, and also handle documents you're going to scan. The latter may, or may not, be OCR'd.
2. You want to attach keywords to the articles, and be able to bring up a list of articles that match some arbitrary combination of these keywords.
3. Full-text search isn't as important (but would be useful if available).
If that's the case, I'm thinking Alfresco [alfresco.com] might be what you're looking for. Multi-platform, open source, java-based content repository. Supports document tagging (and loads, loads more). Relatively easy to use right out of the box, and has a CIFS interface so you can just create a project and simply tree-copy your current documents into the project. Don't let the "enterprise" designation on the software scare you away.
I've actually considered going that route for my own personal document library, but while Alfresco might be one of the only good solutions, it's like killing a fly with a cannon.
I'm frankly amazed that with the "paperless living" meme currently going through the productivity circles that someone hasn't come up with a simple tool to do something just like what you're looking for: point it at a root folder, let it suck in all the files, then start tagging away. Search with keywords or filenames or both, and provide a clickable list of hits. Full-text search isn't needed, as there's already a ton of tools out there that'll happily index your hard drive for you.
And, if a tool like that exists, could someone point me to it, please?
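For what it's worth, the tool described above (point it at a root folder, pull in the files, tag them, search by keyword or filename) can be roughed out in a few lines. This is only a minimal sketch, not an existing product: it assumes Python with the standard-library sqlite3 module, and the table layout and the function names open_index, ingest, tag and search are all made up for illustration.

```python
import os
import sqlite3


def open_index(db_path=":memory:"):
    """Create (or open) the tag index; this schema is purely illustrative."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS files "
                "(id INTEGER PRIMARY KEY, path TEXT UNIQUE)")
    con.execute("CREATE TABLE IF NOT EXISTS tags "
                "(file_id INTEGER, tag TEXT, UNIQUE(file_id, tag))")
    return con


def ingest(con, root):
    """Walk the root folder and register every file found under it."""
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            con.execute("INSERT OR IGNORE INTO files (path) VALUES (?)",
                        (os.path.join(dirpath, name),))
    con.commit()


def tag(con, path, *tags):
    """Attach one or more keywords to an already-ingested file."""
    row = con.execute("SELECT id FROM files WHERE path = ?", (path,)).fetchone()
    if row is None:
        raise KeyError(path)
    for t in tags:
        # Tags are stored lower-case so searches are case-insensitive
        con.execute("INSERT OR IGNORE INTO tags VALUES (?, ?)", (row[0], t.lower()))
    con.commit()


def search(con, tags=(), name_contains=""):
    """Return paths that carry ALL given tags and contain the given substring.

    Note: '%' and '_' in name_contains act as SQL LIKE wildcards.
    """
    sql = "SELECT path FROM files WHERE path LIKE ?"
    params = ["%" + name_contains + "%"]
    for t in tags:
        sql += " AND id IN (SELECT file_id FROM tags WHERE tag = ?)"
        params.append(t.lower())
    return [r[0] for r in con.execute(sql + " ORDER BY path", params)]
```

Each extra tag in a search narrows the result, which gives the "arbitrary combination of keywords" from point 2. Full-text search is deliberately left to whatever desktop indexer is already installed.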

- Re: (Score:2)
  
  by datababe72 ( 244918 ) writes:
  
  There's a little company called Arctus that does something like what you asked for (tagging a directory structure), and some more. It's aimed at pharma/biotech researchers, but from what I've seen of them they'd be happy to customize for you, for a price.
- Re: (Score:2)
  
  by Sooner Boomer ( 96864 ) writes:
  
  Thanks. I looked at the web page and went down several levels looking at the features of the document management software. I'm put off somewhat by the fact that they don't list any prices. Instead, you have to contact them and they quote you a price for your bespoke application. I'll keep this in mind, and recommend it as an alternative, but for now I'm going to keep looking.
  - Re: (Score:2)
    
    by Basilius ( 184226 ) writes:
    
    That's the enterprise version of Alfresco. Check out the Labs version. Free, open source version.
Suggestion (Score:4, Insightful)

by vondo ( 303621 ) writes: on Wednesday April 08, 2009 @05:11PM (#27509705)

I wrote and maintain a project to do this:
http://sourceforge.net/projects/docdb-v/ [sourceforge.net]
"DocDB is a powerful and flexible collaborative web based document server which maintains a versioned list of documents. Information maintained in the database includes, author(s), title, topic(s), abstract, access restriction information, etc."
It's intended for collaborations, but groups from 5 to 500 use it.

- Re: (Score:2)
  
  by CarpetShark ( 865376 ) writes:
  
  The description of your project does not suggest either full-text searching, or PDF capabilities. I've been looking for something that can index all of my ebooks and do full-text searches that bring up the pages (or better, chapters) that are most relevant.
  - Re: (Score:2)
    
    by vondo ( 303621 ) writes:
    
    It's agnostic as to the document type (PDF, HTML, Word, whatever) and can use plugins (beagle, etc.) to do full-text search. Generally we find that proper metadata to describe the document (title, abstract, topics, keywords) is much more useful than a full-text search. But yeah, it won't do what you mention.
    - Re: (Score:2)
      
      by CarpetShark ( 865376 ) writes:
      
      That's probably true for cases when content is added by people who wrote the content or understand it well. My interest is in building a library of topics I DON'T know well though. For instance, I want to have ebooks available for random searches which turn up techniques and algorithms I'm unaware of, yet may prove relevant to my work, as and when I find some particular problem worth researching. It's very difficult to catalog that sort of on-demand knowledge ahead of time. Additionally, I don't care if
- Re: (Score:2)
  
  by Sooner Boomer ( 96864 ) writes:
  
  Thanks for the recommendation. I see from the web page that some of the dependencies are Perl and MySQL, and it can use a web front-end. Apache? I also see it runs on Linux - Debian or Ubuntu Server (or something else?)?
  - Re: (Score:2)
    
    by vondo ( 303621 ) writes:
    
    Right, MySQL, Perl, Apache. Runs on RedHat, Ubuntu, Mac OS X. Anything Unix like.
I, Librarian seems pretty close (Score:2)

by nniillss ( 577580 ) writes:

What the submitter needs (and I also need) is an organizer for scientific papers with an interface for standard fields such as authors, journal, title, DOI, HTTP links, etc. I, Librarian [bioinformatics.org] seems to fulfill this need; unfortunately, it has direct interfaces (for retrieving the PDF and meta information at the same time) only for PubMed.
If anybody knew of (or planned for) an adaptation to physics (with interfaces to arXiv.org [arxiv.org], the APS journals [aps.org] and ideally other journals), I would be very interested (even as a paying
- - Integration with NASA ADS (Score:2)
    
    by nniillss ( 577580 ) writes:
    
    No, I (theoretical solid state physicist) didn't know about NASA ADS, but it seems to cover most of the relevant literature (including arXiv and APS journals). So yes, an integration into I, Librarian would be great.
    In CERN DS (with certainly a focus on high-energy physics) my papers are shown only up to 2006; so this database appears useless for me.
Talk to your school librarian. (Score:2)

by phallstrom ( 69697 ) writes:

Don't librarians (particularly those in the library science realm) deal with this sort of thing all the time?
Digital Archive software (Score:2, Insightful)

by who's got my nicknam ( 841366 ) writes:

What you are looking for is a proper archiving application. I suggest ICAAtom [ica-atom.org]. Scan your documents as TIFFs if you are going to be saving them as images; if your hardware will do OCR nicely, then you would be better off scanning them to text, as they will be more searchable. ICA Atom supports all of the standard archiving metadata protocols, of course, so you will have good searching capabilities as long as you enter proper metadata.
Why Not To (Score:5, Insightful)

by DynaSoar ( 714234 ) writes: on Wednesday April 08, 2009 @05:52PM (#27510271) Journal

There are at least two reasons the professor's method is beneficial:
1. By having to search by hand and scan by eye, he becomes more familiar with more of what's actually in the papers. His familiarity with the material gets better.
2. Repetitive scanning/searching of the papers leads to the mind partially wandering while doing so. This can result in inspiration and intuitive leaps.
Both methods together are preferable. But good luck on getting the professor to use them. You may have better luck getting him to create his own indices or tables of contents on paper to put in the binders. With his familiarity it shouldn't be too difficult.

- Re: (Score:2)
  
  by BitZtream ( 692029 ) writes:
  
  Thank you for this post.
  I love Google, and I'm only 30 so I'm not an old school engineer or anything, but I still have a printed copy of almost every reference manual I have ever used.
  I don't really know why I prefer the dead tree format as much as I do, and I still usually keep a searchable copy available at all times as well, since there are times when I know what I'm looking for and can jump straight to it.
  Your first point sheds some light on my habits actually, perhaps subconsciously I knew this. Perhaps
Zotero, Mendeley (Score:2, Informative)

by pesho ( 843750 ) writes:

You should try Zotero [zotero.org] or Mendeley [mendeley.com].
Zotero is a Firefox extension that can grab research papers directly from journal or library web sites. It organizes the papers in collections, has keywords (they call them tags), and can automatically index the PDFs. The metadata is also stored on a remote server, and you can browse through it using a web interface. You also get Word and OpenOffice plugins to insert citations in the papers you write. The plugins are a little rough around the edges, but are usable. The
Try a server based solution like RefBase (Score:2)

by jjh37997 ( 456473 ) writes:

What's better than having a program on your desktop that can search through your files? How about an online database of your files that you can access from any internet-connected computer in the world (www.refbase.net)?
- Re: (Score:2)
  
  by mspohr ( 589790 ) writes:
  
  I've set up RefBase (www.refbase.net) for several sites. It works great and will do just what he needs. It also has the ability to generate standard format citations.
SIMPLE solution, but not FOSS (Score:2)

by macraig ( 621737 ) writes:

1a. Windows Desktop Search with added "IFilters", or
1b. Google Desktop Search
I recommend (1a), amazingly, because once you've located and installed all the third-party IFilters - including one(s) for PDF files - WDS will be able to index and make searchable MANY more files than GDS (in my case, about THREE TIMES as many). If the original PDFs from which so much of the binder material was printed are still available, then your effort with the following is greatly reduced.
2. Good major-manufacturer scanner
The University of Oklahoma participates in JSTOR (Score:2)

by tlambert ( 566799 ) writes:

The University of Oklahoma participates in JSTOR:
http://www.jstor.org/ [jstor.org]
They also appear to be EBSCO participants:
http://search.ebscohost.com/ [ebscohost.com]
I'm pretty sure "the 20th century" is right there already, if you can drag him to the library.
Note: This isn't going to work for people not affiliated with an institution. Both of these services make paper journal content available online for subscription fees paid by the institution (or business), so unless you are in the "big boy clique", you're not going to have acc
Desktop Search? (Score:2)

by natmsincome.com ( 528791 ) writes:

Since you only have one person that needs to access the files I would just use Desktop Search. Personally I like Google Desktop ( http://desktop.google.com/ [google.com] ) & Copernic Desktop Search ( http://www.copernic.com/en/products/desktop-search/index.html [copernic.com] ). Here is an article reviewing some of them - http://lifehacker.com/400365/five-best-desktop-search-applications [lifehacker.com].

The main thing that you need to do is OCR the documents when you scan them in (You can convert non-OCR PDFs into OCR PDFs but I don't know any
I wrote a few articles about this (Score:3, Insightful)

by nbauman ( 624611 ) writes: on Wednesday April 08, 2009 @07:44PM (#27511755) Homepage Journal

I wrote a few articles about this for Law Office Computing magazine
http://www.nasw.org/users/nbauman/txtsrch.htm [nasw.org]
http://www.nasw.org/users/nbauman/lawdb.htm [nasw.org]
http://www.nasw.org/users/nbauman/discover.htm [nasw.org]
It was a long time ago, and the software and hardware have changed, but the concepts are still the same, and the costs are a lot less.
Free text search works reasonably well with small databases, but it doesn't work with big databases. If you want precision, you have to develop a set of tags (we called them keywords). A good model is Pubmed http://www.ncbi.nlm.nih.gov/sites/entrez?db=pubmed [nih.gov]. The New York Times used to have a great text search, but they changed it (eliminated tags) and now it's awfully difficult to get through.
Basically, the researcher has a body of knowledge, and he already has a filing and organizing system (in this case, a looseleaf binder system, which is a pretty good start). You should usually try to replicate his filing and organizing system in the database, for example one field for the looseleaf, another field for the tab, and then some goodies that he couldn't search the looseleafs for, like date, author, journal citation, etc. It would probably be useful to have a controlled vocabulary of a few good keywords, but keywords should be selected carefully so they're unique and don't duplicate.
I assume he doesn't have the PDFs any more. That would have made it a lot easier.
It would be handy to scan every word of (most) every document into full text, but it may not be necessary. Why do you need everything in full digital text? Scanning of unconventional text takes a human proofreading step, and probably isn't worth it.
He'll probably want to keep complete images of the original documents anyway along with the text. You should do a few tests to see how much resolution you need. 600 dpi works for ordinary text like they use in the printed newspaper. But if you want journal articles to come out, with footnotes and superscripts, you might need higher resolution.
Somebody is going to have to enter the fields manually, which isn't too bad if you've got a thousand records (looseleaf tabs) to enter (about 20 hours), but can get difficult if you've got an order of magnitude more.
Scanning should be straightforward, if everything is neatly filed away in looseleaf books already. There are many cheap consumer-grade scanners on the market that can get 600-2400 dpi (the bundled software is probably more important than the hardware specs) but they can take up to 1 minute a page; there are more expensive scanners in the $1,000-and-up range that can go a lot faster. If you're at a university, look around for somebody who already has one. Law firms and libraries do a lot of this.
You might start by estimating the number of pages and documents you have.
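As a back-of-the-envelope aid for that estimate, here is a small sketch that just multiplies out the two rates quoted above (about 20 hours of manual entry per thousand records, and up to 1 minute per page on a consumer scanner). The function name estimate_hours and its parameters are made up for illustration, and the defaults are only those rough figures, not measurements.

```python
def estimate_hours(documents, pages, seconds_per_page=60.0,
                   hours_per_1000_records=20.0):
    """Rough effort estimate from the rates quoted in this comment.

    Returns (hours of manual field entry, hours of scanning).
    """
    entry_hours = documents / 1000 * hours_per_1000_records
    scan_hours = pages * seconds_per_page / 3600
    return entry_hours, scan_hours
```

Plugging in your own counts for documents and pages shows quickly whether the scanning or the data entry dominates, and whether borrowing a faster scanner is worth the trouble.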
But let me suggest an alternative: Instead of scanning everything, just enter everything into a database without scanning it. Does he really need full text search? Or would it be enough to search his looseleaf books by a dozen fields? He doesn't have to print the document out from an image file, it's right there in his looseleaf books.
If anybody knows of up-to-date articles on this subject, I'd love to know the citation.
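To make the catalogue-without-scanning idea concrete, here is a minimal sketch of one record per looseleaf tab, with a small controlled vocabulary for keywords. It is an illustration only, assuming plain Python with no database behind it; the Record fields, the find helper and the three vocabulary terms are hypothetical placeholders, and the real fields and keywords would come from the researcher's own filing system.

```python
from dataclasses import dataclass, field

# Hypothetical controlled vocabulary; the real list would come from the researcher.
VOCABULARY = {"nanotubes", "synthesis", "microscopy"}


@dataclass
class Record:
    binder: str          # label on the looseleaf binder
    tab: str             # tab within that binder
    author: str = ""
    journal: str = ""
    year: int = 0
    keywords: set = field(default_factory=set)

    def __post_init__(self):
        # Normalise case, then reject anything outside the controlled vocabulary
        self.keywords = {k.lower() for k in self.keywords}
        unknown = self.keywords - VOCABULARY
        if unknown:
            raise ValueError(f"not in controlled vocabulary: {sorted(unknown)}")


def find(records, keywords=(), **fields):
    """Return records carrying every keyword and matching every given field exactly."""
    wanted = {k.lower() for k in keywords}
    return [r for r in records
            if wanted <= r.keywords
            and all(getattr(r, name) == value for name, value in fields.items())]
```

A call such as find(records, keywords=["synthesis"], year=2004) then answers "which binder and tab is it in?" without any image files at all. Rejecting unknown keywords at entry time is what keeps the vocabulary from drifting into near-duplicates.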

Bookends (Score:2)

by martinX ( 672498 ) writes:

I didn't read the article and I'm not sure exactly what you want, but take a look at Bookends from Sonny Software.
Saved my butt, made life easy.
http://www.sonnysoftware.com/ [sonnysoftware.com]
Reference Management and Bibliography Software for Mac OS X
ReadIris Pro (Score:2)

by fast turtle ( 1118037 ) writes:

is the best Scan/OCR app you can buy, with many nice features. The first is that it's reasonably priced. The second is that it can import PDF files directly and OCR/index them, and it handles almost every language on the planet. Definitely worth looking into, as the school may actually have a site license, which means you don't have to buy anything.
Hire an Information Architect. (Score:2)

by jddj ( 1085169 ) writes:

Seriously. This is what we get paid to do. There's far too much to communicate on a forum, and if the SNR here is typical, you'll get awful, unrelated and just plain wrong advice.
If you can't hire, see if your school has a library science program and look for a good intern.
Failing that, read the Polar Bear book (Rosenfeld & Morville, pub: O'Reilly) yourself and follow the threads to resources particular to your problem.
Tactical help: populate the keywords, title, subject properties in your PDFs and Offi
My father's small business had a similar problem (Score:2)

by bertok ( 226922 ) writes:

My father runs a small business and has to track a bunch of paperwork for each client, so I got him a cheap LED lit flatbed scanner, but like everyone else, he discovered that it was too slow to manually scan in each page, even if the scanner itself was quite fast.
He eventually figured out that the fastest scanning technique is not to use a scanner at all, but a digital camera. He made a rig with a marked out area the size of an A4 sheet of paper, and then he attached a camera mount so that the camera would
doh! (Score:2)

by Joebert ( 946227 ) writes:

I couldn't tell you how many times I've gone to use a phonebook or reference manual and tried to flip through to the search page.
PHP Bibtex database manager (Score:2)

by Ardeaem ( 625311 ) writes:

Assuming you use latex, PHP bibtex database manager might be a good option. I use it, and it is quite handy if you want to share the database among several researchers.
Link [cnri-bourges.org]
Fabulous question (Score:2)

by cinnamon colbert ( 732724 ) writes:

Too bad the OSS community has no real answers.
this is something i submitted a week or so ago:
"I'm looking for software that can help my company manage information in documents that may be in pdf, doc or web form. I work for a biotech company with 15 people, and we have large numbers of documents that range from very technical scientific publications (usually pdf) to company reports like 10-Ks, to web pages to newspaper articles to pictures. We use these documents to review and stay current with the scientifi
Perhaps you are part of the problem. (Score:2)

by BitZtream ( 692029 ) writes:

I realize that Slashdot is going to take the technology solution as the only one, and in this case it's probably the right way to go, but ...
People have managed documents and information like this for centuries and it worked rather well. Perhaps you should stop being lazy and learn how to use traditional reference materials, as you're going to need this skill for a few more years anyway.
Those skills are still useful today. Just because Google can index and allow you to find words in the documents it knows a
Do it the easy & obvious way (Score:2)

by sribe ( 304414 ) writes:

Good grief, forget your FOSS ideology, scan them to PDF, OCR them using Acrobat Pro (the education price is ridiculously cheap), store them on a Mac, and use the built-in Spotlight to search them.
- Re: (Score:3, Insightful)
  
  by shaitand ( 626655 ) writes:
  
  Copying excerpts for educational use is actually an explicitly protected fair use case. The copyright act actually uses it as an example if I remember correctly.
  The parent said he is copying parts of texts, not entire books.
  - Re: (Score:2)
    
    by Red Flayer ( 890720 ) writes:
    
    Copying excerpts for educational use in a classroom setting is actually an explicitly protected fair use case.
    
    This is not a classroom setting, this is a research setting. Very different.
    
    Though it may be covered under other criteria of fair use, the educational purposes exemption from copyright does not apply.
    - Re: (Score:3, Interesting)
      
      by shaitand ( 626655 ) writes:
      
      According to Cornell, limitations on copyright holders are as follows. Note that research and teaching are both explicitly stated cases of fair use exemption.
      107
      Permits the "fair use" of an owner's work without permission - for the purpose of "criticism, comment, news reporting, teaching, scholarship, or research." This exemption outlines four factors that must be met in order to argue a fair use.
      108
      Permits a library or archives to reproduce works for archiving purpose
- Re: (Score:2)
  
  by Fallen Kell ( 165468 ) writes:
  
  I seem to remember something about "educational use" in Section 107 of the Copyright Act....
- - Success = Copyright Problems (Score:2)
    
    by tobiah ( 308208 ) writes:
    
    I agree that for the immediate use listed there are unlikely to be any copyright violations. But if someone were to make a good collection for their lab, that perhaps then became popular in the department, it would start running into copyright gray areas. For example the university discontinues subscribing to a journal, but articles remain available on a broad intranet system. Normally if you already had a copy of the article that's legit, but now a new student has access to articles that were only availab
    - Re: (Score:2)
      
      by imidan ( 559239 ) writes:
      
      You're right, of course, that such a system could run into problems if its use became more widespread. It seems like one option is to keep the content restricted--can't just add in any electronic resource without a thorough understanding of its copyright terms--like certain linux distributions only including unencumbered code, or (in theory) Wikipedia only including unencumbered images.
      Or, go to the effort of keeping track of the copyright terms and encumbrances. This, obviously, is way beyond the scop
- Re: (Score:2)
  
  by tobiah ( 308208 ) writes:
  
  Yup, I've got a crude filing system with hundreds of papers that works great because of Spotlight. I don't bother with OCR for older docs, just punch in some keywords in the file description section. Network drive indexing works if the drive is formatted in HFS+.
- Re: (Score:2)
  
  by Sooner Boomer ( 96864 ) writes:
  
  I know very well what century it is. I was using exaggeration to emphasize how far the professor needs to come to be effective in the current technological environment.
- Re: (Score:2)
  
  by Sooner Boomer ( 96864 ) writes:
  
  "I suppose you meant "drag a professor into the 21st century"."
  No wp, I meant I'm trying to get him to use technology that was available in the '90s. If I can do this, there is hope for progress into *this* century.
