Building a Searchable Literature Archive With Keywords?
Sooner Boomer writes "I'm trying to help drag a professor I work with into the 20th century. Although he is involved in cutting-edge research (nanotechnology), his method of literature search is to begin with digging through the hundreds of 3-ring binders that contain articles (usually from PDFs) that he has printed out. Even though the binders are labeled, the articles can only go under one 'heading' and there's no way to do a keyword search on subject, methods, materials, etc. Yeah, Google is pretty good for finding stuff, as are other on-line literature services, but they only work for articles that are already on-line. His literature also includes articles copied from books, professional correspondence, and other sources. Is there a FOSS database or archive method (preferably with a web interface) where he could archive the PDFs and scanned documents and be able to search by keywords? It would also be nice to categorize them under multiple subject headings if possible. I know this has been covered ad nauseam with things like photos and the like, but I'm not looking at storage as such: instead I'm trying to find what's stored."
Document Management Software and OCR (Score:5, Informative)
From the sound of it, you want to verify that your product supports document tagging (not unlike Slashdot's tagging system I guess) so that he can attach his categories to documents as he puts them in (or more likely as you do the manual labor, right?).
... where he could archive the PDFs and scanned documents and be able to search by keywords?
So, my big concern is the part where you said he scans things from books and articles and so some of the PDFs might just be massive images, right? I don't think you're going to find systems with OCR built in so you might have quite the chore on your hands. If you don't have it electronically or if it's just an image electronically, you may have to implement some sort of process for getting a doc into this system so it can be searched, right? Look into GOCR [sourceforge.net] or Tesseract [linuxjournal.com] if this is the case.
Also, judging by your nickname ("Sooner Boomer"), you're at the University of Oklahoma. Why in the world would you name yourself after a group of people [wikipedia.org] who not only disobeyed the Indian Appropriation Act but also moved out onto Native American territory before it was officially declared property of the United States? And then you also chose "Boomer" which refers to "white settlers who believed the Unassigned Lands were public property and open to anyone for settlement, not just Indian tribes. Their reasoning came from a clause in the Homestead Act of 1862, which said that any settler could claim 160 acres of public land. Some boomers entered and were removed more than once by the United States Army." If you are a descendant of either a Sooner or a Boomer, I respectfully do not agree with their actions.
Re: (Score:2)
Very well done and informative on all counts.
Re: (Score:3, Interesting)
Another obvious, more practical but potentially less powerful method is simply to index all of the printouts with serial numbers and manually create a database of tags with serial numbers.
You then have a library card index system, but electronic. Sure, it won't help with documents that aren't entered properly, but it's dramatically more efficient than thumbing through PDFs until you find what you need.
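A minimal sketch of that electronic card index, assuming Python with its bundled sqlite3 module. The table layout, the function names and the binder labels are all made up for illustration; nothing here comes from an existing tool.

```python
import sqlite3

def open_index(path=":memory:"):
    """Create (or open) the card-index database. Schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS documents (
            serial INTEGER PRIMARY KEY,   -- number written on the printout
            title  TEXT NOT NULL,
            binder TEXT                   -- where the paper copy lives
        );
        CREATE TABLE IF NOT EXISTS tags (
            serial INTEGER NOT NULL REFERENCES documents(serial),
            tag    TEXT NOT NULL,
            PRIMARY KEY (serial, tag)
        );
    """)
    return conn

def add_document(conn, serial, title, binder, tags):
    """Record one printout and any number of tags for it."""
    conn.execute("INSERT INTO documents VALUES (?, ?, ?)", (serial, title, binder))
    conn.executemany(
        "INSERT OR IGNORE INTO tags VALUES (?, ?)",
        [(serial, t.strip().lower()) for t in tags],
    )
    conn.commit()

def find_by_tags(conn, tags):
    """Return (serial, title, binder) rows for documents carrying ALL given tags."""
    wanted = sorted({t.strip().lower() for t in tags})
    if not wanted:
        return []
    marks = ",".join("?" * len(wanted))
    rows = conn.execute(
        f"""SELECT d.serial, d.title, d.binder
            FROM documents d JOIN tags t ON t.serial = d.serial
            WHERE t.tag IN ({marks})
            GROUP BY d.serial
            HAVING COUNT(DISTINCT t.tag) = ?
            ORDER BY d.serial""",
        (*wanted, len(wanted)),
    )
    return rows.fetchall()
```

Because tags live in their own table, one printout can sit under as many headings as you like, which is exactly what the binders can't do.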
Re:Document Management Software and OCR (Score:5, Funny)
If you are a descendant of either a Sooner or a Boomer, I respectfully do not agree with their actions.
Except he's not. He just prematurely ejaculates. And he'd gone all this time with no one drawing attention to it as you just have.
Re: (Score:2)
Unless you're really lucky with the images, OCR requires a lot of work correcting errors. It would probably be less work to just be able to add searchable tags to whatever system is used to store the PDFs and leave them as images.
Re:Document Management Software and OCR (Score:4, Insightful)
Re: (Score:2)
OCR, in my experience, is also crap with equations and technical literature in general. Its linguistic fuzzy matching changes technical words it doesn't recognize into similar words it does, on the basis that, well, shit, it might have just not scanned very well.
Not much way around this.
Re: (Score:3, Funny)
What he needs is a bunch of undergrads or interns to painstakingly transcribe and proofread every scrap and napkin of text!
Re: (Score:2)
Re: (Score:2)
I'm sure there's an undergrad somewhere who'd like a
Re: (Score:3, Insightful)
That depends. If all he needs the OCR for is to build a searchable keyword index, then the error rate can be quite high and he will still get good results. The results of a query should point to the original PDF, not the result of the OCR.
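To make that concrete, here is a rough sketch in Python of an index that tolerates OCR mistakes and only ever hands back the path of the original PDF. It assumes the OCR text has already been extracted into a dict keyed by file path; the function names and the 0.8 similarity cutoff are my own guesses, not anything from a real package.

```python
import re
from collections import defaultdict
from difflib import get_close_matches

def build_index(ocr_text_by_pdf):
    """Map each lowercased word to the set of PDF paths whose OCR text contains it."""
    index = defaultdict(set)
    for pdf_path, text in ocr_text_by_pdf.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(pdf_path)
    return index

def search(index, keyword, cutoff=0.8):
    """Return sorted PDF paths containing a word similar to `keyword`.

    Near-matches are accepted so that an OCR slip such as 'chem1cal'
    still turns up for a query of 'chemical'.
    """
    candidates = get_close_matches(keyword.lower(), list(index), n=10, cutoff=cutoff)
    hits = set()
    for word in candidates:
        hits |= index[word]
    return sorted(hits)
```

The OCR text is thrown away after indexing as far as the reader is concerned: a hit is just a pointer to the scanned image. A lower cutoff finds more damaged words but also more false positives, so it would need tuning against real scans.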
Re: (Score:2)
Wouldn't it be even less work to get somebody else (TAs or some other form of free labour that needs to be on the prof's good side) to do it?
Re: (Score:3, Interesting)
Ugh... for a second there, I thought Clippy had started posting to Slashdot. Would you like help with that?
I don't think he's looking for a DMS, which includes lots of things like workflows, audit trails, etc. DMSs are typically used to make an office go paperless, but he's not looking for a processing and tracking mechanism. He's looking for an easy way to create a searchable archive index.
My suggestion? Since he's a profe
Re: (Score:3, Insightful)
It sounds like "going paperless" is exactly what he wants. It so happens that his office deals in research materials, but that doesn't really change the objective: stop using unwieldy and resource-intensive paper documents in favor of highly indexed digital ones.
Out of curiosity, have you considered a wikiesque system... autolinking titles and keywords between articles and some sort of glossary (but not the "let's have everybody able to edit it because everything is an opinion and all opinions are equally i
Comment removed (Score:4, Interesting)
OCR is awful (Score:2)
OCR is pretty nasty stuff and it doesn't work very well at all. The OCR results should probably only be used to generate your index and keywords.
Actually accessing the document should show the original PDF, not the error-riddled OCR scan of it.
wow.. (Score:4, Insightful)
Two years or so ago I bought a Fujitsu fi-5120C duplex scanner for my small business. It came with Adobe Acrobat.
I scan every bill, piece of correspondence, notice, and everything else to PDF, then I throw the paper the hell away.
The version of Acrobat included does OCR: I open Acrobat, choose "create PDF from scanner", and scan away.
I can mix B&W and color, or duplex and simplex, within one scan job.
I can open an existing PDF and append to it.
I save everything to an Infrant NAS box.
I can go to Windows search, type in 1179.21 (I actually did this one once),
set it to look INSIDE the files of that directory, and get results that include
a soda delivery notice, a soda invoice, and the bank statement where I paid it off.
They have other scanner models that combine sheetfed and flatbed...
Here is a beauty:
http://www.fujitsu.com/us/services/computing/peripherals/scanners/workgroup/fi-6230.html [fujitsu.com]
The bigger problem is not OCR, it's which scanner. (Score:3, Informative)
The big attraction of the fi-6130 is its speed: 40 pages per minute.
If you are interested, I suggest you download the manual [fujitsu.com]. (PDF)
The manual talks about a connection for an "imprinter", wh
Re:The bigger problem is not OCR, it's which scann (Score:2, Informative)
Re: (Score:2)
We just replaced an old 3097 with the 6130 and are waiting to hear how it holds up. Note, most of these scanners are deployed to remote scan areas in the hospital where it is the
Re: (Score:3, Informative)
For an Open Source DMS that generates searchable PDF Files, try ArchivistaBox: http://sourceforge.net/projects/archivista/ [sourceforge.net]
Tesseract (including Fraktur / blackletter recognition) and the Linux port of Cuneiform (BSD licence) OCR engines are used for text recognition. The hocr2pdf module (see http://www.exactcode.de/ [exactcode.de]) is used to generate the searchable PDF files.
(http://sourceforge.net/forum/forum.php?forum_id=868471)
Re: (Score:2, Informative)
Archivista seems to be a solution worth looking into for him. I guess he has to try and install the software and test it himself, because the explanation on the (English) website is almost incomprehensible -- I understand spoken German, but mixing German and English, ugh!
I work with Alfresco, which is a nice DMS, but without the integrated OCR that Archivista seems to provide. Alfresco can integrate with various OCR solutions though, has a very active community -- and a comprehensible website!
Re: (Score:2, Insightful)
Re: (Score:2)
Re: (Score:2)
That's the point. It's a user name. One based (probably) on the words to a college fight song rather than the actions of two small groups of people in the past. Why beat him up about it?
Nice Godwin, but I would venture to say there are very few college fight songs that reference Hitler, or Satan, so it's not really a good parallel.
Re: (Score:2)
Re:Document Management Software and OCR (Score:5, Insightful)
Re: (Score:2, Insightful)
OCR, yes, but does he need management (Score:2)
I think what you are looking for is something called "document management [google.com]" software.
... where he could archive the PDFs and scanned documents and be able to search by keywords?
I agree with the OCR requirement, but if he just needs to search the resulting PDFs, wouldn't DocSearcher [sourceforge.net] do the job for him? I've found it trivial to set up and run and it's certainly helped me keep track of docs etc.
fox? (Score:5, Funny)
I'm trying to help drag a professor I work with into the 20th century
Maybe after that, you should try to bring him into the 21st century. You know, the one where PDFs exist?
Re:fox? (Score:5, Funny)
Re: (Score:3, Funny)
OWL (Score:2)
http://owl.anytimecomm.com/ [anytimecomm.com]
we use this at my office. Works well for us.
Try Papers (Score:4, Informative)
Reference management software (Score:3, Insightful)
Re: (Score:2, Interesting)
Re: (Score:3, Interesting)
Re: (Score:2, Informative)
You can also set your preferences in Google Scholar to provide a link to the paper's citation in BibTex formatting. It isn't always 100% accurate, but it does save a lot of data entry time.
Re: (Score:2)
I'm just digging into my master's and haven't had the best text processing skills, so I'm interested in how others do it.
DSpace? (Score:2)
DSpace ? http://www.dspace.org/ [dspace.org]
Beagle (Score:2)
It may be worth looking at Beagle: http://beagle-project.org/ [beagle-project.org] - it's Linux only though.
Zotero (Score:3, Informative)
Zotero [zotero.org] might be useful.
Re: (Score:2, Informative)
I'll second this one. I'm a doctoral student, and have been using it to handle my research. A nice, simple Firefox-based interface. It'll snarf references right off pages from search engines. You can attach things to those references, including links to or copies of PDFs, summaries, etc. You can apply keyword tags to citations, and you can organize the citations into a nice directory tree.
To get them out, there's a sweet interface available for OpenOffice. I think it's also available for Word, but I use M$
Re: (Score:2)
You can export some or all of your references to a file, sneakernet the file to the new computer, import it into Zotero, and you're done.
If you're brave, you can try the beta, which includes syncing functionality. I currently use it to sync between my XP desktop and my OS X laptop, and I haven't had any problems so far (though I make a point of it to back up the database regularly in case something goes wrong).
The Word integration for Office 2007 is fantastic; for Office 2008 on the Mac, you need Leopard, and it's slightly clunkier, but works.
Re: (Score:2)
The main thing that turns me off about Zotero is the poor browsing interface. Why can't I click on an author's name to get all papers in my database by that author? Why can't I click on a journal to get all papers in that journal? Why do I always have to go through a damn Advanced Search to do any of this?
As a result, I use Aigaion [aigaion.nl] instead. It's web-based, though, so you'll need somewhere you can run it (I've got a small VPS that it's on).
Re: (Score:2, Informative)
Re: (Score:2)
Cheap scanner, expensive OCR software (Score:5, Insightful)
Most high-end consumer all-in-one printers come with an ADF capable of handling most types of paper as long as it's not crumpled up, stapled or the like. Some of the more expensive ones can do two-sided scanning to a network repository. I work with consumer-level HP printers, and the Officejet Pro L7xxx series does this. The Pro L7680 is US$200 at Newegg.com.
Now, while that printer comes with some okay OCR software, it's basically thrown in for free. A lot of the stuff in the kind of documents you're talking about is going to be math-heavy, mixed in with images, graphs, tables and personal notes. I don't know of any OCR software that'll transform that into exact replicas via LaTeX or the like, but I'm pretty sure the really expensive OCR software will translate the written text, reproduce the rest as images, and neatly turn it all into easily searchable PDF documents.
That brings you from paper to searchable PDF files. Categorizing those is probably not all that hard. I'd suspect you could do some text analysis and break each document down into a list of technical terms and the number of times they're used.
A document that uses the Casimir effect in a single example is probably not a document related to that specific field, whereas documents that talk about it repeatedly, referencing known articles on the subject etc., are. Sorting that out ... beyond my knowledge.
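For what it's worth, the counting part is easy to sketch. The Python below assumes you already have the OCR'd text of a document and a hand-made list of technical terms; the threshold of three mentions is an arbitrary guess, and the function names are invented for the example. It does nothing clever about citations or context.

```python
import re
from collections import Counter

def term_counts(text, terms):
    """Count case-insensitive whole-phrase occurrences of each term in text."""
    lowered = text.lower()
    counts = Counter()
    for term in terms:
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        counts[term] = len(re.findall(pattern, lowered))
    return counts

def suggest_categories(text, terms, min_mentions=3):
    """Suggest only the terms mentioned at least `min_mentions` times.

    A term that shows up once (say, in a single worked example) is
    treated as incidental and left out.
    """
    counts = term_counts(text, terms)
    return sorted(term for term, n in counts.items() if n >= min_mentions)
```

A raw count favors long documents, so dividing by document length, or weighting terms that are rare across the whole collection, would probably be the next thing to try.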
I'd suggest you start out with an experiment. Take a "typical" page from the binders, scan it to a non-compressed image at a decent resolution (e.g. TIFF). We usually recommend around 300 dpi for OCR - beyond that you start picking up things that we don't really look for when we're reading.
Test that page against various OCR software and see what each produces as output. Pick the one that gives the best result.
And don't worry - the OCR software is going to be the single most expensive purchase in this equation. I am however more than ready to be proven wrong in that regard.
Re: (Score:2)
scan it to a non-compressed image at a decent resolution (e.g. TIFF)
Why noncompressed? Not all compression is lossy. The G4 lossless compression for binary images in TIFF is excellent, and artifact free since it's lossless.
Personal Document Management (Score:4, Interesting)
I am hoping that someone will make a nice personal document management package as free software.
If you use Windows, you can buy this:
http://www.nuance.com/paperport/ [nuance.com]
The basic features would be:
In a perfect world, the GNOME guys and the KDE guys would both start competing over who can make the slickest product and we all would win.
steveha
Summation (Score:2, Interesting)
Law firms use a program called Summation to do this all the time. They take all the paper docs and electronic docs in a case (sometimes tens of thousands of pages) and load them into this program as TIFFs or PDFs. They are then OCR searchable. Not nearly as good a search algo as something like Google, as it is purely Boolean... but it gets the job done. Not sure about cost, but your university may have a license. An alternative is a program called Concordance, which does the same thing. One last opt
Quick and dirty solution (Score:3, Informative)
Re: (Score:2, Funny)
Maybe you're unfamiliar with three ring binders [keyproductsinc.com].
They're archaic devices used to store non-electronic paper-based documents. You can ask your granddad about them.
I'm beginning to think these kids today don't realize that the desktop metaphor is . . . a metaphor!
-Peter
Re:Quick and dirty solution (Score:4, Funny)
Beagle or Google Desktop (Score:2)
(Let me google that for you)^2 (Score:2, Offtopic)
If only GOOGLE had a way to search your DESKTOP [google.com], that would be perfect.
Papers for Mac OS X and iPhone (Score:3, Insightful)
For Mac OS X, try Papers [mekentosj.com]. There's also an iPhone/iPod Touch version. Mac OS is great at handling PDFs in general.
DevonThink on a Mac (Score:3, Insightful)
If your professor uses a Mac, consider Devonthink by DevonTechnologies.
http://www.devon-technologies.com/products/devonthink/index.html [devon-technologies.com]
For searching, the software has an artificial intelligence system, keywords, meta data. It can store PDFs, word docs, emails, notes. It can be integrated with a scanner so you can scan and store documents in the database. It's got OCR built in...
I have DevonThink (personal edition, not Pro/Office) and I don't even use 1/10 of the power built into this system. You should check out some of the reviews online and videos of people using DevonThink.
Almost recent diss on the subject (Score:2)
Looks like that covers your needs.
CC.
So, what I think you're asking for is... (Score:5, Informative)
...something like this:
1. You want to be able to store documents that currently exist electronically, and also handle documents you're going to scan. The latter may, or may not, be OCR'd.
2. You want to attach keywords to the articles, and be able to bring up a list of articles that match some arbitrary combination of these keywords.
3. Full-text search isn't as important (but would be useful if available).
If that's the case, I'm thinking Alfresco [alfresco.com] might be what you're looking for. Multi-platform, open source, Java-based content repository. Supports document tagging (and loads, loads more). Relatively easy to use right out of the box, and has a CIFS interface so you can just create a project and simply tree-copy your current documents into the project. Don't let the "enterprise" designation on the software scare you away.
I've actually considered going that route for my own personal document library, but while Alfresco might be one of the only good solutions, it's like killing a fly with a cannon.
I'm frankly amazed that with the "paperless living" meme currently going through the productivity circles that someone hasn't come up with a simple tool to do something just like what you're looking for: point it at a root folder, let it suck in all the files, then start tagging away. Search with keywords or filenames or both, and provide a clickable list of hits. Full-text search isn't needed, as there's already a ton of tools out there that'll happily index your hard drive for you.
And, if a tool like that exists, could someone point me to it, please?
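In case nobody can point to one, here is roughly how small such a tool could be. This is a sketch in Python, not an existing program: the class name, the `.tags.json` file kept in the root folder and the list of file suffixes are all assumptions of mine.

```python
import json
from pathlib import Path

class TagStore:
    """Keeps tags for files under a root folder in one JSON file (hypothetical layout)."""

    def __init__(self, root, store_name=".tags.json"):
        self.root = Path(root)
        self.store = self.root / store_name
        # Maps a path relative to root -> sorted list of lowercase tags.
        self.tags = json.loads(self.store.read_text()) if self.store.exists() else {}

    def scan(self, suffixes=(".pdf", ".tif", ".tiff")):
        """Register every matching file under root, untagged if it is new."""
        for path in sorted(self.root.rglob("*")):
            if path.is_file() and path.suffix.lower() in suffixes:
                self.tags.setdefault(path.relative_to(self.root).as_posix(), [])

    def tag(self, rel_path, *tags):
        """Attach one or more tags to a file."""
        current = set(self.tags.setdefault(rel_path, []))
        current.update(t.lower() for t in tags)
        self.tags[rel_path] = sorted(current)

    def save(self):
        self.store.write_text(json.dumps(self.tags, indent=2, sort_keys=True))

    def search(self, tags=(), name_contains=""):
        """Files carrying all `tags` whose relative path contains `name_contains`."""
        wanted = {t.lower() for t in tags}
        needle = name_contains.lower()
        return sorted(
            p for p, have in self.tags.items()
            if wanted <= set(have) and needle in p.lower()
        )
```

Putting a web page or clickable list on top is the real work. The store itself stays a plain text file that survives being copied around along with the documents.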
Re: (Score:2)
There's a little company called Arctus that does something like what you asked for (tagging a directory structure), and some more. It's aimed at pharma/biotech researchers, but from what I've seen of them they'd be happy to customize for you, for a price.
Re: (Score:2)
Thanks. I looked at the web page and went down several levels looking at the features of the document management software. I'm put off somewhat by the fact that they don't list any prices. Instead, you have to contact them and they quote you a price for your bespoke application. I'll keep this in mind, and recommend it as an alternative, but for now I'm going to keep looking.
Re: (Score:2)
That's the enterprise version of Alfresco. Check out the Labs version. Free, open source version.
Suggestion (Score:4, Insightful)
I wrote and maintain a project to do this:
http://sourceforge.net/projects/docdb-v/ [sourceforge.net]
"DocDB is a powerful and flexible collaborative web based document server which maintains a versioned list of documents. Information maintained in the database includes, author(s), title, topic(s), abstract, access restriction information, etc."
It's intended for collaborations, but groups from 5 to 500 use it.
Re: (Score:2)
The description of your project does not suggest either full-text searching, or PDF capabilities. I've been looking for something that can index all of my ebooks and do full-text searches that bring up the pages (or better, chapters) that are most relevant.
Re: (Score:2)
It's agnostic as to the document type (PDF/HTML/Word) whatever and can use plugins (beagle, etc) to do full-text search. Generally we find that proper meta-data to describe the document (title, abstract, topics, keywords) is much more useful than a full text search. But yeah, it won't do what you mention.
Re: (Score:2)
That's probably true for cases when content is added by people who wrote the content or understand it well. My interest is in building a library of topics I DON'T know well though. For instance, I want to have ebooks available for random searches which turn up techniques and algorithms I'm unaware of, yet may prove relevant to my work, as and when I find some particular problem worth researching. It's very difficult to catalog that sort of on-demand knowledge ahead of time. Additionally, I don't care if
Re: (Score:2)
Thanks for the recommendation. I see from the web page that some of the dependencies are Perl and MySQL, and that it can use a web front-end. Apache? I also see it runs on Linux - Debian or Ubuntu Server (or something else?)?
Re: (Score:2)
Right, MySQL, Perl, Apache. Runs on RedHat, Ubuntu, Mac OS X. Anything Unix like.
I, Librarian seems pretty close (Score:2)
If anybody knew of (or planned for) an adaptation to physics (with interfaces to arXiv.org [arxiv.org], the APS journals [aps.org] and ideally other journals), I would be very interested (even as a paying
Integration with NASA ADS (Score:2)
In CERN DS (with certainly a focus on high-energy physics) my papers are shown only up to 2006; so this database appears useless for me.
Talk to your school librarian. (Score:2)
Don't librarians (particularly those in the library science realm) deal with this sort of thing all the time?
Digital Archive software (Score:2, Insightful)
Why Not To (Score:5, Insightful)
There are at least two reasons the professor's method is beneficial:
1. By having to search by hand and scan by eye, he becomes more familiar with more of what's actually in the papers. His familiarity with the material gets better.
2. Repetitive scanning/searching of the papers leads to the mind partially wandering while doing so. This can result in inspiration and intuitive leaps.
Both methods together are preferable. But good luck on getting the professor to use them. You may have better luck getting him to create his own indices or tables of contents on paper to put in the binders. With his familiarity it shouldn't be too difficult.
Re: (Score:2)
Thank you for this post.
I love Google, and I'm only 30 so I'm not an old school engineer or anything, but I still have a printed copy of almost every reference manual I have ever used.
I don't really know why I prefer the dead tree format as much as I do, and I still usually keep a searchable copy available at all times, as there are times when I know what I'm looking for and can jump straight to it.
Your first point sheds some light on my habits actually, perhaps subconsciously I knew this. Perhaps
Zotero, Mendeley (Score:2, Informative)
Zotero is a Firefox extension that can grab research papers directly from journal or library web sites. It organizes the papers in collections, has keywords (they call them tags), and can automatically index the PDFs. The metadata is also stored on a remote server, and you can browse through it using a web interface. You also get Word and OpenOffice plugins to insert citations in the papers you write. The plugins are a little rough around the edges, but are usable. The
Try a server based solution like RefBase (Score:2)
What's better than having a program on your desktop that can search through your files? How about an online database of your files that you can access from any internet-connected computer in the world (www.refbase.net)?
Re: (Score:2)
SIMPLE solution, but not FOSS (Score:2)
1a. Windows Desktop Search with added "IFilters", or
1b. Google Desktop Search
I recommend (1a), amazingly, because once you've located and installed all the third-party IFilters - including one(s) for PDF files - WDS will be able to index and make searchable MANY more files than GDS (in my case, about THREE TIMES as many). If the original PDFs from which so much of the binder material was printed are still available, then your effort with the following is greatly reduced.
2. Good major-manufacturer scanner
The University of Oklahoma participates in JSTOR (Score:2)
The University of Oklahoma participates in JSTOR:
http://www.jstor.org/ [jstor.org]
They also appear to be EBSCO participants:
http://search.ebscohost.com/ [ebscohost.com]
I'm pretty sure "the 20th century" is right there already, if you can drag him to the library.
Note: This isn't going to work for people not affiliated with an institution. Both of these services make paper journal content available online for subscription fees paid by the institution (or business), so unless you are in the "big boy clique", you're not going to have acc
Desktop Search? (Score:2)
The main thing that you need to do is OCR the documents when you scan them in (You can convert non-OCR PDFs into OCR PDFs but I don't know any
I wrote a few articles about this (Score:3, Insightful)
I wrote a few articles about this for Law Office Computing magazine
http://www.nasw.org/users/nbauman/txtsrch.htm [nasw.org]
http://www.nasw.org/users/nbauman/lawdb.htm [nasw.org]
http://www.nasw.org/users/nbauman/discover.htm [nasw.org]
It was a long time ago, and the software and hardware have changed, but the concepts are still the same, and the costs are a lot less.
Free text search works reasonably well with small databases, but it doesn't work with big databases. If you want precision, you have to develop a set of tags (we called them keywords). A good model is Pubmed http://www.ncbi.nlm.nih.gov/sites/entrez?db=pubmed [nih.gov]. The New York Times used to have a great text search, but they changed it (eliminated tags) and now it's awfully difficult to get through.
Basically, the researcher has a body of knowledge, and he already has a filing and organizing system (in this case, a looseleaf binder system, which is a pretty good start). You should usually try to replicate his filing and organizing system in the database, for example one field for the looseleaf, another field for the tab, and then some goodies that he couldn't search the looseleafs for, like date, author, journal citation, etc. It would probably be useful to have a controlled vocabulary of a few good keywords, but keywords should be selected carefully so they're unique and don't duplicate.
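As a rough illustration of that record layout, here is a sketch in Python. The field names follow the ones I listed (looseleaf, tab, author, date, citation); the keyword list is a made-up placeholder, and the professor would have to supply the real controlled vocabulary.

```python
from dataclasses import dataclass, field

# Hypothetical controlled vocabulary; the real one comes from the researcher.
CONTROLLED_KEYWORDS = {"nanotubes", "afm", "lithography", "thin films"}

@dataclass
class Record:
    binder: str      # which looseleaf book
    tab: str         # which tab inside it
    author: str
    year: int
    citation: str
    keywords: list = field(default_factory=list)

    def __post_init__(self):
        # Normalize case and drop duplicates, then reject anything
        # outside the vocabulary so keywords stay unique.
        normalized = sorted({k.strip().lower() for k in self.keywords})
        unknown = [k for k in normalized if k not in CONTROLLED_KEYWORDS]
        if unknown:
            raise ValueError(f"not in controlled vocabulary: {unknown}")
        self.keywords = normalized

def find(records, **criteria):
    """Return records matching every field given; `keyword=` tests membership."""
    keyword = criteria.pop("keyword", None)
    hits = []
    for r in records:
        if any(getattr(r, name) != value for name, value in criteria.items()):
            continue
        if keyword is not None and keyword.lower() not in r.keywords:
            continue
        hits.append(r)
    return hits
```

Refusing unknown keywords at entry time is the whole point of a controlled vocabulary: it is annoying for the person typing, but it stops "AFM", "afm" and "atomic force" from becoming three separate tags.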
I assume he doesn't have the PDFs any more. That would have made it a lot easier.
It would be handy to scan every word of (most) every document into full text, but it may not be necessary. Why do you need everything in full digital text? Scanning of unconventional text takes a human proofreading step, and probably isn't worth it.
He'll probably want to keep complete images of the original documents anyway along with the text. You should do a few tests to see how much resolution you need. 600 dpi works for ordinary text like they use in the printed newspaper. But if you want journal articles to come out, with footnotes and superscripts, you might need higher resolution.
Somebody is going to have to enter the fields manually, which isn't too bad if you've got a thousand records (looseleaf tabs) to enter (about 20 hours), but can get difficult if you've got an order of magnitude more.
Scanning should be straightforward, if everything is neatly filed away in looseleaf books already. There are many cheap consumer-grade scanners on the market that can get 600-2400 dpi (the bundled software is probably more important than the hardware specs), but they can take up to 1 minute a page; there are more expensive scanners, in the $1,000-and-up range, that can go a lot faster. If you're at a university, look around for somebody who already has one. Law firms and libraries do a lot of this.
You might start by estimating the number of pages and documents you have.
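Once you have those counts, the arithmetic is simple enough to put in a few lines of Python. The defaults below are just the rough figures from this comment (up to a minute a page on a cheap scanner, about 20 hours per thousand records, i.e. 72 seconds each); the function name and the example page count are invented.

```python
def estimate_hours(pages, records, pages_per_minute=1.0, seconds_per_record=72.0):
    """Return (scanning_hours, data_entry_hours) for a given workload.

    Defaults assume a slow consumer scanner and manual field entry
    at roughly 20 hours per 1,000 records.
    """
    scan_hours = pages / pages_per_minute / 60
    entry_hours = records * seconds_per_record / 3600
    return scan_hours, entry_hours

# Example: 6,000 pages filed under 1,000 tabs on a cheap scanner.
slow_scan, entry = estimate_hours(6000, 1000)

# Same job at the 40 pages per minute quoted elsewhere in this thread.
fast_scan, _ = estimate_hours(6000, 1000, pages_per_minute=40)
```

If the scanning number comes out in the hundreds of hours, that is a strong hint to borrow a sheet-fed scanner rather than buy a flatbed.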
But let me suggest an alternative: Instead of scanning everything, just enter everything into a database without scanning it. Does he really need full text search? Or would it be enough to search his looseleaf books by a dozen fields? He doesn't have to print the document out from an image file, it's right there in his looseleaf books.
If anybody knows of up-to-date articles on this subject, I'd love to know the citation.
Bookends (Score:2)
I didn't read the article and I'm not sure exactly what you want, but take a look at Bookends from Sonny Software.
Saved my butt, made life easy.
http://www.sonnysoftware.com/ [sonnysoftware.com]
Reference Management and Bibliography Software for Mac OS X
ReadIris Pro (Score:2)
is the best Scan/OCR app you can buy, with many nice features. The first is that it's reasonably priced. The second is that it can import PDF files directly and OCR/index them, and it handles almost every language on the planet. Definitely worth looking into, as the school may actually have a site license, which means you don't have to buy anything.
Hire an Information Architect. (Score:2)
Seriously. This is what we get paid to do. There's far too much to communicate on a forum, and if the SNR here is typical, you'll get awful, unrelated and just plain wrong advice.
If you can't hire, see if your school has a library science program and look for a good intern.
Failing that, read the Polar Bear book (Rosenfeld & Morville, pub: O'Reilly) yourself and follow the threads to resources particular to your problem.
Tactical help: populate the keywords, title, subject properties in your PDFs and Offi
My father's small business had a similar problem (Score:2)
My father runs a small business and has to track a bunch of paperwork for each client, so I got him a cheap LED lit flatbed scanner, but like everyone else, he discovered that it was too slow to manually scan in each page, even if the scanner itself was quite fast.
He eventually figured out that the fastest scanning technique is not to use a scanner at all, but a digital camera. He made a rig with a marked out area the size of an A4 sheet of paper, and then he attached a camera mount so that the camera would
doh! (Score:2)
PHP Bibtex database manager (Score:2)
Link [cnri-bourges.org]
Fabulous question (Score:2)
Too bad the OSS community has no real answers.
This is something I submitted a week or so ago:
"I'm looking for software that can help my company manage information in documents that may be in pdf, doc or web form. I work for a biotech company with 15 people, and we have large numbers of documents that range from very technical scientific publications (usually pdf) to company reports like 10-Ks, to web pages to newspaper articles to pictures. We use these documents to review and stay current with the scientifi
Perhaps you are part of the problem. (Score:2)
I realize that Slashdot is going to take the technology solution as the only one, and in this case it's probably the right way to go, but ...
People have managed documents and information like this for centuries and it worked rather well, perhaps you should stop being lazy and learn how to use traditional reference materials as you're going to need this skill for a few more years anyway.
Those skills are still useful today. Just because Google can index and allow you to find words in the documents it knows a
Do it the easy & obvious way (Score:2)
Good grief, forget your FOSS ideology, scan them to PDF, OCR them using Acrobat Pro (the education price is ridiculously cheap), store them on a Mac, and use the built-in Spotlight to search them.
Re: (Score:3, Insightful)
Copying excerpts for educational use is actually an explicitly protected fair use case. The copyright act actually uses it as an example if I remember correctly.
The parent said he is copying parts of texts, not entire books.
Re: (Score:2)
This is not a classroom setting, this is a research setting. Very different.
Though it may be covered under other criteria of fair use, the educational purposes exemption from copyright does not apply.
Re: (Score:3, Interesting)
According to Cornell, limitations on copyright holders are as follows. Note that research and teaching are both explicitly stated cases of fair use exemption.
107
Permits the "fair use" of an owner's work without permission, for the purpose of "criticism, comment, news reporting, teaching, scholarship, or research." This exemption outlines four factors that must be met in order to argue a fair use.
108
Permits a library or archives to reproduce works for archiving purpose
Re: (Score:2)
Success = Copyright Problems (Score:2)
Re: (Score:2)
You're right, of course, that such a system could run into problems if its use became more widespread. It seems like one option is to keep the content restricted--can't just add in any electronic resource without a thorough understanding of its copyright terms--like certain linux distributions only including unencumbered code, or (in theory) Wikipedia only including unencumbered images.
Or, go to the effort of keeping track of the copyright terms and encumbrances. This, obviously, is way beyond the scop
Re: (Score:2)
Re: (Score:2)
I know very well what century it is. I was using exaggeration to emphasize how far the professor needs to come to be effective in the current technological environment.
Re: (Score:2)
No wp, I meant I'm trying to get him to use technology that was available in the '90s. If I can do this, there is hope for progress into *this* century.