Large-Scale Paper-To-Digital Conversion?
An anonymous reader writes "I've just been asked to digitize several dozen sets of lecture outlines at the university where I work. Basically, professors want to hand me a big (often 100+ page) stack of their handwritten lecture notes (with messy text, equations, and diagrams; sometimes double-sided) and expect me to post a PDF-or-something-similar to their course's web page. However, every desktop scanner I've ever used takes 1-2 minutes of user-attention per page and the resulting files end up Huge, impossible-to-read, or both. All I have at my disposal is my PowerBook, Acrobat, a couple hundred dollars of department funds for a new scanner (this maybe?), and, if I ask nicely, overnight use of the secretary's Win2k box. Any ideas? Sheet-fed scanner recommendations? Better file formats than PDF (or better PDF settings)? Do any of you students have usability advice?"
Format (Score:3, Interesting)
Are there any web-based packages for searching documents, based on OCR-extracted keywords? Obviously with messy hand-written notes, formulas, etc, OCR won't work reliably. For a similar project, I'd like to OCR the files and use the text data solely for keyword searching. Obviously not perfect, but better than just images.
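For what it's worth, the keyword-search idea needs very little code once you have some OCR text per page. Here is a minimal sketch in Python, assuming the OCR output is already available as plain text keyed by page number. The function names are made up for illustration, and nothing here tries to correct OCR mistakes.

```python
import re
from collections import defaultdict


def build_keyword_index(pages):
    """Map each lowercase word to the sorted list of pages containing it.

    `pages` is a dict of page number -> OCR text for that page.
    """
    index = defaultdict(set)
    for page_number, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(page_number)
    return {word: sorted(page_set) for word, page_set in index.items()}


def search(index, query):
    """Return the pages that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set(index.get(words[0], []))
    for word in words[1:]:
        hits &= set(index.get(word, []))
    return sorted(hits)
```

A search result is just a list of page numbers, so the reader still looks at the scanned image; the imperfect OCR text is only used to find the page.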
PNG is your friend....
Fax machine (Score:5, Interesting)
Re:Get stuffed (Score:5, Interesting)
I've had a similar job, where our school's lecturers wanted their notes in the same style so one of my jobs as admin assistant was retyping chapters from textbooks & inserting the original illustrations. That didn't start out too bad until lecturers started basing course notes on entire quarters of books, expecting them to be retyped completely in their own style. Give an inch they'll try to take a mile - use the few hundred $$ to get it professionally scanned.
Recruit the community (Score:5, Interesting)
Get several (dozen) other students to use their own equipment and time in exchange for a copy/copies of the completed work.
I would hazard a guess that there are more than a few people who would like to have a copy of the complete series of the lecture outlines.
Easy (Score:5, Interesting)
a) Publication quality DVI/PS/PDF files
b) The student can deepen their knowledge of the topic
Everyone happy. Used to work like this at the university I went to. And you may even be lucky enough that some student has already typed these notes in for himself.
DjVu format is pretty good for scanned docs. (Score:3, Interesting)
There is an open-source project http://djvu.sourceforge.net/ that provides code for reading DjVu docs, but I have no idea where to get a DjVu encoder.
Re:Get stuffed (Score:2, Interesting)
Isn't that copyright infringement?
Unless, of course, they wrote the textbooks.
Volunteers? (Score:1, Interesting)
now, at this point, you'll likely start wishing that you live in Canada (if you don't already).
The key is in volunteers, to bastardize "1984". Get a number of fairly intelligent high school kids that haven't done their 40 hours of community service (a graduation requirement).
Now, make them look at the originals, the scanned, and correct all the discrepancies
bonus: if the kids are the nerdy types, tell them that they're learning university material for free.
they could start paying you!
Do you work at Kent State? (Score:2, Interesting)
A bit of opinion on the project. This is not a good idea. It's one more tool that students will rely on to memorize information instead of taking time to THINK about their subjects and really LEARN the material.
Re:Simple. (Score:1, Interesting)
Re:Get stuffed (Score:3, Interesting)
Granted if it was something reasonable and I could do it without shifting gears (mentally) I would usually slipstream it into the work I was doing and not write it up. If it was redoing work that I had already done, or worse if I recommended doing it one way and they mandated I did it some other way and after I was knee deep in it decide to go yet another direction or even in the direction I originally suggested
Works for me (Score:5, Interesting)
I tend to scan lots of documents and set up a simple Perl script that uses the 'scanimage' command line tool to do the scanning. Using my Epson Perfection 1650 scanner (pretty standard flatbed scanner) I can scan an 8"x10" page in black & white mode in about 10 seconds.
I actually added a button to the Nautilus GUI shell so I can move to the directory I want and hit the button to scan a page to that directory. Very convenient.
I scan to tiff and then use the convert utility (part of imagemagick) to convert to png. The resulting files typically run about 100K to 200K depending on the content.
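For anyone who'd rather see the shape of it, the scan-then-convert steps above could be sketched like this in Python. This is not the Perl script itself, just an illustration. The `--mode` and `--resolution` options depend on the scanner backend, so the values used here (Lineart, 300) are assumptions, as are the file names. By default the sketch only builds the commands and does not run them.

```python
import subprocess
from pathlib import Path


def scan_command(resolution_dpi=300, mode="Lineart"):
    """Build a scanimage command line that writes TIFF data to stdout.

    --mode and --resolution are backend-specific options; the values here
    are assumptions and may need changing for a given scanner.
    """
    return [
        "scanimage",
        "--format=tiff",
        "--mode", mode,
        "--resolution", str(resolution_dpi),
    ]


def convert_command(tiff_path, png_path):
    """Build an ImageMagick command that converts a TIFF file to PNG."""
    return ["convert", str(tiff_path), str(png_path)]


def scan_page(out_dir, page_number, dry_run=True):
    """Scan one page to page-NNN.tiff, then convert it to page-NNN.png.

    With dry_run=True the commands are only returned, not executed.
    """
    out_dir = Path(out_dir)
    tiff_path = out_dir / f"page-{page_number:03d}.tiff"
    png_path = tiff_path.with_suffix(".png")
    commands = [scan_command(), convert_command(tiff_path, png_path)]
    if not dry_run:
        out_dir.mkdir(parents=True, exist_ok=True)
        with open(tiff_path, "wb") as tiff_file:
            subprocess.run(commands[0], stdout=tiff_file, check=True)
        subprocess.run(commands[1], check=True)
    return commands
```

Calling `scan_page("notes", 1, dry_run=False)` would actually drive the scanner, assuming SANE and ImageMagick are installed and the scanner accepts those options.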
If anyone's interested in seeing the perl script I've posted it to: www.ollies.net/scanscript.html [freecache.org]
Steve
Re:If you're being 'asked' (Score:1, Interesting)
General principles for document imaging (Score:2, Interesting)
The quality of the scanning is obviously important; get or borrow the best scanner you can. The point made about putting a black backing onto a flatbed scanner is important. Also important is adjusting the scanner settings so that you get minimum noise (random black dots) without degrading the stuff you want to keep.
For this sort of thing you almost certainly want to do it bi-level/B&W/one bit deep (hopefully there are no shaded pictures, but you can use screening for those), and to my knowledge nothing has been developed that compresses these images better than CCITT Group IV (fax machines use Group III). You almost certainly don't want to use grey-scale, at least not for your final images.
You should see if you can find some post-processing software; we used to use ScanFix, which would straighten the image (which makes Group IV compression a lot better) and depending on settings clean it up as well. You also need to decide upon the size of the final images; you want to scan at 200 to 300 or even 400DPI, but you don't have to have final versions at those high resolutions.
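To put rough numbers on the bit-depth and resolution choices, here is a small Python sketch that computes the raw, uncompressed size of one page. It assumes a US Letter page (8.5 x 11 inches) and rows padded to whole bytes; both are my assumptions for illustration. These are sizes before any compression, so they say nothing about what Group IV will actually achieve on a given page.

```python
import math


def raw_image_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size of a scanned page, with rows padded to whole bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    bytes_per_row = math.ceil(width_px * bits_per_pixel / 8)
    return bytes_per_row * height_px


if __name__ == "__main__":
    for dpi in (200, 300, 400):
        bilevel = raw_image_bytes(8.5, 11, dpi, 1)
        grey = raw_image_bytes(8.5, 11, dpi, 8)
        print(f"{dpi} DPI: bi-level {bilevel:,} bytes, 8-bit grey {grey:,} bytes")
```

The 8-bit greyscale page starts out roughly eight times larger than the bi-level one at the same resolution, which is part of why one-bit scanning is the usual starting point for text.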
The standard used to be TIFF images with Group IV compression, but not every image viewer can read them, or display them well (esp. if the image needs resizing, and I doubt you can assume everyone reading these has their monitor at a high resolution).
If PDF will accept and display images compressed with Group IV compression, you're probably best off with that, since Acrobat Reader is ubiquitous and fairly easy to use.
PNG is a nice format that I use by preference for > 1 bit deep images, but a quick check of some PNG documentation says that Group IV "often" compresses a lot better than 1 bit "greyscale" PNG; it was simply not designed for document imaging. You also want to avoid JPEG: it's a lossy (will introduce artifacts) system that also wasn't designed for bi-level images.
Hope this helps.
ADF Scanners-Drive a Tank. (Score:1, Interesting)
Anyway I run the output through this [sourceforge.net] and a bit of OCR (doesn't have to be perfect), and store it in a Database.
Re:Xerox Scanner doesn't do OCR (Score:2, Interesting)
It obviously wouldn't be so convenient if he had to go to Kinkos, but they might have it set up on one of their machines. (Yeah, I doubt it, too.)
Re:Some photocopiers support this (Score:3, Interesting)
OR (Score:4, Interesting)
JBIG2 inside PDF (Score:3, Interesting)
So, I recommend scanning to TIFF (or TIFF inside PDF). Even if you don't currently have the encoding software, you can convert to JBIG2 compression later as it becomes more and more ubiquitous in the future.
And definitely use an automated document feeder of some sort to keep from going crazy. Newer Xerox machines work pretty well for this (I use a DocumentCentre 440ST for this all the time) unless you have hundreds of thousands of pages to deal with, in which case you should either invest in industrial scanning equipment or outsource to a scanning center that does.
Kinkos isn't worth it (probably). (Score:3, Interesting)
As other people pointed out, if you can get a couple of departments in on this, then you can more easily amortize the costs of really good equipment to do this...
One thing that I'll note is that I don't really like PDFs for this sort of stuff. If you really have a 100 page article, you're going to be looking at a 3 meg file and, perhaps, a 30 second startup time... That's fine for someone who's going to read the document from cover to cover, or print it... On the other hand, it's a pain if you only want to look at pages 37 and 38.
GrokLaw gets PDFs of court filings regularly, and I got so fed up with PDFs that I created a (semi-automated) batch system to split up the PDFs into separate PNG images and create a simple index.
You can see a sample here [bcgreen.com]. Far easier to view a page or two there (IMNSHO) -- but not as easy if you just want to download and print it.
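The "simple index" part of a setup like that is the easy bit. As a rough Python sketch (not the batch system mentioned above), assuming the pages have already been split into one image file each, something like this writes out an HTML page with one link per page. The function name and file names are hypothetical.

```python
from html import escape


def build_index_html(title, page_files):
    """Return a simple HTML index linking to one image file per page."""
    lines = [
        "<html>",
        f"<head><title>{escape(title)}</title></head>",
        "<body>",
        f"<h1>{escape(title)}</h1>",
        "<ol>",
    ]
    for number, filename in enumerate(page_files, start=1):
        href = escape(filename, quote=True)
        lines.append(f'<li><a href="{href}">Page {number}</a></li>')
    lines += ["</ol>", "</body>", "</html>"]
    return "\n".join(lines)
```

Writing the returned string to `index.html` next to the images gives readers a way to jump straight to pages 37 and 38 without downloading the whole document.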
Before you go too far, you might want to get a good handle on how people are likely to use what you produce -- Use that knowledge to decide just how you want to organize the result. You may want to make it available in two (or more) different formats. It's not that difficult to bulk convert things between different forms (at least, not if you can dual boot into Linux, or have OS X).
LyX for LaTeX!! (Score:3, Interesting)
Use the fairly user-friendly LyX [lyx.org] to do the LaTeX-ing.
Heck, get the academics themselves using it to prepare their notes in the first place!
They might actually thank you for introducing them to this convenient and easy document processor.
ScanSnap (Score:2, Interesting)
The specified resolution is for color documents. For a b/w one, you will get a better resolution. You can obtain scan samples from a Japanese page [fujitsu.com] (pdf files at the bottom).
Actually, a newer model, fi-5110EOX, is already available in Japan, and I think that is why they are offering a rebate now. The new model has a USB 2.0 connection and a higher resolution mode (excellent) that is not possible with fi-4110.
Distributed proofreading? (Score:2, Interesting)
The students who are taking these classes could easily be a source of tappable work hours.
See Project Gutenberg's proofreading site for an example of this type of effort. http://www.pgdp.net/c/default.php
If you could get the professors to offer a little bit of extra credit for proofreading or converting a page, the task could be much easier for you.
Envision this: You use an ADF to scan an entire stack of notes in order, but you don't worry about how the scanning goes on each page. Then you xerox the whole stack and place the copies in a binder in someone's office. The students are then offered 10 points extra credit per page translated from
The points are justified since the student is in the class and learning something by carefully duplicating, analyzing, correcting, and studying the professor's notes for that class. (Can you imagine a more likely way to end up accidentally committing three pages of facts to memory?)
You can place the
If the file isn't legible, the student can check the xeroxed copy out from the binder. Since it's just a copy, you don't need to worry about losing it.
You could skip the scanning altogether, and ask the students to return any pages they don't finish translating.
Obviously this works best for large classes where the student:pages ratio is large.
Make sure you number pages if you do anything like this.
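Numbered pages also make the bookkeeping easy to automate. As a rough sketch in Python, assuming each page goes to one student at a time and earns the 10 points per page suggested above, a small ledger could track what is available, checked out, or done. The class and method names are made up for illustration.

```python
class PageLedger:
    """Track which numbered pages are available, checked out, or done."""

    def __init__(self, page_count):
        self.status = {page: "available" for page in range(1, page_count + 1)}
        self.holder = {}

    def check_out(self, page, student):
        """Hand an available page to a student."""
        if self.status[page] != "available":
            raise ValueError(f"page {page} is not available")
        self.status[page] = "checked out"
        self.holder[page] = student

    def mark_done(self, page):
        """Record that the student holding the page finished it."""
        if self.status[page] != "checked out":
            raise ValueError(f"page {page} is not checked out")
        self.status[page] = "done"

    def return_unfinished(self, page):
        """Put an unfinished page back so someone else can take it."""
        if self.status[page] != "checked out":
            raise ValueError(f"page {page} is not checked out")
        self.status[page] = "available"
        del self.holder[page]

    def credit(self, points_per_page=10):
        """Total extra-credit points per student for finished pages."""
        totals = {}
        for page, student in self.holder.items():
            if self.status[page] == "done":
                totals[student] = totals.get(student, 0) + points_per_page
        return totals
```

A spreadsheet or a sign-out sheet taped to the binder would do the same job; the point is only that every page has a number and a known state.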