Large-Scale Paper-To-Digital Conversion? 459
An anonymous reader writes "I've just been asked to digitize several dozen sets of lecture outlines at the university where I work. Basically, professors want to hand me a big (often 100+ page) stack of their handwritten lecture notes (with messy text, equations, and diagrams; sometimes double-sided) and expect me to post a PDF-or-something-similar to their course's web page. However, every desktop scanner I've ever used takes 1-2 minutes of user-attention per page and the resulting files end up Huge, impossible-to-read, or both. All I have at my disposal is my PowerBook, Acrobat, a couple hundred dollars of department funds for a new scanner (this maybe?), and, if I ask nicely, overnight use of the secretary's Win2k box. Any ideas? Sheet-fed scanner recommendations? Better file formats than PDF (or better PDF settings)? Do any of you students have usability advice?"
Kinkos? (Score:5, Informative)
High Speed Scanner (Score:2, Informative)
HP Copiers (Score:3, Informative)
HP Digital Sender (Score:5, Informative)
ADF Scanners (Score:5, Informative)
The scanner has four different pieces of software you can choose to use, I'd suggest Precision Scan Pro as that makes multi-document scanning easier.
HP Scanjet 5550c is not what you want (Score:5, Informative)
Our Engineering Society was trying to put up an exam archive with one of them and quickly gave up and started scanning with the flatbed.
Also the scanner has no SANE support (one of the few HP scanners that doesn't).
DjVu (Score:4, Informative)
Re:Kinkos? (Score:5, Informative)
Our teachers use them for exactly the purpose described. If you don't have one of these type machines around anywhere, then definitely give Kinkos or some similar establishment a try.
Re:DjVu (Score:5, Informative)
fax (Score:1, Informative)
Use the department funds to sign up for an account at interpage.net [interpage.net], which will allow you to fax stuff off to yourself and receive it as an email attachment. Then use the fax machine in the office to run everything through.
That takes care of the scanning part; cataloging, organizing, etc. will take a lot more time.
You may be able to persuade some professors to fax you the stuff themselves, saving you a bit of time.
Re:HP Copiers (Score:4, Informative)
Worst case, you can get an HP scanner and the automatic document feeder for it. If this is going to happen a lot it should be pretty easy to justify the $500 or so for the scanner, ADF, and a copy of Acrobat.
Re:Kinkos? (Score:1, Informative)
Re:HP Digital Sender (Score:4, Informative)
I've found the Canon Canoscan flatbeds do a good job of automatically scanning straight to PDF; only minimal user intervention (hit "enter") is required. There's a special mode for scanning text which enhances contrast, so messy notes and diagrams should be fine, too. The resulting PDFs are also remarkably small for what is essentially a huge bitmap. I have a Canon Canoscan 8000F myself; it's very fast and can do higher DPI than most people need, and although it might be a bit out of his price range, I'm sure the cheaper models can do the same job nearly as well.
Try making GIFs (Score:2, Informative)
Also, if you're scanning material with copy on both sides, you might get some visible bleed-through. Try scanning such pages with a sheet of black paper between the page and the lid of the scanner, then adjust contrast to ensure white whites and black blacks.
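The "white whites and black blacks" adjustment amounts to a levels stretch. Here is a minimal sketch on a grayscale page held as nested lists of 0-255 values; the function name and the black/white points are made up for illustration, and real scanning software will expose this as a contrast or levels control instead.

```python
def stretch_levels(pixels, black_point, white_point):
    """Map black_point..white_point onto 0..255, clipping values outside it."""
    if white_point <= black_point:
        raise ValueError("white_point must be greater than black_point")
    span = white_point - black_point
    out = []
    for row in pixels:
        new_row = []
        for value in row:
            scaled = (value - black_point) * 255 // span
            new_row.append(max(0, min(255, scaled)))  # clip to the valid range
        out.append(new_row)
    return out


# Hypothetical sample: paper around 230, faint bleed-through at 205,
# pen strokes around 55-60, one mid-gray pencil mark at 140.
page = [
    [230, 205, 60],
    [228, 140, 55],
]
cleaned = stretch_levels(page, black_point=64, white_point=200)
print(cleaned)
```

Anything lighter than the white point, including the faint bleed-through, ends up pure white, and anything darker than the black point ends up pure black.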
No good answers AFAIK (Score:5, Informative)
Electronic test-equipment manuals are pretty much worst-case candidates for scanning. In Tek's case, the schematic volumes often consist of hundreds of double-sided, nonstandard-sized foldout sheets (11x23" for example) with lots of fine detail that must be reproduced clearly. You can either scan the pages in segments and leave it to the reader to reassemble them, or you can take the manuals to Kinko's and have the foldout pages shrunk to 11x17" or 8.5x11" for scanning. Either way, it's a real hassle, and highlights a clear need for a "prosumer" duplex sheet-feed scanner solution.
A few years ago you could buy scanners like this one [ebay.com] that could handle arbitrary sheet sizes, but I haven't seen them in stores lately. These may be easier to use than flatbed scanners, assuming the precision they offer is sufficient for your application. I don't know how well they'd work on densely-printed schematics.
Other than bitching about the state of the scanner marketplace, I don't have much to suggest. There are a few hints that will improve the quality and usability of your final document:
Large Scale Paper to Digital Conversion (Score:4, Informative)
The problem is the mixture of graphics, equations, and text.
It's easy enough to turn a page of text into a smallish file. Get a good automatic-feed scanner ($3500 or so) and a copy of ABBYY OCR software. If the original isn't too speckly, tiny, or smudged, ABBYY will give you a 95% accurate text you can then correct. Best format to save in? Depends on what the school is going to do with the files. If they're to be posted on web sites, perhaps XHTML. If it's just for preservation, plain text (if there are no Greek characters) or XML with UTF-8.
Equations -- well, there's supposedly a version of XML for math, but Distributed Proofreaders has ended up using TeX, as it seems to be the mathematical standard. While this would work for preservation, it wouldn't work for a web site.
For a web site, perhaps the best way would be to intersperse text with pngs of the equations and graphics. The pngs would still take a lot more space than text, but the files would be smaller than PDF versions of the whole page.
Re:DjVu (Score:3, Informative)
I haven't tried tic98 (mentioned lower in this thread) but I can vouch for DjVu. I routinely scan notices, bills and whatnot mailed to me, then destroy them (rather than maintain a large paper file)
300DPI Black & White scans take about 19kb. They are quite readable, and with 300DPI information, make pretty good printouts.
A Fujitsu scanner, SANE and Quartz Python bindings (Score:5, Informative)
Unfortunately you don't have much use for something like Acrobat Capture because you have handwritten notes to deal with. To process the files, SANE [sane-project.org] and/or TWAIN interfaces are reasonably easy to write code for. The cool thing about SANE is that you can run the saned daemon on any Mac or Linux box, and with a couple of lines of config file changes, it's instantly available over the network from any Mac, Windows, or Unix box (there are TWAIN bridges for Mac [ellert.se]/Windows [ozuzo.net] so it even shows up in Photoshop and so forth); there are also standalone GUI clients like XSane [xsane.org].
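As a rough sketch of what "writing code for SANE" can look like without touching the C API, the snippet below only builds a command line for SANE's scanimage tool in batch mode. The helper name, the device string, the "Lineart" mode and the "ADF" source are assumptions for illustration; mode and source names differ between backends, so check `scanimage --help` with your own device selected before relying on them.

```python
import shlex


def build_scanimage_command(device, resolution=300, mode="Lineart",
                            source=None, pattern="page-%03d.pnm"):
    """Return an argv list for a batch scan with SANE's scanimage tool."""
    cmd = [
        "scanimage",
        "--device-name", device,
        "--format=pnm",
        "--batch=" + pattern,          # one numbered file per scanned sheet
        "--resolution", str(resolution),
        "--mode", mode,                # backend-specific name (assumption)
    ]
    if source is not None:
        cmd += ["--source", source]    # e.g. the document feeder (assumption)
    return cmd


# Hypothetical device name for a scanner shared by saned on another machine.
cmd = build_scanimage_command("net:scanbox.example.edu:fujitsu", source="ADF")
print(shlex.join(cmd))
# To actually scan: import subprocess; subprocess.run(cmd, check=True)
```

Keeping the command construction separate from the call that runs it makes it easy to print and eyeball the command before pointing it at a real feeder.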
I wrote a document management system in Python/wxWidgets (for Windows) in about a month part-time, and it works very well. Either on Mac or Windows, PDF makes sense because of the ubiquity of the viewers, even if you lose a bit in compression compared to more optimized formats such as DjVu. On Windows you can easily embed the Acrobat ActiveX control; on Mac OS X you have native PDF support, Panther's Preview kicks ass, and there are several open-source PDF browsing components such as the ones out of TeXShop or Glen Low's Graphviz port [pixelglow.com] you can embed in your own app.
Given a choice I would probably pick the Mac to do this project, because of the wonderful Quartz/CoreGraphics Python bindings. You can just draw right to PDF, and place PDF files as if they were images; for example, here's a short script to rotate a bunch of PDF files (sorry, Slashdot destroys Python indentation):
You could also use ReportLab [reportlab.org], but because a lot of the PDF processing code is written in Python it's somewhat slower and memory-hogging for high-volume use. (I used ReportLab on Windows for the above project, and use CoreGraphics Python bindings for my research, so I do know what I'm talking about, mostly.)
Re:HP Copiers (Score:2, Informative)
My dad's office (Score:5, Informative)
He has a couple of high-speed scanners from Panasonic. They cost less than a thousand dollars ($400-500, if I remember correctly), they scan at about 20 ppm, and the software that came with them will save each scanned group of pages as a separate document (PDF, TIF, whatever). My dad uses this setup to scan all of the files that his cases generate (shrinking his document storage from about 1000 sq ft to 2 shelves in a bookcase). We are talking files that consist of 10,000+ pages, and normally he saves a year's worth of cases on 3-4 CDs. They can scan up to 500 pages at a time.
Here is a link:
High Speed Scanners [panasonic.co.jp]
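For a sense of what 20 ppm buys compared with the 1-2 minutes per page in the original question, here is a back-of-the-envelope sketch. The job size (36 sets of 100 pages) is my own assumption standing in for "several dozen sets" of "100+ pages"; plug in your real numbers.

```python
def hours_to_scan(pages, pages_per_minute):
    """Feed time only; ignores loading, jams and file handling."""
    if pages_per_minute <= 0:
        raise ValueError("pages_per_minute must be positive")
    return pages / pages_per_minute / 60


total_pages = 36 * 100            # assumed job size, not from the question
flatbed_ppm = 1 / 1.5             # 1.5 minutes of attention per page
sheetfed_ppm = 20                 # the rate quoted above

print(f"flatbed:   {hours_to_scan(total_pages, flatbed_ppm):.0f} hours")
print(f"sheet-fed: {hours_to_scan(total_pages, sheetfed_ppm):.0f} hours")
```

Under those assumptions the flatbed needs about 90 hours of hands-on time against roughly 3 hours of feeding for the 20 ppm unit.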
Re:Format (Score:5, Informative)
The program I use to convert to TIF is IrfanView (http://www.irfanview.com/), a generally excellent image viewer. It's free, too, so no worries there. It offers a ton of options for compression settings for different formats, so you can try other file formats as needed.
Re:High Speed Scanner (Score:3, Informative)
I run PaperPort to store all of my bills, documents, etc. The HP scanner software simply will not use the resolutions and options I want PaperPort to use (200 DPI, B&W).
When using the sheet feeder, the damn thing always scans in 24-bit at 200 DPI no matter what I try to set as a default; then I have to manually convert every page.
Go with a different model.
N.
HP9200C? (Score:3, Informative)
Cheap? Dunno. It was just there. In any sort of volume though, the cost drops precipitously (cheaper than you doing it on a flatbed scanner!).
Check out something like that (or indeed that) used, use it, resell it. Or new, then use/resell. Or get the school to buy it.
If this is a continuous thing, then all the better to own.
Some photocopiers support this (Score:5, Informative)
Re:Format (Score:3, Informative)
I run a micro-publishing business which often involves scanning a lot of B&W images at high resolution. I'll agree that storing files as TIFFs makes them much easier to edit, though. Our final publishing happens as PDFs, though, and it does not bloat the size of the images.
Re:DjVu format is pretty good for scanned docs. (Score:2, Informative)
Re:ADF Scanners (Score:3, Informative)
PDF settings (Score:2, Informative)
If the lecture notes you're scanning don't contain any grayscale or color graphics, your best bet is to scan in black-and-white mode (as opposed to color or grayscale) for smallest file size. I'd suggest scanning at 300 DPI for sharp-looking printouts. Be sure to play around with the "threshold" value (or equivalent) in your scanning software until you figure out what looks best. If it's not set to a good level, text may look too thick and blocky, or thin lines might disappear completely.
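To see why the threshold matters, here is a toy sketch with grayscale values in plain lists (0 is black, 255 is white). The sample values and function name are invented for illustration; your scanner software applies the same idea internally.

```python
def to_bilevel(gray, threshold):
    """Return rows of 0 (white) / 1 (black); values below threshold count as ink."""
    return [[1 if value < threshold else 0 for value in row] for row in gray]


sample = [
    [250, 200, 120, 40, 120, 200, 250],   # a pen stroke with soft edges
    [250, 245, 240, 170, 240, 245, 250],  # a faint, thin pencil line
]

for threshold in (100, 180, 230):
    print(threshold, to_bilevel(sample, threshold))
```

At the lowest threshold the pencil line vanishes entirely, while at the highest the pen stroke grows from one pixel to five wide, which is the "thick and blocky" effect.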
Once you have a monochrome scan, you'll want to save in a lossless compression format that preserves the monochrome attribute of the image, such as compressed TIF, and not as JPEG. When exporting to PDF, you could experiment with both ZIP and fax (CCITT group 3/4) compression types -- both compress black-and-white images very well. If your PDF software doesn't have those options, the default should probably be good enough. Even at 300 DPI, most pages should fit into about 30K or so.
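A quick sanity check on the 30K figure: the sketch below works out the uncompressed size of a 1-bit page and the compression ratio that 30K implies. US Letter paper (8.5 x 11 inches) is my assumption; the helper name is made up.

```python
def raw_bilevel_bytes(width_in, height_in, dpi):
    """Uncompressed size of a 1-bit-per-pixel page, rows padded to whole bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    row_bytes = (width_px + 7) // 8
    return row_bytes * height_px


raw = raw_bilevel_bytes(8.5, 11, 300)       # assumed US Letter at 300 DPI
implied_ratio = raw / (30 * 1024)           # against a 30K compressed page
print(raw, round(implied_ratio, 1))
```

So a 30K page means the fax-style compressor is squeezing a roughly 1 MB bitmap by a factor in the mid-thirties, which is plausible for pages that are mostly white.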
University doesn't already have this service? (Score:3, Informative)
It takes away the cost of printing lectures/notes/required readings from the departments and tacks it onto the students who now seem to pay for printing above a certain limit in the labs.
At least this is the way at the universities I have worked at.
Re:Get stuffed (Score:2, Informative)
Truly, ignorance is bliss. This is clearly written by someone who has seen a DocuTech only from a distance.
We have three of them where I work, and I have worked very very closely with them on a number of projects. Sure, they scan quickly, but you can't get the data out of them. They are copiers, essentially.
Re:DjVu (Score:2, Informative)
DjVu is licensed from AT&T labs, and has both a commercial component and an open source [sourceforge.net] component called DjVuLibre. The technology works by analyzing documents, particularly scanned color documents, for hard edges. Hard edges typically indicate text, while smooth, continuous tones indicate background images. DjVu then "segments" the two types of imagery on the page into different layers and compresses them using different formats for optimal compression and quality.
Okay, enough marketing. While it does have some warts, it's a pretty cool technology to work with. That, of course, and I'm happy to have any job these days.
Re:Kinkos? (Score:4, Informative)
If he's at a large university, some other department might have one of these. Xerox doesn't charge for scans when you lease the machine. They only charge for how many prints or copies are made, so it would be essentially free for another department to allow him use their machine. It doesn't even require any additional setup, since you can enter any email address into the machine and have it send the document there directly from the copier. (assuming the SMTP server has been set on the machine)
PDF Settings (Score:3, Informative)
Then you should scan in grey-scale or, if there is high enough contrast (pen notes, not pencil), in black and white. Grey-scale with JPEG compression at a medium or even low setting is going to be much smaller than the defaults. A pure black and white scan with Group 4 compression will be even better. At work we scan pages at 300 DPI that way and get 20 to 30 KB files (I think; I haven't done it for a while).
Also typically images for web viewing of even text are scanned at 72 dpi (all the scholarly journals at my university). This can make things hard to read but really shrinks the file (about 1/16th the size of 300 dpi).
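The "about 1/16th" figure follows from pixel count scaling with the square of the resolution. A small check, assuming file size tracks pixel count (it only roughly does once compression is involved):

```python
def pixel_ratio(low_dpi, high_dpi):
    """Fraction of pixels left when dropping from high_dpi to low_dpi."""
    return (low_dpi / high_dpi) ** 2


print(pixel_ratio(72, 300))   # a bit under 1/17
print(pixel_ratio(75, 300))   # exactly 1/16
```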
Also if the scanner is set low res pure black and white it will scan a lot faster, but still be pretty slow.
The other option is to pay someone to do it. If you have all of the stuff ready at once and give the pros a week or so to do it when they aren't busy you can probably get as low as 50 cents a page.
Blah blah, I lost my train of thought 2 paragraphs ago
docutech is the way to go... (Score:4, Informative)
On a side note, if the professors are using a lot of additional material which might include handwritten information, you might consider encouraging them to transcribe that material (hopefully you're not the TA that has to do the transcription) into a digital form, be it text or Word. This'll definitely help in reducing the size of your files.
Also consider looking into Adobe's PDF service [adobe.com] if you're overwhelmed with just organizing the material itself. Probably not so kosher to suggest it on /., but it could be something the school already has an agreement with Adobe for (taking into account the units of Acrobat the school itself might be using). I know it's not rolling your own, but sometimes using an "out of the box" solution to get things up and running so you can explore other solutions has its merit as well...
Re:Get stuffed (Score:2, Informative)
Re:No good answers AFAIK (Score:4, Informative)
This camera can be controlled programmatically. Automation would be needed to make it practical on a large scale, but it is much quicker than most flatbed scanners and the quality would be okay for handwritten notes. It would be easy to take multiple overlapping pictures and leave it to software to reassemble the images.
(Yes, it is a goofy solution, but it works well for me as I normally have my camera handy.)
Re:Kinkos? (Score:2, Informative)
They can usually do up to tabloid size sheets (11"x17") through the feeder, and expect a turn-around time of 24 hours or so. They can also do OCR scans on the resulting files. The OCR conversion runs around US$9.95 for the first page, and US$2.50-ish for additional pages, and they won't correct any of the errors that result from the software's shortcomings unless you pay extra. It hurts, but if you gotta have it...
Good luck!
Re:Kinkos? (Score:2, Informative)
And, use a low resolution setting (say 100dpi) for handwritten documents. It will do just fine. Pdf (depending on the driver though) compresses the image. If you use a machine something like the Konica, try to set the threshold/brightness to a level such that the empty portion of the pages will appear as plain white; this will increase the compression ratio significantly.
So, my recommendation is to try to find a Kinkos which has this type of machine. If you can't find one, just tell the professors that it is simply not a reasonable task that can be done in finite time.
Depends how good you want to do it (Score:4, Informative)
If a digitized copy of the manuscripts will do for you, you can go the scan -> image enhancement -> OCR -> save to PDF way.
For scanning, you already got a lot of good comments on how to automate the scanning of dozens of scripts. If you lack these possibilities, a SCSI or USB desktop scanner should also do the job (it's definitely less than 1 min per page), so you can scan a script in 2 hours. No need to bother outsourcing the job to India. You can probably scan in B/W and don't need greyscale or color. I would scan handwritten scripts at 200 DPI and save the whole pictures in front of the OCRed text, so the user doesn't see the OCRed text and can only use it for selecting and copy & paste. It would be too much work to correct the OCRed text here. For machine-written text I would use 300 DPI or more for better OCR.
As image enhancement you only need to be able to automatically orient the page so that the text is horizontal. I don't remember if Acrobat does it, but for this job I would anyhow get a good OCR program.
As OCR program I recommend FineReader, but also Omnipage is ok. FineReader does better OCR than Omnipage and Acrobat. It also saves better to PDF (with retaining all of the paragraph structure) than Omnipage.
If you keep the image in front of the OCRed text in the PDF you can expect files of about 10MB per 100 pages for a B/W scan at 200 DPI. OCR of machine-written text has become incredibly accurate, so you can do real OCR there and throw away the bitmap picture. This of course gives much nicer output (and a smaller file size), but you need to spend a lot of time correcting the text. Here the best OCR program really pays off (you probably have a lot of words which are not in a dictionary and need custom dictionaries (does Acrobat have them?), ...). A program with a single flaw (e.g. one that recognizes your formula as text, or code as paragraph text, ...) will make you waste a lot of time correcting it on every second page.
Re:HP Digital Sender (Score:4, Informative)
I have one of these on my office network, and I agree that they're pretty good machines, though I have some complaints about them.
First off, I don't believe their functionality justifies the $3100 price tag. While the feature set is good, for that kind of money, this thing should be able to OCR, and not have to rely on 3rd party software for that functionality.
Secondly, their "scan to file server" feature requires a server side daemon to run--you can't simply drop the document to an SMB or NFS server. Further, the daemon only runs on WinNT/2k/XP systems, and you need to do a little bit of hacking to get it to run as a service, instead of opening it manually (or via startup folder) on login.
Third, it can be DOG SLOW. In particular, when scanning multiple large jobs (particularly at higher resolution) the thing will bog down. It also can only handle a fairly small number of jobs in queue at any one time. One of our secretaries can fill its queue in short order, and have to wait about ten minutes before she can scan the next document packet. When she's trying to scan a hundred packets, this essentially becomes her main focus for a work day.
All in all, our Toshiba copiers seem to do the same job better--of course, they have their own problems (i.e. over $20k each, with a poor user interface, and they don't do color, and don't OCR either.)
Large Scale OCR and/or PDF conversion (Score:2, Informative)
8100C (Score:3, Informative)
Re:Xerox Scanner doesn't do OCR (Score:2, Informative)
How I Do It (Score:3, Informative)
Stay with your current flat-bed scanner. Do not waste money on a sheet-fed scanner. You do not have nearly enough money for a high-end Fujitsu or Bell & Howell sheet-fed scanner which will reliably get the job done without mechanically screwing up. The pros use high-end scanners because they never screw up and they go fast. Cheap sheet-fed scanners miss sheets or jam up too often to trust them with anything. Make a sign-up sheet for work-study or volunteer students in your academic department to sit down at your computer and scanner and scan the documents into the computer. Give them free pops and gummy bears (slur it so it sounds like "rum & beers") or something similar which won't transfer from fingers to documents. Just take a few minutes to set them up and show them what to do. Keep it simple. Let those empty minds waiting to be filled with knowledge (and beer) do the time consuming zombie work. You should focus your attention on how to put the files on the website.
The scan file format I use is Portable Network Graphics format or PNG format. On average, it compresses black and white graphics 20-25 percent smaller than the widely used GIF format. PNG format is also supported to a basic enough level to be displayed using MS Internet Explorer, Netscape, Mozilla, and other internet browsers.
I use free Xsane scanning software on a linux system to scan the documents. Xsane can be set to scan in line-art mode, also known as black and white mode. This software can also be set to save files directly to disk in PNG format and automatically change the file names using numerical iteration, i.e., file-01.png, file-02.png, file-03.png, etc. without the need for human intervention to change the file name each time. I use a 100 dpi scan resolution setting because documents do not need to look ultra-smooth; they just have to be legible. Anything beyond that is a waste of hard drive space. Using this resolution also means I do not have to spend time embedding the graphic file in html code to constrain its width so it can be viewed on the average 15", 800x600 resolution monitor. I just insert weblinks to the individual, one-page graphic files: "Page 1, 2, 3, 4,
High volume scanning (Score:1, Informative)
Equally important is good scanning software that will break up your images into the files you want. Software can read bar coded sheets of paper to start new documents, or to index them in a database. Legal firms use software to match a database to the Bates stamps that ID files submitted in discovery. Software can remove the black border around the page, de-skew the image, or de-speckle the image.
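De-speckling is the easiest of those cleanup steps to picture. A minimal sketch, assuming a bilevel page as nested lists where 1 is black: it clears any black pixel with no black neighbour in the 8 surrounding cells. Commercial packages use smarter filters than this; it only illustrates the idea.

```python
def despeckle(bilevel):
    """Clear black pixels (1) that have no black pixel among their 8 neighbours."""
    height = len(bilevel)
    out = [row[:] for row in bilevel]          # work on a copy
    offsets = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
               if (dy, dx) != (0, 0)]
    for y in range(height):
        for x in range(len(bilevel[y])):
            if bilevel[y][x] != 1:
                continue
            has_neighbour = any(
                0 <= y + dy < height
                and 0 <= x + dx < len(bilevel[y + dy])
                and bilevel[y + dy][x + dx] == 1
                for dy, dx in offsets
            )
            if not has_neighbour:
                out[y][x] = 0
    return out


page = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],   # isolated speck
    [0, 0, 0, 1, 1],   # part of a real stroke
    [0, 0, 0, 1, 0],
]
print(despeckle(page))
```

The lone pixel in the second row is removed while the connected stroke in the lower right is left alone.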
As for PDF, it's a great choice compared to a lot of formats. And with Acrobat, you can do bulk conversion of multi-page tiffs to PDF in one pass.
A service bureau will do the job for less than 10 cents a page.
My throw-away email account is bugmenot@fastmail.us in case of questions.
Hope this helps
Re:University doesn't already have this service? (Score:1, Informative)
Re:Get stuffed (Score:5, Informative)
I've been working with various versions of the Docutech system for about six years, and they're in use in most of the professional copy/print shops around, at least in Scandinavia. They scan full page and double sided, 600 dpi at about 1 page/sec. Newer versions also can handle full colour.
Native document format is tiff images with a proprietary control file (structuring, positioning etc), but you can easily convert it to pdf.
I'd guess that a professional shop will charge you about 30 cents a page if you accept the raw document files without 'touching up'. This is more than adequate if you're just going to reproduce it on paper, or even distribute the PDFs. It'll weigh in at about 100k a page for the tiff format, and a lot less for the PDFs. This is black and white, which in most cases will suffice.
Professional equipment (as in contracting a print shop) is definitely the way to go. I know that at the University of Oslo, Norway, they have established an in-house shop that will do this type of work internally for just about cost. Maybe that's an idea to put forth to the management? Surely your university will find other uses for it than just your assignment.
Hope this helps
I'm archiving stuff at my university (Score:4, Informative)
We have a dual-processor G4 and an Epson 1640XL large-format FireWire scanner [epson.com] with the optional auto document feeder. It's probably a bit out of your budget ($2899 + ~$1200 for the ADF) but it's awesome. It can scan at up to 1600dpi and the ADF can automatically duplex and scan both sides of the page. We're using OmniPage Pro X for OCR software.
Right now we're more concerned with scanning the documents and getting them online, so we haven't started OCR'ing everything yet. But the ADF is awesome. It can scan both sides of all 300+ pages of a yearbook automatically in about 2 1/2 hours.
The newspapers are a bit different. They're getting a bit fragile in their old age so we have to manually scan them. We scan them at 300dpi in full color, so the 12x18 pages are around 50MB per page. But the scanner takes less than a minute per page. It's impressive.
We use Photoshop's web gallery feature to generate the image galleries. Pretty simple really. Let me know if you have any questions.
I do this all the time (Score:3, Informative)
Yes, there are scanners out there that can work for you. I have a Canon DR-5020: we just feed it a ton of paper, come back in a few, and it's done. It can scan VERY quickly. PDF format would work just fine as well. It's the best option, especially since these are handwritten notes.
If this is a requirement which is going to be on-going then you will have to pony up the money and spend a few thousand. If you're not ready to do that, you may be in luck. Some places will lease it out to you and with that few hundred bucks I'm sure you can easily get a hold of one for about a week or 2.
Look up people who do Document Imaging, and you should find a lot of businesses that come up. If you're in the Washington DC area then maybe I can help you out quite a bit.
Re:ADF Scanners (Score:3, Informative)
Re:Get stuffed - outsource to india (Score:3, Informative)
Re:Get stuffed (Score:1, Informative)
I assumed all models allowed this, but I guess not.
We do this kind of work at our University Library (Score:1, Informative)
1) Professor submits stack of documents
2) A person at the library makes sure all the copyright stuff is in order.
3) For each document, a piece of software that's part of Gobe is used to create a "cover sheet"
4) The documents are stacked on top of each other with a cover sheet on top of each
5) The stack is placed in a sheet-fed scanner
6) Hit go
7) ???
8) It's on the web in 10 minutes!
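The cover-sheet trick in steps 3 to 5 boils down to splitting one long stream of scanned pages wherever a cover sheet turns up. A minimal sketch of that grouping follows. How the real software recognizes a cover sheet isn't stated above, so the is_cover test here (a string prefix) is purely a stand-in, and the names are hypothetical.

```python
def split_on_cover_sheets(pages, is_cover):
    """Group a scanned stack into documents; each cover sheet starts a new one."""
    documents = []
    current = None
    for page in pages:
        if is_cover(page):
            current = {"cover": page, "pages": []}
            documents.append(current)
        elif current is None:
            raise ValueError("stack does not start with a cover sheet")
        else:
            current["pages"].append(page)
    return documents


# Stand-in data: strings instead of images, with a made-up "COVER:" marker.
stack = ["COVER:phys101", "p1", "p2", "COVER:math200", "p3"]
docs = split_on_cover_sheets(stack, lambda page: page.startswith("COVER:"))
for doc in docs:
    print(doc["cover"], len(doc["pages"]))
```

Raising an error when the stack doesn't open with a cover sheet is a deliberate choice: silently dropping or misfiling the leading pages would be worse than stopping the batch.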
Re:Get stuffed (Score:1, Informative)
If the professors are doing lecture notes with many equations they might consider putting the notes in LaTeX format...
It's really the best way to do lecture notes that can be edited... far more useful than a PDF file of a scan of handwritten notes.
Of course that makes things easy on you by making things hard for them...
Re:8100C (Score:3, Informative)
The color TIFFs use a deprecated form of TIFF that was rescinded from the standard as unworkable. On top of that, the version they use does not work with any variant of libTiff. Basically you are stuck working with a few Windows programs... and GraphicConverter on the Mac. Photoshop will sometimes even choke on them.
When we try to scan yellow documents the scanner will occasionally freeze up. It seems to happen sooner in TIFF mode, later in PDF mode... but eventually it just freezes and has to be rebooted. Reloading firmware does nothing to abate this.
Oh... and it does mean that the scanner has to feed a windows box, as I have not found a means of attaching it to anything else.
Re:ADF Scanners (Score:3, Informative)
And for the record, you aren't limited to only 4 software applications for scanning (at least in Windows, any application will work if it uses TWAIN). Perhaps you were referring to the document feeder having limited software compatibility?
(Off topic, but amusing nonetheless if you didn't know, there's an easter egg [eeggs.com] that's quite humorous..)
Re:Fax machine (Score:2, Informative)
Fax on a computer is actually TIFFg3 format, created by Sam Leffler whose name is on all the old BSD UNIX copyrights. It's a wonderful tool, and a wonderful format, and easily transformed to other needed formats by the various tools in the TIFF library still published by Sam, but faxes provide only 196 dots per inch at "fine" resolution, and that's usually not good enough for documents and hand-written notes.
A good flatbed scanner is really your friend, for price and performance.
Re:No good answers AFAIK (Score:3, Informative)
DjVu is probably the best format for the poster's needs. I had a university class where nothing was ever handed out to students in hard copy and documents were instead posted on the web;
Re:Kinkos? (Score:2, Informative)
Another option is to look in the phone book under "litigation copying" or "legal copying". Lawyers often scan thousands of legal documents and have them indexed by keyword. Data entry people get paid to skim each document to record the keywords before the documents are actually scanned. Price quotes are based on the quality of the originals (staples, torn sheets, etc.)
Spring for tiff conversion and ftp storage (Score:2, Informative)
Alternatives (Score:3, Informative)
Check to see if the department copy machine has scan functions... most built in the past few years do, even if they aren't used in most places for that. You'll get a decent sheet feeder and way faster scanning than most desktop sheet-fed scanners.
If you have to buy something and have to go *really* cheap, you could get a multi-function print / scan / fax thing. Most will handle legal size, because they're not actually moving the sheet fed paper onto the flatbed glass... the image element stays stationary while the paper goes by. But, of course, you get what you pay for... expect to spend time dealing with paperjams and skipped pages. However, it should be faster than hand-feeding a flatbed.
Software:
I mention this simply because nobody else has (that I've found): Scansoft Omnipage Pro is designed for highly repetitive, batch-oriented OCR. It has options for doing automated or hand-tweaked "area recognition" (separating text from graphics) and has the best proofreading UI I've seen... it flags "low confidence" recognitions automatically, and displays both its best dictionary guesses and the actual scanned words. Not sure it will help much with hand-written work, but for printed material it works well.
Format: Your primary concern when looking for a destination file format should be longevity... will the files be readable 5 years from now? I've seen a number of people recommending highly efficient but obscure compression schemes, which are a terrible idea if you want the data to stick around. Saving a few bits doesn't do you much good if you can't figure out what they mean. I recommend that people scan to two formats, just for safety (Omnipage can do this automatically).
-R
Re:Get stuffed (Score:2, Informative)
This feature isn't available on every Xerox color (we don't have it on our new 2045, unfortunately), so you'll have to check around a little. Check with print shops and see if their Xeroxes have a "scan to file" capability.
Ricoh Aficios, Ancient Fujitsus, and OmniPage Pro (Score:5, Informative)
we've gotten a bunch of jobs like this - turning handwritten documents into searchable pdfs
We had to do this, too. For a Court, which requires the reasons, decisions, etc. to be publicly available online.
*Thousands* of documents, hundreds of pages each. The responsible department got me, as the IT guy, to set it up for them (after they'd already bought the stuff to do it).
Basically, a couple of Ricoh Aficio series copier/scanners, a couple of ancient Fujitsu sheet-feed scanners, and a bunch of students sitting all day in front of computers running OmniPage Pro.
The Ricohs were great on paper - fast, networked, etc. but their scanner drivers were poor (reminded me of bad CD-ROM drivers - "Copywrite 1995 Behavior Tech Computer. All right reverse." [sic,sic,sic]), and their service (contract) involved having to call the Ricoh guy because the scanner portions randomly wouldn't appear on the network, then wait for him to appear while at least one of the students sat idle. 2 stars out of 5.
Ancient Fujitsu scanners, black and white only, don't remember the model number, required proprietary SCSI cards, no support under Windows NT/XP/2K. These were commercial-grade super-expensive scanners when new (about 1990). Installed Windows 95 on a bunch of relics with ISA slots for the SCSI cards and let 'er rip. Scanning was fast, feed was reliable like a good-quality photocopier or fax machine. Only issue was requirement for an old computer running an old OS; better overall than Ricohs - 4 stars out of 5.
OmniPage Pro 12 - reading was *excellent*, far better than anything else I've ever seen. Handled French and English, simple monochrome diagrams, etc. with only very small occasional formatting problems. Print to a PDF using Acrobat on the file server. Only real problem was stability, frequently locking up and losing the scan and OCR on page 99 of a 104 page document. 2 stars out of 5, being punitive because of frustration.
As they got to be more proficient with OPP, and as OPP's dictionaries filled up, we were able to add more and more computers and scanners, so that they were running around, tossing files into the scanners, stapling scanned documents back together, and occasionally rebooting one of the Windows 95 workstations. Peak was 15 computers and scanners.
Task took 3 students 3 months full-time.