Large-Scale Paper-To-Digital Conversion?
An anonymous reader writes "I've just been asked to digitize several dozen sets of lecture outlines at the university where I work. Basically, professors want to hand me a big (often 100+ page) stack of their handwritten lecture notes (with messy text, equations, and diagrams; sometimes double-sided) and expect me to post a PDF-or-something-similar to their course's web page. However, every desktop scanner I've ever used takes 1-2 minutes of user-attention per page and the resulting files end up Huge, impossible-to-read, or both. All I have at my disposal is my PowerBook, Acrobat, a couple hundred dollars of department funds for a new scanner (this maybe?), and, if I ask nicely, overnight use of the secretary's Win2k box. Any ideas? Sheet-fed scanner recommendations? Better file formats than PDF (or better PDF settings)? Do any of you students have usability advice?"
Get stuffed (Score:4, Insightful)
If you're being 'asked' (Score:5, Insightful)
It sounds to me like a damned hard job to automate (which is the only way it's not going to be a constant drain on your time), and you're being given next-to-no resources to even come up with a creative solution. Sometimes the best answer is in fact 'No' - it forces people to re-evaluate what they're asking. It comes with the danger of being sacked if it's you that's being unreasonable, of course....
Simon.
Re:Knee to the grindstone... (Score:5, Insightful)
Faust7 is right about this one. Frankly, OCR is ok, but not great - on nice text on book-or-better paper. Handwritten notes? With equations? No. Not unless your profs have some damn fine handwriting and we all know that that is absolutely not the case.
My advice is the same as Faust7's with these additions: spend some of that money on a really nice keyboard, wrist-rest and/or maybe a nice monitor. You are going to be needing all three. If there are any left over funds, get some really nice tea. I suggest Twinings English Breakfast or Prince of Wales, if you're going to go bagged.
Re:HP Copiers (Score:3, Insightful)
I think the real answer is that this guy is S.O.L.
where to look (Score:3, Insightful)
But the broader question is whether this is really a good idea. The result is going to be huge files, which will be messy, hard to read, and will lack an index or table of contents. Seems like a case of profs with too much ego and not enough willingness to put their own work into more useful form.
Re:Outsource it (Score:5, Insightful)
Re:Get stuffed (Score:3, Insightful)
How do you know he's getting paid to do it? Some professors have a nasty habit of getting all their nasty, menial and boring stuff done by their students who are already working on their degree projects 12 hours a day, six days a week.
Ok, so for some reason I assumed that the poster is a student so my initial reaction was probably off. I would never assign such a menial, dead-end task to my postgrad students, nor would I have accepted such a task without objections when I was still a student.
Re:Get stuffed (Score:3, Insightful)
Re:If you're being 'asked' (Score:5, Insightful)
Explain the enormity of this scratched-note-to-finished-PDF job to this educator. Use crayons, mirrors, yarn and tape if necessary to get your point across. Just be diplomatic :P
Re:Simple. (Score:5, Insightful)
Take the few hundred you have to spend on equipment and spend it hiring a few temps.
A good typist should be able to type up hand written notes faster than scanning them all in and manually fixing all the mistakes.
Re:Get stuffed (Score:5, Insightful)
I used to have a boss that would say things like "this should only take you about five minutes". I finally told him, "nothing takes just five minutes, if I have to stop what I'm doing there is a startup/teardown cost for every task." I convinced him that there was a granularity of 1/2 hour for every random task he wanted done. The discussion was fruitful for both of us, he was more reasonable about his expectations and put a bit more thought into what he wanted to distract me from my primary task to do.
Now, the original idea is a reasonable proposition; however, it isn't really the sort of thing that should be done for just one prof. Perhaps several departments can combine their resources to set up something that will allow this type of thing to be done in a reasonable time frame.
plurvert
Re:Simple. (Score:2, Insightful)
Outsource the job to India.
Not as bad an idea as it sounds. My advice is to not waste the department's money, and your time, buying, installing, and using a sheet-fed scanner. Somebody in your local area assuredly has one already that they either rent out to people in your situation, or that they use to do the work you need done.
Use the funds that the department gave you to have your local copy shop do the work. They will almost certainly do it faster than you could, and the end product will most certainly be better than what you could provide. This is the kind of thing that the people who work at copy shops do for a living.
Also PDF is a great format for this, highly portable, and so far fairly version proof. You don't have to worry about the PDF being obsolete before the professor decides to change the structure of his class.
All you can do... (Score:5, Insightful)
DON'T CLEAN UP THE SCANS. Don't even look at the scans. DO NOT RETYPE ANYTHING.
With the kind of volume you say you're receiving, the only way you're going to survive is to:
1. close your eyes,
2. load the documents into the feeder,
3. press 'scan'.
4. Make sure everyone knows this policy.
Re:If you're being 'asked' (Score:1, Insightful)
It's not a technology problem it's a problem of (Score:3, Insightful)
What about usability? (Score:1, Insightful)
In my experience, it is very hard to predict what students will use for any given class based on the moronic ramblings of
I'd like to say that you're at a really shitty university that would take this kind of student-hostile course of action, but then, I checked out MIT's Open Courseware only to find that the first course I looked at, Gilbert Strang's linear algebra, was a botch job. There was a postage-stamp-sized video of Strang telling anecdotes on the first day of class that could only be appreciated by someone who'd already taken the class. So much for leveraging the web's inherent strong points!
Re:Get stuffed (Score:5, Insightful)
Besides that, it sounds like they are not aware of the time involved in scanning tens, let alone hundreds, of pages. It doesn't sound like they are too anxious to make it easy for him to get the job done either (not buying him new equipment, using the secretary's Win2k box after hours??).
I've volunteered my efforts before on a simple scanning job that required hundreds of regular photos to be scanned in at relatively good quality (why else do it otherwise), and ended up taking forever. Upon informing the client of the amount of time required, they adjusted the way the job was being handled.
I think being straight with your employers, and clients is the best approach to any situation where too much is being expected. The times I've had these instances come up, and recommended different approaches that resulted in money being saved, or manhours on a task being reduced, I saw benefit in my paycheck through raises or promotions.
Re:If you're being 'asked' (Score:3, Insightful)
professionally (Score:3, Insightful)
The professional approach is to go back to them and clarify the outcome:
(a) you can scan the documents in, and they'll take X amount of space, and Y time; and this doesn't include OCR;
(b) you did a few tests (using the supplied document) and these are the results for TIFF, JPG, PDF, etc;
(c) OCR is probably infeasible (or not, do some tests) because of the nature of the documents;
Include in (a) the option of purchasing an automated document scanner, and the corresponding reduction in time.
Based upon all the above, get a clear go-ahead, and make the purchase if new equipment is authorised.
You said "where I work": this is your job. It's a bit poor to do as the other posters suggest and refuse to do the work. You need to make sure that the customers (the professors) understand exactly what they are getting, and give them a choice to buy into it or not - i.e. "clarify the expectations".
If you assess that it's 2 weeks' worth of work, and the professors don't disagree, then your supervisor just has to put up with it.
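To put rough numbers on X and Y in option (a), something like the following back-of-the-envelope sketch might help. It assumes US Letter pages and reports uncompressed bitmap size, so treat the storage figure as an upper bound before any compression; the 300 dpi default and the function name are my own assumptions, and the 90 seconds per page is just the midpoint of the 1-2 minutes quoted in the question.

```python
def scan_estimate(pages, seconds_per_page, dpi=300, bits_per_pixel=1,
                  width_in=8.5, height_in=11.0):
    """Return (hours, uncompressed_megabytes) for one scanning job.

    dpi, bits_per_pixel and the page size are assumptions to adjust:
    1 bit per pixel is black-and-white, 8 is greyscale, 24 is colour.
    """
    pixels_per_page = round(width_in * dpi) * round(height_in * dpi)
    bytes_per_page = pixels_per_page * bits_per_pixel / 8
    hours = pages * seconds_per_page / 3600
    megabytes = pages * bytes_per_page / 1e6
    return hours, megabytes


# One 100-page set on a flatbed at 90 seconds of attention per page.
hours, megabytes = scan_estimate(pages=100, seconds_per_page=90)
print(f"{hours:.1f} hours, {megabytes:.0f} MB uncompressed at 1 bit/pixel")
```

Run it once per scanner and bit depth you are considering and hand the table to the professors along with your test scans.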
PDF of handwritten notes is DUMB!!! (Score:5, Insightful)
It sounds like the school wants to shift the production costs (i.e. printing) to the students. This seems inefficient compared with the old way, where the instructor could go to the copy center and have the notes copied at the school's expense (I know these expenses are often passed along to the students anyway), rather than at the students' DIRECT expense: their time for downloading, then printing out on their own equipment or using their own printing accounts at the computer center.
If the notes were being OCR'd and then made available on-line, or post-processed in such a fashion that they were searchable, indexed, etc., it would be useful. Otherwise this seems like a waste of time and money.
-MS2k
Get a Canon Document Scanner (Score:3, Insightful)
The company I work at scans large amounts of documents to PDF format on a daily basis. Depending on the volume some people do, we use either a Canon DR-3060 or DR-5020 document scanner. These will scan both sides of a page simultaneously, clean up the image (despeckle and deskew) and convert them into TIF or PDF all on the fly. They're fast too. Between 20 and 50 pages per minute. Only problem is that they're expensive.
For your budget, you may be able to afford the Canon DR-2080C [canon.com] which goes for around $600. It has all the features of the more expensive ones, but it's meant for smaller volumes like what you're dealing with. With that, you'd be able to scan 100 pages into a pdf document in around 5 minutes.
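For a rough feel of what those rated speeds mean per stack, here is a minimal sketch. The pages-per-minute figures are the ones quoted above; the feeder capacity and reload time are hypothetical numbers of mine, not Canon specs, so plug in whatever the actual unit holds.

```python
import math


def feeder_minutes(pages, pages_per_minute, feeder_capacity=50, reload_seconds=30):
    """Estimate minutes to scan a stack on a sheet-fed scanner.

    feeder_capacity and reload_seconds are hypothetical placeholders;
    check the real scanner's tray size before trusting the result.
    """
    batches = math.ceil(pages / feeder_capacity)
    return pages / pages_per_minute + batches * reload_seconds / 60


for ppm in (20, 50):
    print(f"100 pages at {ppm} ppm: about {feeder_minutes(100, ppm):.0f} minutes")
```

Even with reloading, that is minutes per stack instead of the hours a flatbed would eat.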
Re:Xerox Scanner doesn't do OCR (Score:3, Insightful)
I know there are Adobe archival systems that store the scanned image, along with whatever text they manage to recognize. You don't expect near 100% OCR accuracy from an old, largely handwritten sheaf of lecture notes and transparencies. But hopefully enough is recognized to be of some use.
ask your professor to be precise ... (Score:3, Insightful)
I think what your professor wants is not a bitmapped copy of his handwritten notes or some vector curves that resemble such, but actually a typeset version of the lecture notes. If that is the case, assuming that his handwritten notes are sparse (and hopefully without diagrams, since it takes more time to mess around with them), you can definitely do a stack of 100 sheets in a week, or, as someone already suggested, hire some typists to help you out.
Re:Xerox Scanner doesn't do OCR (Score:3, Insightful)
Ok. Use voice Recognition (Score:1, Insightful)
2. Dictate the essay, albeit a bit lengthy, into it.
3. Import to Word or your favorite word processor.
4. Add any cool equations and such that you cannot dictate.
5. Publish to PDF.
Nice small file size I'm sure.
Scanning is nice, but OCR only works with fonts it can recognize. Not Professorese.
It could take you a day or so to dictate, but after you're finished, more than likely you will have a lot fewer spelling and random letter and symbol problems.
But again, this might be more work than you want to do. Why? Well, if you do it this way and make a nice clean portable document that everyone can read, you might find yourself getting more "extra work" than you wanted.
either outsource or use a digital camera (Score:2, Insightful)
Have it professionally done, like other people here have recommended. High-end sheetfed scanners are great, but you probably can't afford one, and it wouldn't make sense as a one-time expense for this small of a job. I'm a big fan of just handing someone some money and it's magically accomplished.
Alternatively, use a digital camera and well-lit copy stand. You can improvise a copy stand with a tripod or whatever, but make sure you have a lot of light. It's a lot faster than using a scanner, and the results are acceptable if you have a good camera. The more megapixels the better - don't use the old 1.3mp one you have lying around. 3mp will technically work, but more is better. Ideally a digital SLR pointed straight down at the page, a very well-lit area (a clamp light on either side of the page works nicely), and you sitting there sipping Starbucks while you hit a cable shutter release after you flip every page. You could get a few hundred pages an hour done this way--your only limitation is how fast you can turn the pages. You'd only have to stop to transfer images to your computer, and you only have to do that often if you don't have enough memory cards. After you get all the pages into the computer, feed them into Acrobat and you're done.
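If you want to sanity-check the megapixel advice, you can work out roughly what "scanner resolution" a camera gives you when a letter-size page fills the frame. This is only a sketch: the sensor pixel dimensions below are typical values I'm assuming for each class of camera, and real results also depend on lens quality and focus.

```python
def effective_dpi(sensor_px_long, sensor_px_short,
                  page_long_in=11.0, page_short_in=8.5):
    """Approximate dots per inch when the whole page just fits the frame.

    The tighter of the two axes limits the resolution, hence min().
    """
    return min(sensor_px_long / page_long_in, sensor_px_short / page_short_in)


# Assumed, typical sensor sizes (not taken from any specific camera spec).
cameras = {
    "1.3 MP": (1280, 960),
    "3 MP": (2048, 1536),
    "6 MP": (3008, 2000),
}
for label, (long_px, short_px) in cameras.items():
    print(f"{label}: about {effective_dpi(long_px, short_px):.0f} dpi")
```

Under those assumptions the 1.3mp camera lands a little above 110 dpi and the 3mp one around 180 dpi, which fits the "technically works, but more is better" advice for small handwriting.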
If you don't want to use Acrobat you could make a web page with thumbnails linked to the hi-res images. Then your end-users wouldn't need to download the Acrobat reader. I love Acrobat's ubiquity but hate the file sizes and the slow start-up time.
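A thumbnail page like that is easy to generate with a short script. The sketch below is one possible way to do it, assuming the thumbnails already exist in a thumbs/ subdirectory under the same file names as the full-size images; that layout, the function name, and the page*.jpg naming are my own inventions for illustration.

```python
from html import escape
from pathlib import Path
from urllib.parse import quote


def build_index(image_names, title="Lecture notes", thumb_dir="thumbs"):
    """Return an HTML page of thumbnails, each linked to its full-size image.

    Assumes thumb_dir holds a thumbnail for every image under the same
    file name (a hypothetical layout, adjust to taste).
    """
    links = []
    for name in sorted(image_names):
        url = quote(name)  # percent-encode spaces and other odd characters
        alt = escape(name, quote=True)
        links.append(
            f'<a href="{url}"><img src="{quote(thumb_dir)}/{url}" alt="{alt}"></a>'
        )
    body = "\n".join(links)
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><title>{escape(title)}</title></head>\n"
        f"<body>\n{body}\n</body></html>\n"
    )


if __name__ == "__main__":
    # Index every page*.jpg in the current directory.
    names = [p.name for p in Path(".").glob("page*.jpg")]
    Path("index.html").write_text(build_index(names), encoding="utf-8")
```

Zero-padded file names (page001.jpg, page002.jpg, ...) keep the pages in the right order, since the script sorts them as plain strings.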
Mass Scanning (Score:2, Insightful)
Don't bother (Score:3, Insightful)
The mere fact that it's handwritten means that it's basically a rough draft that was hastily flung together. Send them back to him, and have him type them in and rework them until he figures they're worth recycling for next semester. The prof will save time in the long run, and the students will have something nice, clean, and organized to peruse.
Prof. Clueless, PhD (Score:2, Insightful)
After I stopped laughing, I realized this may be a serious inquiry rather than a joke. I've assisted local government agencies in converting clear, printed, 8.5x11" text documents into searchable text / pdf documents, and the cost for these is over 10 cents a page. (Tax and mill levy records have to be verified 100% correct, as I'm sure your prof's notes need to be.) That's with volume discounting (> 500,000 pages), using nearly perfect ascii text documents, not scribbled notes.
So my advice is to get a few bids from outside contractors, then submit a realistic estimate based on the average. Hint: Given those spec's, it's clear you/your management have no idea what's involved in this process. (Shows at least a modicum of IQ that you had the good sense to ask, however.) If you simply need to scan/save as pics (jpg/tiff -> pdf), you can do this yourself at reasonable cost/effort expenditure. Seems to be implied that you need OCR capabilities for handwritten text, as complicated as equations at that, so you're really pretty screwed. Even simply creating 100-200 kb jpg's & emailing them in an automated process is going to run into problems when the campus mail servers refuse to accept attachments larger than a Meg.
Good luck, BWAhahahahaha!