Automated OCR for Forms Processing?
Oscar Carrillo asks: "We have to do a large NIH grant which collects tons of data. And much of that data is in the form of questionnaires. The forms will be available on the web, but it's mostly not feasible to have the subjects sit in front of the computer all day (not to mention that people get annoyed sitting in front of a computer all day). The study is being conducted at several universities and institutions around the country. Using Linux/JSP/Struts/PostgreSQL will take care of most of our needs. But it would save a lot of data entry, if all forms could be scanned at each site, images uploaded to the website, and then automatically put through OCR (Optical Character Recognition) to get only the relevant raw data that subjects wrote. Does anyone know of something that can handle this? Are there any open source projects that can handle this? Any good commercial alternatives?"
Bubbles (Score:3, Insightful)
Do not count on handwriting recognition to be successful for the people who fill out the surveys. While it works fine for typeset and computer-generated print, it won't work for many different handwritings and many different idiomatic expressions.
Re:Bubbles (Score:2)
Or maybe this [google.com]?
Due to the hardware involved, I imagine this isn't something that some OSS coder is going to slap together. NIH should have a reader/software somewhere.
Nature of questions would help answer the question a bit better.
NIH has an OCR website (Score:3, Interesting)
Re:NIH has an OCR website (Score:2)
How many pages are we talking about? And at what resolution are they scanned? Anything below 256 DPI for handwriting is not worth it.
Depends on the economics: (Score:2)
For offsite form entry, Psion Revos worked wonderfully, but we've moved everything to the Sharp Zaurus due to the uncertain future of Psion. We've kept the Psions, but just replace them when they break.
Sorry for the ramble, it's lunch time.
another lowly subject (Score:4, Funny)
Then I guess somebody forgot to tell my boss.
Re:another lowly subject (Score:2)
Solutions (Score:3, Interesting)
High school students? Technology isn't the answer to everything, and if these are handwritten you're not going to have very much success trying to automate the recognition. My name is Fod Na1oyyy, etc
Fla. (Score:3, Funny)
Doesn't the State of Florida have a forms tallying system they're looking to unload?
OCR forms processing is problematic at best (Score:4, Insightful)
OCR forms processing does:
OCR forms processing does NOT:
Frankly, most of the large projects I worked on could have gotten the task done easier and cheaper writing an app to run on low-end Palms given to each interviewee. Seriously.
If you would like more concrete advice or contacts with people in the industry, email me.
A few commercial packages: (Score:3, Informative)
I have used TELEform [cardiff.com], by Cardiff [cardiff.com] and was somewhat impressed. It can take multiple different inputs (scanner, fax, email, web POST, etc), run ICR on them, and store the data in a number of different RDBMS. I believe that this is a Windows-only package.
Another package, which I cannot recommend only because I have no experience with it, is 170 MarkView [170systems.com] from 170 Systems [170systems.com], which basically does the same thing.
I have used TELEform in a medical/clinical setting, where doctors fill out prescriptions and fax them in, character recognition is run server-side, and the results are verified by my data staff. It works pretty well, but you need to keep in mind that handwriting recognition is not infallible, and if you are interested in any level of accuracy, I would recommend that a human verify each comb-box where there is handwritten text. Most verifiers that I've seen are pretty good: you can glance at each field and just pound the tab key to scream through the form.
As far as statistics for accuracy with TELEform, the numbers that I reported are as follows (the numbers represent the percentage of fields with ICR errors of the specified type):
ICR Error type:
==========
Handwriting 3.7%
Combination Handwriting/OCR 3.2%
OCR 2.9%
Total 9.9%
You can take the first three numbers with a grain of salt (the breakdown by kind of ICR error is subjective and somewhat anecdotal) but the total is accurate -- expect to have approximately 9.9% of your fields come back with errors, and around 6% if you are really careful in designing your forms and train your users on how to write on those forms. These numbers are consistent for all of the ICR systems I have used.
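To put those rates in planning terms, here is a rough sketch of the correction workload. Only the 9.9% and 6% rates come from the numbers above; the study size (1,000 forms of 40 fields each) and the function name are made-up inputs for illustration.

```python
def fields_needing_correction(forms, fields_per_form, field_error_rate):
    """Expected number of fields that come back from ICR with an error."""
    return forms * fields_per_form * field_error_rate

# Hypothetical study size: 1,000 forms with 40 fields each.
typical = fields_needing_correction(1000, 40, 0.099)  # rate reported above
careful = fields_needing_correction(1000, 40, 0.06)   # careful form design, trained users

print(round(typical), "fields to correct at 9.9%")
print(round(careful), "fields to correct at 6%")
```

Either way a human still has to look at every handwritten field to find those errors, so the verifier's speed matters more than the raw rate.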
I hope this helps...
-Turkey
Two Words (Score:3, Informative)
You will spend less time and get better results hiring starving, desperate-for-money college students to do data entry.
Re:Two Words (Score:1)
Just google for data entry outsourcing [google.com].
Re:Two Words (Score:2)
OCR is just too unreliable. The best of them are only around 99% accurate with typewritten pages, which still means lots of errors per page. OCR of handwritten input is a joke. HUMAN interpretation of handwritten data isn't even totally accurate due to the piss poor handwriting some people have.
If you can minimize the amount of "essay" data and maximize multiple choice with fill-in-the-dot you will be able to lower your error rate.
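A back-of-the-envelope sketch of why 99% per-character accuracy still hurts. The page size (2,000 characters) and field length (20 characters) are assumed round numbers, and errors are treated as independent, which real OCR errors are not.

```python
def expected_errors(chars, char_accuracy):
    """Expected number of wrong characters in a block of text."""
    return chars * (1.0 - char_accuracy)

def field_ok_probability(field_length, char_accuracy):
    """Chance that every character in a field is right (independent errors assumed)."""
    return char_accuracy ** field_length

# Assumed sizes, not measurements: a 2,000-character page and a 20-character field.
print(expected_errors(2000, 0.99))
print(field_ok_probability(20, 0.99))
```

Under these assumptions a full typed page carries about twenty wrong characters, and roughly one 20-character field in five contains at least one error.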
You don't say... (Score:2)
If it's handwritten, just forget it. You'll have enough problems getting people able to read it, much less computers. The postal service does do some of this, but they have a secret: they know all the valid addresses and can do cross-referencing between different parts if they really have to.
If it's typed you might be able to OCR it, but don't count on it being truly reliable and plan on saving the image as well - you're going to have to be able to go back to it.
If it's filled and printed you might be able to put something together, but if you have that why not have them send the data electronically instead?
If it's fill and print but you can't communicate it electronically, see if you can generate barcodes. If this is the case, I assume it's generated by an application instead of an HTML form, since an HTML form could be communicated back to the server.
The main situation I know of where OCRing worked well was for an imaging system - the company wanted to store images of all the work order pages for each customer (to include signatures & handwritten notes), tied back to the database. Since the initial work orders were printed, all that needed to be OCRed was the work order number, which both included check digits and needed to match against a known work order already in the database. Even then, there were provisions for dealing with the ones that weren't recognized. Barcodes weren't used because the imaging system was separate from the creation system, which didn't have the capability to generate them.
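A minimal sketch of that kind of double check. The comment above doesn't say which check-digit scheme the work orders used, so the Luhn algorithm stands in here as an assumption, and `accept_work_order` and the set of known orders are hypothetical names.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn check (assumed scheme)."""
    if not (number.isascii() and number.isdigit()):
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def accept_work_order(ocr_text: str, known_orders: set) -> bool:
    """Accept an OCRed number only if the check digit AND the database agree."""
    candidate = ocr_text.strip()
    return luhn_valid(candidate) and candidate in known_orders

known = {"79927398713"}  # stand-in for a database lookup
print(accept_work_order(" 79927398713 ", known))  # clean read
print(accept_work_order("79927398718", known))    # one digit misread
```

Anything that fails either test goes to a person, which is the "provisions for dealing with the ones that weren't recognized" part.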
Captiva (Score:2, Insightful)
I have worked in OCR/forms-processing, etc. (Score:2, Informative)
eiStream (Score:1)
the WMS division of the company produces software that focuses on digitization of processed forms, OCR, workflow management and the like. the trickiest part of this process, however, may be for their off the shelf product to incorporate all your specific aspects (or requirements) into one application/process. i would also be interested to see how they handle your request as it nearly fits an as of yet (and perpetually unreleased) new product. regardless, it shouldn't be impossible to accomplish what you seek with what they have available.
as with many companies, they have a professional services group willing to come on site to customize the application to your project, performance tune the hardware and software, and help you learn the product some...
the core OCR software can be sampled with a 60 day trial download (link [eistream.com]) of the Imaging Pro (Windows users only...) software. the obvious down side to this company (other than my layoff), is that the required server hardware and licensing costs can become cumbersome. it really all depends on the scale/scope of the project and how many documents you seek to simultaneously process... the PSO group can scope that for you beforehand though if you want. oh yeah, and it's nowhere near open source...
I have worked on OCR systems. (Score:1)
Accurate OCR on forms requires a lot of custom work. The costs would be prohibitive for a one-time study.
OCR is only accurate if the problem's constrained. For example if you know the entry is an address or phone number.
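As an illustration of what a constraint buys you, here is a sketch for a phone number field. It assumes 10-digit US-style numbers, and the table of look-alike characters is a made-up example rather than anything from a real OCR engine.

```python
import re

# Hypothetical table of letters an engine commonly confuses with digits.
CONFUSABLE = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"})

def clean_phone(raw):
    """Return a 10-digit string, or None if the field needs human review."""
    digits = re.sub(r"[^0-9]", "", raw.translate(CONFUSABLE))
    return digits if len(digits) == 10 else None

print(clean_phone("(5l2) 555-O1S9"))  # look-alikes mapped back to digits
print(clean_phone("555-1234"))        # too short, so it goes to a person
```

The same trick is useless on a free-response field, because there is nothing to say that an "O" could not really be an "O".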
Free form cursive is "impossible" to do accurately. Free form printed is very, very hard. You'd probably have to restrict people to writing letters in consecutive boxes, which most people hate.
A lot of work will go into designing the questions and the layout of the form itself. And how will your scanner be fed, mechanically or by hand? A degree of skew will ruin your data unless your system is designed to handle it. (Ever fill out a form with large dots in each corner?)
After all that, you'd still need humans & a data entry system to check the borderline cases. A good system would have a data correction station that flashes portions of images and the system's best guesses for the operator to choose among or replace outright.
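The routing behind such a correction station could look roughly like this. It assumes the recognition engine reports ranked guesses with a confidence between 0 and 1; the `FieldResult` type, the field names, and the 0.90 threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class FieldResult:
    name: str
    guesses: List[Tuple[str, float]]  # (text, confidence), best guess first

def split_for_review(results, threshold=0.90):
    """Auto-accept confident fields; queue the rest for an operator."""
    accepted: Dict[str, str] = {}
    review: List[FieldResult] = []
    for r in results:
        if r.guesses and r.guesses[0][1] >= threshold:
            accepted[r.name] = r.guesses[0][0]
        else:
            review.append(r)  # operator sees the image snippet plus these guesses
    return accepted, review

results = [
    FieldResult("zip", [("78701", 0.98)]),
    FieldResult("last_name", [("Carrillo", 0.61), ("Carrlllo", 0.22)]),
    FieldResult("age", []),  # nothing recognized at all
]
accepted, review = split_for_review(results)
print(accepted)
print([r.name for r in review])
```

Where to set the threshold is a cost decision: lower it and the operators see fewer fields but more errors slip through.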
Don't consider 'off-shore' data entry. (Score:1)
Try using an email-based form instead (Score:2, Insightful)
I have created a system to help in reviewing proposed changes to the Convention on International Trade in Endangered Species (CITES), which sends out a formatted email form to reviewers (many of whom are in developing countries). The reviewers reply with their answers and a simple RegEx sucks the answers into a database.
Benefits of this approach:
||Please tell us your shoe size::
||What color are your shoes?::
The directions originally told users to enter data between the
||Please tell us your shoe size:My size is twelve:
and caused the RegEx to miss the answer.
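The original RegEx isn't shown, so the following is only a guess at a more forgiving pattern. It assumes one question per line, that the question text itself contains no colon, and that an answer may land either after the two colons or between them; `parse_reply` is a hypothetical name.

```python
import re

# "||question:<maybe answer>:<maybe answer>"
LINE = re.compile(r"^\|\|(?P<question>[^:]+):(?P<between>[^:]*):(?P<after>.*)$")

def parse_reply(body):
    """Map each question in an email reply to whatever the reviewer typed."""
    answers = {}
    for line in body.splitlines():
        cleaned = line.lstrip("> \t").rstrip()  # drop reply-quote markers
        m = LINE.match(cleaned)
        if not m:
            continue
        answer = (m.group("between") + " " + m.group("after")).strip()
        answers[m.group("question").strip()] = answer
    return answers

reply = """Hi, my answers are below.
||Please tell us your shoe size::12
> ||What color are your shoes?:brown:
"""
print(parse_reply(reply))
```

An empty answer comes back as an empty string, so unanswered questions can be chased up rather than silently dropped.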
Best of luck
Skip the sledgehammer, get the right tool (Score:2)
Sometimes simple brute force does wonders.
have you considered . . . (Score:1)
1) What's the daily/weekly volume of forms/pages? If it's really only a few hundred or even several thousand pages per week, it may not be worth it. In my project, we had several deadlines where we had to turn around ten thousand forms (5-8 pages each) in 3 days - using OCR to increase the throughput quickly became cheaper than hiring temps to do data entry.
2) What kind of data are you collecting? As several others have mentioned, form UI design is intimately connected with the type of data you need and the quality tolerances for the data. If most of the questions can be answered in checkboxes, then OCR is a good bet and will probably save you some time. Actually the vendors refer to OMR (optical mark recognition, or checkboxes/bubbles), OCR, and ICR (intelligent character recognition, usually handprint). OMR accuracy is >99%, OCR is something like 95-99% given a good font, and ICR is somewhat less. We had great success with OMR/checkboxes, but ICR recognition of handprinted address data, free response questions, etc. was quite poor. Even if each form contains a mixture of checkboxes and free response questions, it may be worth it to implement an OCR-type solution because "heads-up" data entry of the free response questions is quicker than having to look down at a stack of papers and leaf through them.
3) Are the forms already designed and printed, or do you have some freedom in working with designers to adapt them for OCR use? Using an OCR solution has the side benefit of forcing you to be really careful about how you ask questions and how you lay out the form - in other words, it really forces you to constrain certain lines and boxes, making things that much more explicit for the user. We went through 6 or 7 revisions of one of our forms before we found a format that didn't confuse anyone. I've got to think that many times human data entry operators don't bother to type in the extra notes/margin comments/etc. on manually entered forms.
4) How many sites are there? In our case, we had 1 site where all of the forms were mailed. We had a dedicated form processing area with 5 big workstations, $30K of software (we used ReadSoft, more on that later), and $15-20K for three medium-sized Kodak scanners. The workflow is basically scan --> perform OCR/ICR/OMR recognition --> manually validate any fields that were not recognized --> write to a flat file or write directly to database using ODBC.
SCAN - If you have lots of sites, paying $2-3K for a low-end duplex scanner for each site may not be realistic. I'm increasingly interested in the low-end HP printer/scanner machines, as these allow you to scan to a network location or email (starting at about $700).
RECOGNITION - ideally this would happen in one location, because you're generally paying the bulk of the license fee for this.
VALIDATION - both ReadSoft (http://www.readsoft.net) and Captiva (http://www.captivacorp.com) claim to support both fat/proprietary and web-based clients for the validation by data entry people. I've never tested the web clients, but that might be a real selling point.
UPLOAD - we wrote the data to a flat file and then used a Perl script to post each line of data to a JSP form - so each form was available both on the web and on paper - this turned out to be quite nice.
As far as vendors, ReadSoft and Captiva both seem to be pretty strong. Cardiff's TELEform is another decent product, although it has an integrated form-builder that is pretty limiting, compared to ReadSoft's ability to adapt to pre-existing paper forms.
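The upload step can be sketched as follows, in Python rather than the Perl we actually used. It assumes a tab-delimited flat file whose first line names the form fields; the column names and the idea of one POST per row are illustrative, and the target URL is whatever your JSP form answers on.

```python
import csv
import io
import urllib.request
from urllib.parse import urlencode

def rows_to_post_bodies(flat_file_text, delimiter="\t"):
    """Turn each data row into a form-encoded body, keyed by the header row."""
    reader = csv.DictReader(io.StringIO(flat_file_text), delimiter=delimiter)
    return [urlencode(row) for row in reader]

def post_body(url, body):
    """Send one row to the web form; returns the HTTP status code."""
    req = urllib.request.Request(
        url,
        data=body.encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# Hypothetical two-column file; real files would come from the validation step.
sample = "subject_id\tq1\n17\tyes no\n18\tmaybe\n"
for body in rows_to_post_bodies(sample):
    print(body)  # pass each one to post_body(your_form_url, body) to send it
```

Posting through the same form the web users see means the paper data gets the same server-side validation as everything else.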
I looked long and hard at sourceforge and elsewhere, and though there are some OCR projects, there doesn't seem to be anything focused on form processing. Given the progress on the XForms standard and an alpha implementation of it in the development build of Cocoon, I'd love to see a project get started to build a forms OCR project into Cocoon. Anyone?
Scantron sheets (Score:1)