Best OCR for Technical Texts?

An anonymous reader asks: "I'm scanning in user manuals for older lab equipment. I've never used OCR before today, so I installed the Caere Omnipage 9.0 that came with the scanner. I was pretty happy except for a few things. It doesn't seem to want to recognize engineering symbols like the single-character +/-, square root, and omega, or simple equations; it has trouble with super- and subscripts; and it outputs funky Word files. For example, from an 8.5 x 11 original page scanned in at 1 bit at 300 dpi, the output Word file was 10 inches wide, used tons of Omnipage text styles, and didn't match the original text's flow. It did do a good job of italicizing headers and recognizing the various sections in a two-column page. Googling the news and the net just backs up my claims but provides no real solution: a Google search for 'best OCR for engineering' turns up nothing useful."

  • Comment removed (Score:5, Informative)

    by account_deleted ( 4530225 ) on Thursday May 08, 2003 @10:22AM (#5909902)
    Comment removed based on user account deletion
    • Re:Clara OCR (Score:3, Informative)

      by PerlGuru ( 115222 ) *
      Though I haven't used Clara OCR, I went to that page and it looks like it might work for you. It appears to learn the font for the page: once you tell it what a symbol is, it remembers that and reuses it whenever you tell it to use that font. Looks like something I am definitely going to try. What could it hurt? It's open source, so no money out of pocket.
  • Good Luck! (Score:4, Interesting)

    by Asprin ( 545477 ) <gsarnold@yahoo.cMOSCOWom minus city> on Thursday May 08, 2003 @10:32AM (#5909984) Homepage Journal

    Good luck!

    I've used a few different versions of OmniPage Pro, and it works OK if the layout is not complicated, it uses standard fonts, the text is clean and clear, and it doesn't have too many weird logos or symbols. You still have to proofread everything and correct it by hand, though, so I'm not convinced it's a time saver as much as it is a typing saver.

    OmniPage Pro does do a MUCH better job of identifying words than the free version they throw in with scanners because it uses spelling and grammar checkers to help ID words from context. The free version is as close to useless as you can get in the software world - it's really just an ad for Pro.

    Engineering and math symbols are right out.
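
    The "spelling checker helps ID words from context" idea can be sketched roughly as follows. This is only an illustration, not how OmniPage Pro actually works: the word list, the function names, and the 0.8 similarity cutoff are all assumptions, and the matching is done with Python's standard difflib module.

```python
import difflib

# Hypothetical vocabulary; a real one would come from a spelling dictionary.
VOCABULARY = ["calibration", "resistor", "voltage", "manual", "oscilloscope"]


def correct_word(raw, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the closest vocabulary word, or the raw token if nothing is close."""
    lowered = raw.lower()
    if lowered in vocabulary:
        return raw
    # get_close_matches ranks candidates by similarity ratio (0.0 to 1.0).
    matches = difflib.get_close_matches(lowered, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else raw


def correct_line(line):
    """Apply correct_word to every whitespace-separated token in a line."""
    return " ".join(correct_word(token) for token in line.split())


if __name__ == "__main__":
    print(correct_line("vo1tage ca1ibration ±5"))
```

    Tokens that resemble nothing in the word list (symbols, part numbers, units) pass through untouched, which is also why this kind of checking does not help with engineering and math symbols.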

  • Try Different (tm) (Score:2, Informative)

    by coyote4til7 ( 189857 )
    Have you tried other combinations of settings (e.g. dpi, bit depth)? That won't solve all of the problems you talk about, but it's important to play with those settings in each package you look at _before_ rating how good it is.
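
    One way to put a number on "how good it is" for each combination of settings is a character error rate against a page you have typed in by hand. The sketch below is an assumption about how one might score the results, not something any of the packages provide; the function names are made up for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum inserts, deletes, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def character_error_rate(reference, ocr_output):
    """Edit distance divided by the length of the hand-typed reference."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


if __name__ == "__main__":
    reference = "Set the range to 10 V"
    # Hypothetical OCR outputs for two scanner settings.
    for label, text in [("1 bit", "Set tbe range to lO V"),
                        ("8 bit", "Set the range to 10 V")]:
        print(label, character_error_rate(reference, text))
```

    Run the same test page through each package and setting, then compare the rates; lower is better, and 0.0 means the output matched the reference exactly.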
  • Finereader (Score:3, Informative)

    by Marc Boucher ( 235490 ) on Thursday May 08, 2003 @10:37AM (#5910025) Homepage
    You can try FineReader from ABBYY [abbyy.com]
  • Use Greyscale (Score:5, Informative)

    by jayrtfm ( 148260 ) <jslash@soph[ ].com ['ont' in gap]> on Thursday May 08, 2003 @10:40AM (#5910048) Homepage Journal
    Use 8 bit, NOT 1 bit. When I switched from 1 to 8 bit on a page of normal text, the dozen or so errors vanished.

    Since Omnipage is up to version 12, perhaps there's been an improvement since your version.

    Your Google skills are sorely lacking; the "Hacking Google" book would be a good investment for you. Eliminating the quotes and the word "best" in your search string would help.

    Two different free web-based OCR services; just upload a 300 dpi b/w (8-bit greyscale) file:
    http://www.expervision.com/webtr6.htm
    http://docmorph.nlm.nih.gov/docmorph/

    here are some OCR programs

    http://www.scansoft.com/omnipage/

    http://www.abbyy.com/

    http://www.newsoftinc.com/redir/digitaloffice_all.asp?category=ocr4

    more ocr links than you really want
    http://web3.humboldt1.com/~jiva/ocr/_ocr_resource.htm

  • ICR, Google, etc (Score:3, Informative)

    by Strange Ranger ( 454494 ) on Thursday May 08, 2003 @01:10PM (#5911342)
    What you really need is ICR, Intelligent Character Recognition. There is a free trial version of one such product here [pegasusimaging.com].

    Better [google.com] Google [google.com] searching [google.com] makes [google.com] the difference [neurascript.com].
  • Anyone know of a decent home-use scanner with a letter-size sheetfeeder?

    Can it also take a stack of 4x5 photos?
  • The Best! (Score:4, Funny)

    by FortKnox ( 169099 ) on Thursday May 08, 2003 @01:56PM (#5911759) Homepage Journal
    The Best OCR scanner is an intern with a pencil. ;-)
  • Oh boy, it's been a long time, but when I evaluated OCR packages I found that each of the 3 major choices at the time (OmniPage, TypeReader, and one whose name I no longer remember and which is no longer available) could do better than the others on certain types of my test documents. So you might want to get a trial of TypeReader (www.expervision.com) and try it out, in addition to FineReader as suggested by another poster.
    • I recently completed a project to OCR many 10s of thousands of technical documents. I ended up choosing TypeReader. I have to say that TypeReader beat the hell out of every other OCR program I found. It was extremely fast, extremely accurate, had a bunch of output formats to choose from, and offers a batch mode that worked wonders for what I had to work with.

      I HIGHLY recommend TypeReader!

      Paul
  • ABBYY FineReader
  • I've found that the best approach is to leave the scans as image files and bundle them together as a multipage TIFF or PDF file. It takes more space to store them, but you don't have to mess around with OCG (Optical Character Guessing).

    Easy to access and read. The only loss is you can't do cut and paste or text searching.
  • Scan your manuals in 300dpi or 600dpi grayscale and archive them that way. This will allow you to go back and use better OCR as OCR technology improves.

    Then, use something like Adobe Acrobat to put them on-line: Acrobat uses OCR internally to make the text searchable, but it still displays the original page image. That means that formulas and appearance will be preserved even if the OCR screws up.
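
    For a rough idea of what archiving page images costs in disk space, here is a back-of-the-envelope sketch. It assumes uncompressed pixel data and ignores row padding and file headers, so real TIFF or PDF files (which are normally compressed) will be smaller; the function name is made up for illustration.

```python
def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Approximate raw image size in bytes for one scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


if __name__ == "__main__":
    # Letter-size page at the resolutions and bit depths discussed in the thread.
    for dpi in (300, 600):
        for bits in (1, 8):
            size = uncompressed_bytes(8.5, 11, dpi, bits)
            print(f"{dpi} dpi, {bits}-bit: {size / 1_000_000:.1f} MB per page")
```

    A letter-size page at 300 dpi is 2550 x 3300 pixels, so 8-bit grayscale is about 8.4 MB raw versus about 1.1 MB at 1 bit, and doubling the resolution to 600 dpi quadruples both figures.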

  • Prime OCR is what I have always seen in corporate sectors. At both Ford Motor Company and Pfizer it is used for OCR on very complex technical documents.

    http://www.totalsol.com/products/doc_process/primerecognition/product_prime.html

  • How about Gamera? "Gamera: A Python-based Toolkit for Structured Document Recognition": http://dkc.mse.jhu.edu/gamera/papers/gamera_python_2002/gamera.html
