Paper to XML?
Scott Taylor writes "I have a paper manual that I would like to convert to an HTML browsable manual and to a text-searchable PDF manual. Most of the pages of the manual use the same table layout (albeit an irregular table). My current thought is to scan in the tables and then somehow, using OCR software, convert the data in the table to an XML marked-up file. From there I can use XSLT and FOP to convert the data to HTML and PDF. The problem is that I don't know how I can make the jump from a scanned-in picture of a table to XML. Anyone out there tried this before? Is there any software that lets one mark up OCR text based on the table cell it was found in? I don't mind spending money on commercial software if necessary (as long as it doesn't cost too much). Is there a better way to solve the problem?"
One option springs to mind (Score:2)
--Dan
XML, paper: bad keyword (Score:2)
What you need to look for is the right OCR software [google.com]. The prime consideration is that it's good at figuring out tables. If you find something good, don't rule it out just because of limited/nonexistent XML support. The software will certainly support Comma-Separated-Values, or something similar, and that's easily translated into XML.
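As a rough illustration of how small the CSV-to-XML step can be, here is a minimal Python sketch using only the standard library. The element names (table, row, cell) and the sample data are made up for the example, and it assumes the OCR package can export one CSV record per table row. Rows with differing numbers of cells, as in an irregular table, simply produce rows with differing numbers of cell elements.

```python
import csv
import io
import xml.etree.ElementTree as ET

def csv_to_xml(csv_text, root_tag="table", row_tag="row", cell_tag="cell"):
    """Turn CSV text into a simple XML string, one element per row and cell.

    The tag names are placeholders for illustration, not a standard schema.
    """
    root = ET.Element(root_tag)
    # newline="" lets the csv module handle quoted fields and line endings itself
    for record in csv.reader(io.StringIO(csv_text, newline="")):
        row = ET.SubElement(root, row_tag)
        for value in record:
            ET.SubElement(row, cell_tag).text = value
    return ET.tostring(root, encoding="unicode")

if __name__ == "__main__":
    # Hypothetical OCR export: a header row and one data row with a quoted comma
    sample = 'Part,Torque\n"Bolt, M8",25 Nm\n'
    print(csv_to_xml(sample))
```

ElementTree escapes characters such as & and < in the cell text, so stray OCR output should not break the resulting XML.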
Verification (Score:4, Informative)
Of course, it might be easier just to find the original source (be it MS Word or whatever) and find software to convert that. Even if you have to get on the phone and track some people down it would be easier than re-proofing and re-tweaking.
Re:Verification (Score:2)
Even if you have to get on the phone and track some people down it would be easier than re-proofing and re-tweaking.
That is the best advice.
Don't resort to the heavy-duty technical solution before exhausting the lower-tech solution first.
Half a day's phone calls and tracking down the current location of your document's author is easier than what you're proposing to do.
Even then, before resorting to OCR to XML, go with the other suggestions of getting a good typist to regenerate the document.
You can do a favor to your successor by including some highly specific URL for the new document showing when it was created, what format it is in, and where on some networked device it is living at the time that the paper was produced.
It's also a nice gesture to publish the new document in several formats to increase the likelihood of future translatability.
No matter how "universal" a particular format happens to be these days, by presenting your document in more than one format you increase the effective lifetime of the document.
Forest for the trees (Score:5, Informative)
You might do well just hiring a company to retype this for you. There are any number of companies that do this at fairly reasonable prices, locally and offshore. They'll likely be cheaper than going through the research/purchase/assemble/scan/OCR/review/output cycle on your own, particularly for a one-off.
As to XML vs. anything else - who cares. Get it into any reasonable format and without too much effort you can convert it to another. Unless you're trying for buzzword-bingo just accept any text & table formats & post-process to your favorite flavor.
Re:Forest for the trees (Score:1)
DocBook XML (Score:1)
You've got two problems here:
As for the first, I don't think that you're going to find a completely automated solution. First of all, OCR isn't terribly accurate to begin with. Second, you have to convert OCR'd plain text to marked-up XML. You'll probably have to do this by hand unless the manual you're entering is very rigidly structured.
However, I'd certainly recommend using DocBook [docbook.org] as your intermediate XML format. It's a well-designed language targeted at technical manuals. Don't re-invent the wheel with your own XML format and XSLT style sheets.
DocBook supports RTF, PDF, HTML, PS, LaTeX, and other output formats. Do yourself a favor and use it.
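To give an idea of what the table markup looks like, here is a minimal Python sketch that wraps already-extracted rows in a DocBook-style (CALS) table: table, title, tgroup, thead/tbody, row, entry. The function name and the sample data are invented for the example. It emits DocBook 4-style elements without a namespace; DocBook 5 would additionally need the http://docbook.org/ns/docbook namespace on the elements. Treat it as a starting point, not a complete or validated document.

```python
import xml.etree.ElementTree as ET

def docbook_table(title, header, rows):
    """Build a DocBook-style CALS table from a header list and row lists.

    Hypothetical helper for illustration; output is not validated against a DTD.
    """
    table = ET.Element("table")
    ET.SubElement(table, "title").text = title
    # tgroup requires the column count; use the widest row for irregular tables
    ncols = max(len(r) for r in [header, *rows])
    tgroup = ET.SubElement(table, "tgroup", cols=str(ncols))
    thead = ET.SubElement(tgroup, "thead")
    tbody = ET.SubElement(tgroup, "tbody")
    for parent, records in ((thead, [header]), (tbody, rows)):
        for record in records:
            row = ET.SubElement(parent, "row")
            for value in record:
                ET.SubElement(row, "entry").text = value
    return ET.tostring(table, encoding="unicode")

if __name__ == "__main__":
    # Made-up sample data standing in for proofread OCR output
    print(docbook_table("Torque settings",
                        ["Part", "Torque"],
                        [["Bolt, M8", "25 Nm"], ["Nut, M10", "40 Nm", "see note"]]))
```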
--Bruce
Acrobat (Score:1, Interesting)
FineReader (Score:1)