Sanely Moving from Word to the Web?
FooAtWFU asks: "I have a job for a web site (no link for you, Slashdot hordes!). A lot of it is systems administration and development, but I have to routinely post content which comes from a myriad of other sources. Usually they are from academic users, come in Word format, and ultimately need to be posted in HTML. The problem is that Word has all sorts of tricks up its sleeve to throw off the font, layout, size, and so forth. To achieve any sort of visual consistency on the site these various formatting tags all need to be scrubbed, but even using other office suites with better HTML export (OpenOffice.Org) to do the dirty work, it's often easier to recreate the formatting by hand from a plain-text version than it is to clean up a sea of messy tags. Does anyone have any advice (or magical tools) to help me deal with this sort of tedious cleanup?"
Scrapping (Score:5, Interesting)
1. Place all "to-process" documents in a specific folder on a web server
2. Write a script to read those documents
3. Use regular expressions (and similar functions) to strip and/or replace specific tags and wording (similar to web scraping techniques).
Admittedly, it was tedious at first to identify every possible template. However, I'm amazed how predictable some documents are, and once you get hold of such a "blueprint", you can reformat documents to HTML/XML fairly easily.
Once the changes are done, I preview them in a browser, and if everything looks as expected, I simply save the page and use it; if not, it's easy enough to make a few tweaks from the familiar HTML environment.
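For anyone who wants a starting point, here is a minimal sketch of that read-and-scrub workflow in Python. It is not the poster's script: the rule list, the function names (`scrub`, `scrub_folder`) and the assumption that the inbox holds Word-exported `.html` files are all made up for illustration, and a real "blueprint" would need rules tuned to your own documents.

```python
import re
from pathlib import Path

# Hypothetical cleanup rules, applied in order. Each entry is
# (compiled pattern, replacement). Tune these to your own documents.
CLEANUP_RULES = [
    # Drop HTML comments, which is where Word hides conditional markup.
    (re.compile(r"<!--.*?-->", re.S), ""),
    # Drop purely presentational wrappers but keep the text inside them.
    (re.compile(r"</?(?:font|span|o:p)\b[^>]*>", re.I), ""),
    # Drop quoted style/class/lang attributes. This is a blunt regex and
    # would also hit the same text outside a tag, so preview the result.
    (re.compile(r"\s+(?:style|class|lang)\s*=\s*(?:\"[^\"]*\"|'[^']*')", re.I), ""),
]


def scrub(html: str) -> str:
    """Apply every cleanup rule to one document and return the result."""
    for pattern, replacement in CLEANUP_RULES:
        html = pattern.sub(replacement, html)
    return html


def scrub_folder(inbox: Path, outbox: Path) -> None:
    """Scrub every .html file in inbox and write the result to outbox."""
    outbox.mkdir(parents=True, exist_ok=True)
    for source in sorted(inbox.glob("*.html")):
        text = source.read_text(encoding="utf-8", errors="replace")
        (outbox / source.name).write_text(scrub(text), encoding="utf-8")
```

Calling `scrub_folder(Path("to-process"), Path("cleaned"))` would process a whole folder (both folder names are placeholders); the browser preview step stays manual.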
Sounds like you should release on sourceforge (Score:5, Interesting)
Get it in PDF first. (Score:3, Interesting)
What I would do in your shoes is set up a (mostly) automated system to convert the Word files to PDF. You can buy Acrobat or you can go with a third-party, printer-driver-style converter, but in the end you'll probably spare yourself more headaches by just using Acrobat.
Once you have a document in PDF, you can use any of the numerous (free and commercial) tools to convert that to HTML, text, whatever - all much more reliably than from Word directly. It's not perfect, but it's probably the closest you'll get.
Plus, you can post the PDFs themselves for download in case someone wants them - and at least Google will still happily index your PDFs.
Yes, you'll probably have to live with some NT variant to get that part done (though it might work with OSX) - but it's most likely your fastest path to *quality* conversions.
Resign from your executive position (Score:3, Interesting)
Comment removed (Score:3, Interesting)
Re:PDF? (Score:3, Interesting)
However, if someone is getting the idea for another open source project to solve this dilemma, then I'd suggest something that can render DOC to HTML on the server side. That would allow those who just know how to "set up" a webserver to sit back and let the software deal with people's problem of not using standard types. Parse the Word, WordPerfect, OpenOffice, RTF, whatever, and render it in HTML. This would allow anyone in a company to dump the document on the server/share and let it be viewed by anyone else.
But there are limitless options like this http://www.doc-api.com/ [doc-api.com] found on google...
Actually, an NDA probably doesn't matter. (Score:3, Interesting)
However, I am also not willing to just assume that no company would ever consider letting someone sourceforge a script like this. It is 1) worth good advertising and 2) clearly not important enough to be worth selling. Release it in the company's name, or not depending on what they prefer.
At a minimum a lot of small companies would be fine with this - big companies would vary wildly.
Re:Scrapping (Score:2, Interesting)
Alternately, you could use Writer2Latex [google.com] to generate XHTML 1.0 strict for yourself.
Those two methods seem the easiest.
Re:no link for you, Slashdot hordes! (Score:3, Interesting)
Re:Dreamweaver (Score:1, Interesting)
Trying this again (Score:3, Interesting)
s:STYLE=\"[ a-zA-Z0-9\:;-]*\"::Ig
s:</FONT>::Ig
s:<FONT[ -=\"A-Z0-9]*>::Ig
s:BORDER=[0-9]*::Ig
s:ALIGN=B
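For those who would rather not run sed, the four complete substitutions above translate to Python roughly as follows. This is only a sketch: the function name `strip_word_markup` is made up, the `I` flag is assumed to mean case-insensitive matching, and the last expression is cut off in the post, so it is left out rather than guessed at.

```python
import re

# Python equivalents of the four complete sed expressions in the post.
# re.I mirrors the sed "I" flag; sub() replaces every match, like "g".
SUBSTITUTIONS = [
    re.compile(r'STYLE="[ a-zA-Z0-9:;-]*"', re.I),   # inline style attributes
    re.compile(r"</FONT>", re.I),                    # closing font tags
    re.compile(r'<FONT[ -="A-Z0-9]*>', re.I),        # opening font tags
    re.compile(r"BORDER=[0-9]*", re.I),              # numeric border attributes
]


def strip_word_markup(html: str) -> str:
    """Delete every match of each pattern, in the order listed above."""
    for pattern in SUBSTITUTIONS:
        html = pattern.sub("", html)
    return html
```

Like the sed version, this leaves a stray space behind where an attribute used to be (`<p >`), which browsers ignore.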
Re:Convert to RTF first (Score:3, Interesting)
Re:OpenOffice 2.0-beta "save-as" and "export" grea (Score:3, Interesting)
Perhaps he could use the OpenOffice API to automatically have a server-side instance of OpenOffice open submitted Word documents and save them as HTML. This should happen at the same time the user uploads the document - that way the user could preview the conversion to HTML, and if it was flawed, he could choose to publish the document as PDF.
OpenOffice API:
http://api.openoffice.org/ [openoffice.org]
This code snippet shows how simple it is to convert an OpenOffice Writer SXW document into PDF:
http://codesnippets.services.openoffice.org/Write
Perhaps a few small changes here would get him what he wants.
Perl interface (ooolib):
http://ooolib.sourceforge.net/doc/ooolib-0.1.5-do
There are also Java code snippets. I think it would be possible to convert the OOBasic snippet above to either Java or Perl.
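As a rough sketch of the server-side idea, the conversion can also be done by shelling out to a headless office binary instead of going through the API linked above. This assumes a LibreOffice-style `soffice` on the PATH that accepts `--headless --convert-to`; the function names and the 120-second timeout are made up for illustration, and older OpenOffice builds may need the macro or API route instead.

```python
import subprocess
from pathlib import Path


def build_convert_command(document, outdir, target="html", binary="soffice"):
    """Build the argument list for a headless conversion.

    Assumes a LibreOffice-style command line; "binary" is whatever the
    office executable is called on your server.
    """
    return [
        binary,
        "--headless",
        "--convert-to", target,
        "--outdir", str(outdir),
        str(document),
    ]


def convert(document, outdir, target="html"):
    """Run the conversion and return the expected output path.

    Raises subprocess.CalledProcessError if the converter exits non-zero.
    """
    command = build_convert_command(document, outdir, target)
    subprocess.run(command, check=True, timeout=120)
    return Path(outdir) / (Path(document).stem + "." + target)
```

An upload handler could call `convert(upload, "preview")` to show the HTML preview, then call `convert(upload, "published", target="pdf")` if the user rejects it.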