Sanely Moving from Word to the Web?

FooAtWFU asks: "I have a job for a web site (no link for you, Slashdot hordes!). A lot of it is systems administration and development, but I have to routinely post content which comes from a myriad of other sources. Usually they are from academic users, come in Word format, and ultimately need to be posted in HTML. The problem is that Word has all sorts of tricks up its sleeve to throw off the font, layout, size, and so forth. To achieve any sort of visual consistency on the site these various formatting tags all need to be scrubbed, but even using other office suites with better HTML export (OpenOffice.Org) to do the dirty work, it's often easier to recreate the formatting by hand from a plain-text version than it is to clean up a sea of messy tags. Does anyone have any advice (or magical tools) to help me deal with this sort of tedious cleanup?"
  • Use antiword (Score:2, Informative)

    by shura57 ( 727404 ) * on Tuesday August 09, 2005 @05:52PM (#13282130) Homepage
    It takes a Word file and spits out plain text. It can also do some more tricks.
  • Dreamweaver (Score:5, Informative)

    by necro2607 ( 771790 ) on Tuesday August 09, 2005 @05:52PM (#13282131)
    I would suggest using Macromedia Dreamweaver... it's what we use where I work and essentially all of our content entry involves pasting in content from Word documents supplied by clients. Dreamweaver is pretty good for formatting and working with stylesheets.
  • Antiword (Score:3, Informative)

    by alanp ( 179536 ) on Tuesday August 09, 2005 @05:53PM (#13282135) Homepage
    Try antiword, it's got a real decent HTML option.
  • Textism (Score:5, Informative)

    by NoInfo ( 247461 ) * on Tuesday August 09, 2005 @05:53PM (#13282137) Homepage Journal
    Here's a tool I saw linked off of O'Reilly Radar once:

    http://textism.com/wordcleaner/ [textism.com]

    I used it once and it did a pretty decent job at preserving the tables. Yet if they're using anything odd like graphics or it's been incredibly tweaked, it probably won't be 100% perfect.
  • HTML Export (Score:3, Informative)

    by electroniceric ( 468976 ) on Tuesday August 09, 2005 @05:55PM (#13282160)
    If you're using Office 2000, you can find the HTML filter here:

    http://www.microsoft.com/downloads/details.aspx?FamilyID=209ADBEE-3FBD-482C-83B0-96FB79B74DED&displaylang=EN [microsoft.com]

    I believe this functionality is built into later versions of Word.

    Per the site, this produces simpler HTML with Office-specific tags removed. With that done, you could probably use a Perl script, and you might also try writing some Word macros or COM/VBA scripts that clean up the document from within Word.
  • Dreamweaver (Score:5, Informative)

    by SlashChick ( 544252 ) * <erica@eriTOKYOca.biz minus city> on Tuesday August 09, 2005 @05:55PM (#13282162) Homepage Journal
    Check out Commands -> Clean Up Word HTML in Dreamweaver. It does a nice job of getting rid of extraneous tags. While you're at it, take a look at Commands -> Apply Source Formatting as well. This can be customized to your specifications in the preferences section, and automatically tabs out, adds newlines, and converts tags to lowercase where appropriate in the HTML document. Dreamweaver is the closest thing I know of to a program that "automatically" cleans up Word HTML.

    Good luck!
  • HTML Tidy (Score:5, Informative)

    by N8F8 ( 4562 ) on Tuesday August 09, 2005 @05:56PM (#13282166)
    Save the Word document as filtered HTML and pipe the HTML through HTML Tidy [sourceforge.net]. Nice clean HTML.
  • Re:Dreamweaver (Score:5, Informative)

    by fean ( 212516 ) on Tuesday August 09, 2005 @05:56PM (#13282172) Homepage
    In Dreamweaver, there's a command "Clean up MS Word HTML". It's made to clean up Word's crappy HTML, and does a pretty nice job of it.
  • by SeanTobin ( 138474 ) * <{moc.liamtoh} {ta} {rtnuhdryb}> on Tuesday August 09, 2005 @05:56PM (#13282174)
    Hmmm... sounds like a challenge to me. Let's see what we can dig up.

    Step 1: Let's look at his user page [slashdot.org]

    Ahh! He put in a website with his profile. Let's all go and check out http://fennec.homedns.org/ [homedns.org]

    Hmm... looks like a personal page. Not too sure what to make of the comic. Anyway, let's move on to..

    Step 2: Let's look at his author [wfu.edu] page. Some interesting stuff here, including three separate e-mail addresses (which I won't post here. You're welcome :)

    A-ha! There is a link to his employer! It's Economic History Services [eh.net]. And what do you know... there are a significant number of pages (especially under abstracts and book reviews) that seem to come straight out of a word processor, only with extensive cleaning. A quick look at the source reveals something interesting. It's clean. Very clean. We're talking on the level of I-use-vim-for-my-webpage-editor clean. Nice job.

    Anyway, it looks like it was done by hand. I'm not saying it's not good work (quite to the contrary), but I can see your need for an automated solution.
  • by Anonymous Coward on Tuesday August 09, 2005 @05:58PM (#13282183)
    Why use EditPlus when you can use Crimson Editor, which is free, open source, and has all the capabilities of EditPlus and then some? (http://www.crimsoneditor.com/ [crimsoneditor.com]) The built-in macro functionality is really sweet too!
  • HTML Tidy (Score:3, Informative)

    by John_Booty ( 149925 ) <<gro.tcejorpytoob> <ta> <ytoobnhoj>> on Tuesday August 09, 2005 @05:59PM (#13282191) Homepage

    HTML Tidy has a special mode for cleaning up Word's crappy HTML export. HTML Tidy is a free command-line tool that is also embedded in a lot of popular HTML editors.

    HTML Tidy:
    http://tidy.sourceforge.net/ [sourceforge.net]
    HTML Kit (great integration with HTML Tidy; it includes HTML Tidy so you can just grab HTML Kit without grabbing HTML Tidy)
    http://www.chami.com/html-kit/ [chami.com]

    Countless other editors integrate with HTML Tidy as well. Have fun and good luck!

  • Tidy Flags (Score:5, Informative)

    by N8F8 ( 4562 ) on Tuesday August 09, 2005 @06:00PM (#13282206)
    Almost forgot. The Tidy Docs [sourceforge.net] will tell you to select "--bare" and "--word-2000" and I also recommend "--output-xhtml" and "--indent".
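
As a rough sketch of how those settings might be wired into a script: on Tidy's command line, configuration options are written as `--name value`, so the four options above become `--bare yes` and so on. The helper names `build_tidy_command` and `run_tidy` and the file names are made up for illustration, and the sketch assumes a `tidy` binary is on the PATH.

```python
import subprocess


def build_tidy_command(src, dst):
    """Return the argument list for running Tidy on a Word HTML export."""
    return [
        "tidy",
        "--bare", "yes",          # the "bare" option named in the comment above
        "--word-2000", "yes",     # strip the surplus markup Word adds to saved web pages
        "--output-xhtml", "yes",  # emit XHTML (drop this pair to stay with HTML)
        "--indent", "yes",        # indent block-level content for readability
        "-o", dst,                # write the cleaned markup to dst
        src,
    ]


def run_tidy(src, dst):
    """Run Tidy and return its exit code; raise only on hard errors.

    Tidy exits with 1 when it merely issued warnings and 2 on errors,
    so warnings are tolerated here.
    """
    result = subprocess.run(build_tidy_command(src, dst),
                            capture_output=True, text=True)
    if result.returncode >= 2:
        raise RuntimeError(result.stderr)
    return result.returncode
```

Calling `run_tidy("export.htm", "clean.html")` on a file saved as "Web Page (filtered)" would be the intended use; the function that builds the argument list is kept separate so the options can be inspected or adjusted without invoking Tidy.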
  • Convert to PDF (Score:1, Informative)

    by Anonymous Coward on Tuesday August 09, 2005 @06:01PM (#13282212)
    Adobe Acrobat PDF conversion preserves look

    Many free or cheap printing filters / converters available
  • Beautiful Soup (Score:1, Informative)

    by Anonymous Coward on Tuesday August 09, 2005 @06:02PM (#13282221)
    If you like Python, there is a library out there called Beautiful Soup [crummy.com] which can suck in ugly, malformed markup and give you a parse tree you can play with before dumping it back out to HTML.

    P.S. There is a Ruby Port [crummy.com] as well.

  • Re:Textism (Score:4, Informative)

    by e**(i pi)-1 ( 462311 ) on Tuesday August 09, 2005 @06:02PM (#13282222) Homepage Journal
    A standalone Perl script I use daily is demoroniser [fourmilab.ch].
  • Re:HTML Export (Score:5, Informative)

    by Marxist Hacker 42 ( 638312 ) * <seebert42@gmail.com> on Tuesday August 09, 2005 @06:05PM (#13282244) Homepage Journal
    Whew- I hoped I didn't have to post this 40 comments down in the thread. Yes, Office 2000 has the above tool- and Office 2002 or 2003 has it on the Save As menu. The option you want is "Web Page (filtered)|*.html". I saw an interview once with somebody on the Word development team, and he claimed that the original Save As HTML was built for passing Word Documents over the web- and never meant to be read by human beings as a web page at all. Web Page (filtered) cuts out all the extra shyte that Save As HTML used to put in for managing version-controlled updates and changing the font every bloody character- and builds a real web page.
  • fckeditor (Score:3, Informative)

    by mixmasterjake ( 745969 ) on Tuesday August 09, 2005 @06:10PM (#13282274)
    fckeditor [fckeditor.com] is an in-browser WYSIWYG editor. It has a "Paste from MS Word" button that actually strips out a lot of the unnecessary baggage. I don't know how well it handles embedded images or tricky layouts, but for the basic stuff it works well.

    The interface is similar to Word - maybe if you're lucky, you could get some of your content producers to use it.

  • HTML Tidy program (Score:5, Informative)

    by Todd Knarr ( 15451 ) on Tuesday August 09, 2005 @06:10PM (#13282278) Homepage

    One program I've had luck with is the HTML Tidy program at http://www.w3.org/People/Raggett/tidy/ [w3.org]. It seems to clean up code (particularly from Word) quite a bit.

  • by room101 ( 236520 ) on Tuesday August 09, 2005 @06:12PM (#13282293) Homepage

    Using a modern version of Word, output in WordML (XML format). Use an XSL stylesheet [antennahouse.com] to convert the WordML to FO (formatting objects).

    From there, do anything you want, like XHTML or PDF.

    Or just go to XHTML from WordML with some stylesheet. XSL is teh cool!

  • by netringer ( 319831 ) <maaddr-slashdot@NospaM.yahoo.com> on Tuesday August 09, 2005 @06:16PM (#13282312) Journal
    Net-It Central [netit.com] is the magical tool you were looking for. With that you can just point it at the file share with the Word Documents (and Excel and Power Point...) on it and see them indexed and cross linked on web pages. It'll update the content as the source docs change.

    Oh, you mean non-commercial magical tools?
  • Re:Dreamweaver (Score:3, Informative)

    by ozbon ( 99708 ) on Tuesday August 09, 2005 @06:21PM (#13282341) Homepage
    Yeah, I found that the "clean up word docs" did about 80% of the work, and then it was just a matter of a buttload of search/replace stuff in order to get it to finish the rest.

    Worked pretty well, once I'd got the search/replace stuff sussed out.

    Mind you, on a big word file you can think it's crashed when actually it's just doing lots of thinking...
  • Re:Scrapping (Score:3, Informative)

    by cloudmaster ( 10662 ) on Tuesday August 09, 2005 @06:21PM (#13282342) Homepage Journal
    For simplifying documents, I've found HTML::TreeBuilder to be a handy module. Then you just have to write code to simplify HTML (throw away useless tags, merge adjacent tags, etc), rather than worrying about reformatting word docs.
  • Re:Tidy Flags (Score:2, Informative)

    by Bogtha ( 906264 ) on Tuesday August 09, 2005 @06:27PM (#13282366)

    I also recommend "--output-xhtml"

    Why? XHTML isn't any better than HTML 4.01 for almost anybody, and it's less compatible.

  • Pagify (Score:3, Informative)

    by bckspc ( 172870 ) on Tuesday August 09, 2005 @06:32PM (#13282390) Homepage
    Pagify [backspace.com] is a Perl script I wrote to do this for another job. It's basically a series of regular expressions that: 1. purge all the proprietary XML gunk from the HTML file you save from Word; 2. chop the file into smaller files wherever a Heading 1 appears; 3. attach endnotes as footnotes to the appropriate pages. It's GPL'd, so go nuts.
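
To give a feel for the second of those steps, here is a minimal Python sketch of chopping an HTML string wherever a Heading 1 starts. It is not Pagify's code (Pagify is Perl, and its actual expressions are not shown in this thread); the function name `split_at_h1` is hypothetical, and the sketch assumes Word's headings were exported as literal `<h1>` tags.

```python
import re

# Zero-width lookahead: split *before* each <h1 ...> without consuming it.
# Splitting on an empty match needs Python 3.7 or later.
H1_START = re.compile(r"(?=<h1\b)", re.IGNORECASE)


def split_at_h1(html):
    """Return the chunks of html, each new chunk beginning at an <h1> tag."""
    return [chunk for chunk in H1_START.split(html) if chunk.strip()]
```

Each returned chunk could then be written to its own file; anything before the first heading comes back as a leading chunk of its own.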
  • Try this.... (Score:5, Informative)

    by mormop ( 415983 ) on Tuesday August 09, 2005 @06:32PM (#13282391)

    Demoroniser is, in the words of the author's own man page:

    A Perl script which corrects incompatible HTML generated by Microsoft applications. [fourmilab.ch]

    You can get it from the link in the same page. I must confess that I've not used it myself (don't use Office/Frontpage) but if it does what it says on the tin it should sort you out.

  • Re:Use antiword (Score:3, Informative)

    by cloudmaster ( 10662 ) on Tuesday August 09, 2005 @06:36PM (#13282409) Homepage Journal
    He's obviously got access to Word, and saving as text in Word at least preserves most of the whitespace-based formatting. IIRC, antiword is mostly useful as a last-ditch effort to read a word doc, a step above piping the doc through "strings". :)
  • Dreamweaver MX 2004 (Score:3, Informative)

    by greymond ( 539980 ) on Tuesday August 09, 2005 @06:45PM (#13282479) Homepage Journal
    1) Open Word
    2) Select All -> Copy
    3) Open Dreamweaver
    4) File -> New Html Doc
    5) Paste
    6) Commands -> Clean up Word Html
    7) Commands -> Apply Source Formatting (if you take the time to set the programs preferences to what you like)
    8) Done
    9) Drink beer
    10) Sleep
  • Re:DEMORONISER (Score:3, Informative)

    by Kelson ( 129150 ) * on Tuesday August 09, 2005 @06:51PM (#13282525) Homepage Journal
    The Demoroniser was nice in its time, but it assumes the output should be 7-bit ASCII, or ISO Latin-1 at best.

    The Unmoroniser is an updated version that handles Unicode properly and will do things like convert proprietary Windows-only curly quotes to the appropriate HTML4 entities instead of dropping them back to less accurate, typographically offensive straight quotes. Same with ligatures and other characters that the Demoroniser would munge instead of converting.
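
The idea of converting Windows-only characters to entities, rather than flattening them, can be sketched in a few lines of Python. This is not the Unmoroniser's own code; the function name `cp1252_to_entities` is invented here, and the sketch assumes the input bytes really are Windows-1252 and already contain HTML markup (so `&`, `<` and `>` are left alone).

```python
from html.entities import codepoint2name


def cp1252_to_entities(raw):
    """Decode Windows-1252 bytes and return pure-ASCII HTML text.

    Non-ASCII characters become named entities where HTML 4 defines one
    (curly quotes, dashes, the oe ligature, ...) and numeric character
    references otherwise.
    """
    # Bytes that Windows-1252 leaves undefined decode to U+FFFD.
    text = raw.decode("cp1252", errors="replace")
    out = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            out.append(ch)
        elif code in codepoint2name:
            out.append("&%s;" % codepoint2name[code])
        else:
            out.append("&#%d;" % code)
    return "".join(out)
```

For example, the bytes `b"\x93quoted\x94"` come out as `&ldquo;quoted&rdquo;`, keeping the curly quotes instead of degrading them to straight ones.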

    http://rheme.net/unmoroniser/ [rheme.net]
  • by FooAtWFU ( 699187 ) on Tuesday August 09, 2005 @07:02PM (#13282604) Homepage
    My SSH connection to my server still lives; I think my task was accomplished well enough. :)
  • Re:PDF? (Score:3, Informative)

    by FooAtWFU ( 699187 ) on Tuesday August 09, 2005 @07:16PM (#13282678) Homepage
    Not quite what I'm looking for. Maybe I should clarify: I want to remove the nonessential formatting, while keeping certain niceties (in particular, italics for the names of papers they reference, hyperlinks for footnotes, etc) and convert the rest into something simple and plain with just-the-basics of HTML, so I can then style it to match the other pages on the site. Many of these documents go to collections: encyclopedia articles, book reviews, abstracts of papers. If they don't look consistent, then people do complain. (And my site has enough formatting-consistency issues as it is ;)
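
One way to read that requirement is as a tag whitelist: keep a short list of structural tags, keep `href`/`name` on links so footnotes still work, and throw away every other tag and attribute while preserving the text. The following is a sketch of that idea using only Python's standard `html.parser`; the class name `WordScrubber`, the function `scrub`, and the particular whitelist are assumptions for illustration, not anything the submitter specified.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: adjust to whatever "just the basics" means for the site.
KEEP = {"p", "i", "em", "b", "strong", "a", "sup", "h1", "h2", "h3",
        "ul", "ol", "li", "blockquote"}
KEEP_ATTRS = {"a": {"href", "name"}}        # enough for footnote links
DROP_CONTENT = {"style", "script", "title"}  # discard these tags *and* their text


class WordScrubber(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
        elif tag in KEEP and not self.skip_depth:
            allowed = KEEP_ATTRS.get(tag, ())
            kept = "".join(' %s="%s"' % (k, escape(v or "", quote=True))
                           for k, v in attrs if k in allowed)
            self.out.append("<%s%s>" % (tag, kept))

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in KEEP and not self.skip_depth:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Text is always kept (re-escaped), unless inside a dropped element.
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def scrub(html):
    """Return html with only whitelisted tags and attributes left in place."""
    parser = WordScrubber()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Comments, including Word's conditional comments, fall through the parser's default no-op handler and disappear. Because the output carries no fonts, sizes or classes, the site's own stylesheet decides how everything looks, which is what makes the collections consistent.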
  • by Lemuridae ( 741647 ) on Tuesday August 09, 2005 @07:31PM (#13282764)
    From the AC above:

        1) get a copy of Word 2003
        2) "save as" an exemplar as XML
        3) write an XSLT to render it in a HTML with stylesheets etc as appropriate to your website
        4) for every document you get, "save as" XML with the XSLT from 3) as the transformation.
        5) publish

    I've been wondering how long until using XSLT and XML was suggested. XML is supposed to be a common data transport format but most of the other comments talk about starting with transformations to Word HTML. This is wrong because it assumes that the Word to HTML conversion will produce usable HTML in the first place which is a bad assumption.

    The solution suggested by the AC could be combined into a program that drives the entire process using the Word COM API to save to XML and then, for example, the MS Jet XSLT COM object model to automate the XML conversion. This could easily be maintained (eg: new Word formatting not previously encountered) with small changes to the XSLT.

    If the desire is to completely control the output without having control of the input then this is the best way to go. Yes, it's a bit of work but once you have a maintainable turn-key system you will save a lot of futzing with manual formatting. Use the power of XSLT.
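
To illustrate why starting from the XML is attractive, here is a small Python sketch that pulls paragraphs and italics straight out of a Word 2003 XML document. Note the substitution: Python's standard library has no XSLT processor, so this walks the tree with `xml.etree.ElementTree` instead of applying a stylesheet. The namespace URI and the element names (`w:p`, `w:r`, `w:rPr`, `w:i`, `w:t`) are my understanding of the Word 2003 format, the function name `wordml_to_html` is hypothetical, and the sample document is hand-written and far smaller than anything Word would save.

```python
import xml.etree.ElementTree as ET
from html import escape

# Assumed Word 2003 XML (WordML) namespace, in ElementTree's {uri}tag form.
W = "{http://schemas.microsoft.com/office/word/2003/wordml}"

SAMPLE = """<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:body>
    <w:p>
      <w:r><w:t>Review of </w:t></w:r>
      <w:r><w:rPr><w:i/></w:rPr><w:t>Some Title</w:t></w:r>
    </w:p>
  </w:body>
</w:wordDocument>"""


def wordml_to_html(xml_text):
    """Turn each w:p into a <p>, keeping italics and nothing else."""
    root = ET.fromstring(xml_text)
    paragraphs = []
    for p in root.iter(W + "p"):
        parts = []
        for run in p.iter(W + "r"):
            text = "".join(t.text or "" for t in run.iter(W + "t"))
            if not text:
                continue
            text = escape(text, quote=False)
            props = run.find(W + "rPr")
            # Simplification: any w:i element counts as italic.
            if props is not None and props.find(W + "i") is not None:
                text = "<i>%s</i>" % text
            parts.append(text)
        if parts:
            paragraphs.append("<p>%s</p>" % "".join(parts))
    return "\n".join(paragraphs)
```

Fonts, sizes and spacing never reach the output because the code only asks for the properties it cares about, which is the same control an XSLT template would give.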
  • Re:Dreamweaver (Score:5, Informative)

    by Anonymous Coward on Tuesday August 09, 2005 @07:36PM (#13282792)
    Also note that you have the ability to cut and paste formatted text from Word into the 'Design View' within Dreamweaver and DW will automatically reformat the incoming text appropriately. In my brief test to make sure I wasn't talking out my a** I found it even supports Word tables properly.
    If you paste text into the Code view, DW removes the formatting completely and just uses the raw text.
  • Re:Textism (Score:2, Informative)

    by bjdevil66 ( 583941 ) on Tuesday August 09, 2005 @07:38PM (#13282802)

    Textism looked pretty cool, so I tried it out with a typical .htm export (67K). However, it requires a subscription (Paypal payment) to process files larger than 20K. In my experience many .htm files pumped out by MS Office are larger than 20K, so I imagine the submitter may want to look elsewhere...

    With that said, if the people that have assigned the submitter with the web work want their employee to have the tools they need to do quality web work, they should pay for quality tools so the submitter can get the job done.

    Side note: The Dreamweaver "Command..." option suggested below just worked great on the same file.

  • Re:Dreamweaver (Score:2, Informative)

    by kderby2000 ( 898113 ) on Tuesday August 09, 2005 @08:14PM (#13282991)
    Macromedia Contribute might be a better tool. It's (basically) a web browser, where you surf to a page, hit "edit", make your changes, and it posts back to the site. You can also upload Word & Excel. A site can also be set up to honor style sheets, Dreamweaver templates, etc.
  • Re:Dreamweaver (Score:1, Informative)

    by Anonymous Coward on Tuesday August 09, 2005 @08:36PM (#13283100)
    My experience:

    I have DW MX on a PC and it doesn't do the table copying and pasting from Word.

    On my Mac I can copy tables from Word and paste them in to DW MX 2004 and they'll drop in nicely, minus all the funky Word stuff. DW will also attempt to keep logical formatting in good HTML. Which is nice.

    Alternatively, depending on the complexity of the Word docs, consider saving them all out as plain text and then going back to format correctly. If you have many tables and images this could be a pain, but if you have reams of text it may be much quicker.
  • RTF (Score:2, Informative)

    by yrte ( 833796 ) on Tuesday August 09, 2005 @08:49PM (#13283170)
    One option that can work for some situations is to export / save the file from .doc into .rtf (rich text format) and then use one of the free or pay RTF->HTML converters. I find using other software than Word to convert MSDOC -> RTF produces better results.

    Using that process has made preserving italics, bold, and special characters much easier for me and almost seems fully automatable.

    I've been using this method recently with some very simple search and replace, and I've been able to get good results.

     
  • by inphorm ( 604192 ) on Tuesday August 09, 2005 @08:54PM (#13283193) Homepage
    I personally use Dreamweaver for coding. I know, I know, all that GUI overhead and only semi-compliant code if it generates it itself, but it does have the useful Clean Up Word HTML tool, and then I get to work it over in pure code.

    works for me anyway..

    - paul

  • by sbma44 ( 694130 ) on Tuesday August 09, 2005 @10:18PM (#13283553)
    Are you using the telerik radeditor MCMS placeholder? It's free, and has capabilities that let you automatically strip out word formatting. In my experience it only sort of works... but it's better than nothing.

    You can also add an event handler for the updating event that does some regex tidying. Replacing the regex "]*>" will go a long way (better double-check that). You should be able to come up with a similar one for all the smarttag nonsense that gets inserted, too.
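
The regex quoted above did not survive posting intact, so it is not reconstructed here. As a stand-in, this Python sketch shows the general kind of regex tidying being described, using two patterns of my own: one for namespaced tags such as `<o:p>` and the `<st1:...>` smart tags, and one for `class="Mso..."` attributes. Both patterns and the function name `strip_word_markup` are assumptions, and regexes like these are a blunt instrument that should be checked against real documents.

```python
import re

# Any opening or closing tag whose name has a namespace prefix:
# <o:p>, </o:p>, <st1:place w:st="on">, ...
NAMESPACED_TAG = re.compile(r"</?[A-Za-z][A-Za-z0-9]*:[^>]*>")

# class attributes whose value starts with "Mso" (quoted or unquoted).
MSO_CLASS = re.compile(r"\s+class=(\"Mso[^\"]*\"|'Mso[^']*'|Mso[^\s>]*)")


def strip_word_markup(html):
    """Remove namespaced Office tags and Mso* class attributes, keeping text."""
    html = NAMESPACED_TAG.sub("", html)
    html = MSO_CLASS.sub("", html)
    return html
```

Only the tags are removed, so the text inside a smart tag stays in the document; ordinary links such as `<a href="http://...">` are untouched because the colon there is not part of the tag name.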

    Still, Word formatting remains a major bane to my existence. Good luck.
  • Re:Dreamweaver (Score:2, Informative)

    by KevinH456 ( 564212 ) on Tuesday August 09, 2005 @10:43PM (#13283646) Homepage Journal
    Dreamweaver MX 2004 goes one step farther and allows you to copy and paste Word documents INCLUDING clip art and drawn images and graphs. It will automatically take all that and make it decent markup. It even picks up formatting. It strips all the Word-specific crap for you and then you can just format it as you like (using a stylesheet of course to make your life so much easier). From here you have nice HTML you can import into a CMS or code you can insert in your HTML template.

    We do this where I work. My job involves taking hundreds of Word documents from professors and formatting them for online coursework at a major university in Florida. Dreamweaver has made my life so much easier.
  • Re:Textism (Score:3, Informative)

    by deepestblue ( 206649 ) <slashdot@FORTRAN ... t minus language> on Tuesday August 09, 2005 @11:05PM (#13283740)
    The demoronizer is b0rk3n. See http://www.unicode.org/faq/unicode_web.html#2 [unicode.org]
  • by Futurepower(R) ( 558542 ) on Tuesday August 09, 2005 @11:46PM (#13283910) Homepage

    HTML Tidy cleans HTML, and has a special function for cleaning Word HTML junk.

    It must be terrible to work at Microsoft and always do mediocre work.

    --
    If you support dishonesty and violence [doonesbury.com], don't say you are Christian.
  • Try PureText (Score:1, Informative)

    by ZoomieDood ( 778915 ) on Wednesday August 10, 2005 @01:43AM (#13284289)
    See this site [stevemiller.net] for PureText.
  • wvHtml (Score:3, Informative)

    by itomato ( 91092 ) on Wednesday August 10, 2005 @09:45AM (#13285657)
    http://wvware.sourceforge.net/ [sourceforge.net]

    From the sourceforge page:

    wv Utilities

    Provided with the wv distribution is an application called wvWare. wvWare is a "power-user" application with lots of command-line options, doo-dads, bells, and whistles. Less interesting, but more convenient, are the helper scripts that use wvWare. These are:

            * wvHtml: convert your Word document into HTML4.0

    (There are more utilities for LaTeX, etc.)


    I'm using this to convert all of our internal documentation. It does a pretty good job, even converts the images and acts in a relatively reliable manner with 2003, 2000, & 97 formatted files. There's some oddball output sprinkled in, but nothing a little sed fanciness can't fix.