Sanely Moving from Word to the Web?

Posted by Cliff on Tuesday August 09, 2005 @05:50PM from the tag-soup dept.

FooAtWFU asks: "I have a job for a web site (no link for you, Slashdot hordes!). A lot of it is systems administration and development, but I have to routinely post content which comes from a myriad of other sources. Usually they are from academic users, come in Word format, and ultimately need to be posted in HTML. The problem is that Word has all sorts of tricks up its sleeve to throw off the font, layout, size, and so forth. To achieve any sort of visual consistency on the site these various formatting tags all need to be scrubbed, but even using other office suites with better HTML export (OpenOffice.Org) to do the dirty work, it's often easier to recreate the formatting by hand from a plain-text version than it is to clean up a sea of messy tags. Does anyone have any advice (or magical tools) to help me deal with this sort of tedious cleanup?"

Scrapping (Score:5, Interesting)

by fembots ( 753724 ) writes: on Tuesday August 09, 2005 @05:51PM (#13282117) Homepage

Interestingly, I have a similar job on a website (no link for you too, Slashdot hordes!), here's what I do (I'm sure there are smarter ways):

1. Place all "to-process" documents in a specific folder in a webserver
2. Write a script to read those documents
3. Use Regex (and similar functions) to strip off and/or replace specific tags/wordings (similar to a web scraping technique).

Admittedly it was a tedious job at first to identify every possible template; however, I'm amazed how predictable some documents are, and once you get hold of such a "blueprint", you can reformat documents to HTML/XML fairly easily.

Once the changes are done, I then preview them in a browser, and if everything's as expected, I simply save the page and use it; if not, it's easy enough to make a few tweaks from the familiar HTML environment.
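
Roughly what the three steps above might look like in Python. This is only a sketch: it assumes the documents have already been saved as HTML, and the patterns, the `scrub` and `scrub_folder` names, and the folder layout are all made up for illustration. A real "blueprint" would have to be built per template.

```python
import re
from pathlib import Path

# Hypothetical patterns for typical Word export clutter; not an exhaustive list.
SCRUB_PATTERNS = [
    (re.compile(r"</?font\b[^>]*>", re.IGNORECASE), ""),   # <font ...> and </font>
    (re.compile(r"</?o:p\b[^>]*>", re.IGNORECASE), ""),    # Word's <o:p> markers
    (re.compile(r'\s+(?:style|class|lang)="[^"]*"', re.IGNORECASE), ""),  # inline attributes
    (re.compile(r"</?span\b[^>]*>", re.IGNORECASE), ""),   # spans left with nothing to do
]


def scrub(html):
    """Apply every pattern in order and return the cleaned markup."""
    for pattern, replacement in SCRUB_PATTERNS:
        html = pattern.sub(replacement, html)
    return html


def scrub_folder(inbox, outbox):
    """Read each *.html file in inbox and write a scrubbed copy to outbox."""
    inbox, outbox = Path(inbox), Path(outbox)
    outbox.mkdir(parents=True, exist_ok=True)
    for source in sorted(inbox.glob("*.html")):
        text = source.read_text(encoding="utf-8", errors="replace")
        (outbox / source.name).write_text(scrub(text), encoding="utf-8")
```

The scrubbed copies go to a separate folder so they can be previewed in a browser before anything is published.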

- Sounds like you should release on sourceforge (Score:5, Interesting)
  
  by arete ( 170676 ) writes: <xigarete+slashdot@nosPam.gmail.com> on Tuesday August 09, 2005 @05:53PM (#13282136) Homepage
  
  So if there's only a few templates and they were a pain to work out, how about releasing your regex scripts to sourceforge or similar? Or posting here?
  
  - Re:Sounds like you should release on sourceforge (Score:3, Insightful)
    
    by Anonymous Coward writes:
    
    Since they were produced at work, the copyright on them is probably owned by the company and not by him.
    
    Plus the templates are probably in-house templates and thus would be useless outside of the company.
  - Re:Sounds like you should release on sourceforge (Score:3, Funny)
    
    by extrasolar ( 28341 ) writes:
    
    See, if we were really elite, we would all automatically know how to do stuff like this with our favorite editor. There would be no Ask Slashdot. There's a reason why emacs is one of the most popular editors around and it's because it saves us from having to do this kind of repetitive text work that should be done autonomously.
    
    But we're not elite and I'm now going to learn how to do macros in emacs :)
  - - Actually, an NDA probably doesn't matter. (Score:3, Interesting)
      
      by arete ( 170676 ) writes:
      
      In all likelihood an NDA doesn't cover obvious works like this - anything that could be reasonably discovered publicly. Doubtless he couldn't post the _documents_ that he converted.
      
      However, I am also not willing to just assume that no company would ever consider letting someone sourceforge a script like this. It is 1) worth good advertising and 2) clearly not important enough to be worth selling. Release it in the company's name, or not depending on what they prefer.
      
      At a minimum a lot of small companies
      - Re:Actually, an NDA probably doesn't matter. (Score:5, Funny)
        
        by mrchaotica ( 681592 ) writes: on Tuesday August 09, 2005 @10:00PM (#13283485)
        
        Doubtless he couldn't post the _documents_ that he converted.
        
        You realize he was converting them for the purpose of putting them on a website, right? ; )
        
        
        Re:Actually, an NDA probably doesn't matter. (Score:3, Insightful)
        
        by Evil Grinn ( 223934 ) writes:
        
        You realize he was converting them for the purpose of putting them on a website, right? ; )
        
        Did he say the website was public?
  - - Trying this again (Score:3, Interesting)
      
      by einhverfr ( 238914 ) writes:
      
      The script (decss.sed) is:
      s:STYLE=\"[ a-zA-Z0-9\:;-]*\"::Ig
      s:</FONT>::Ig
      s:<FONT[ -=\"A-Z0-9]*>::Ig
      s:BORDER=[0-9]*::Ig
      s:ALIGN=BOTTOM::g
- Re:Scrapping (Score:3, Insightful)
  
  by dougmc ( 70836 ) writes:
  
  Use Regex (and similar functions) to strip off and/or replace specific tags/wordings (similar to a web scraping technique).
  Of course, you really can't properly parse html just using regular expressions. You can get it right 90% of the time relatively quickly, and a day or so work will get you 5% more, but you could spend weeks trying to get that last 5% -- and never quite get it.
  It's really better to use things that other people have made for parsing html. For example, if you use perl (and you shou
  - Re:Scrapping (Score:3, Informative)
    
    by cloudmaster ( 10662 ) writes:
    
    For simplifying documents, I've found HTML::TreeBuilder to be a handy module. Then you just have to write code to simplify HTML (throw away useless tags, merge adjacent tags, etc), rather than worrying about reformatting word docs.
- - Re:OpenOffice 2.0-beta "save-as" and "export" grea (Score:3, Interesting)
    
    by sonamchauhan ( 587356 ) writes:
    
    Just expanding on your suggestion...
    
    Perhaps he could use the OpenOffice API to automatically have a server-side instance of OpenOffice open submitted Word documents and save them as HTML. This should happen at the same time the user uploads the document - that way the user could preview the conversion to HTML, and if it was flawed, he could choose to publish the document as PDF.
    
    OpenOffice API:
    http://api.openoffice.org/ [openoffice.org]
    
    Code snippet shows simplicity of converting OpenOffice Writer SXW document
Handy alternative to Notepad (Score:2)

by bigwavejas ( 678602 ) * writes:

I use a program called, "EditPlus" http://www.editplus.com/ [editplus.com] It has syntax highlighting for the most common extensions (HTML, CSS, PHP, ASP, Perl, C/C++, Java, JavaScript and VBScript).
What I basically do is paste the document into EditPlus, then I use a function called "Replace" to get rid of the big stuff and edit out the rest of the tags manually. It may not be the best solution, but it's visually easier than just using notepad.
- - Re:Handy alternative to Notepad (Score:2)
    
    by maotx ( 765127 ) writes:
    
    I've personally always enjoyed Scite. [google.com]
    Free, os, macros, cross compatible (Linux, Windows), and recognizes syntax from multiple languages.
PDF? (Score:2, Insightful)

by Anonymous Coward writes:

How about Word -> PDF -> HTML?

Just a thought ... and probably a dumb one.
Tedious cleanup? (Score:2, Funny)

by Timesprout ( 579035 ) writes:

Sounds like a job for Mrs FooAtWFU
PDF? (Score:3, Insightful)

by night_flyer ( 453866 ) writes: on Tuesday August 09, 2005 @05:52PM (#13282128) Homepage

you can either hot link to the .doc themselves in a new window or convert to .pdf and do the same thing

- Re:PDF? (Score:3, Funny)
  
  by cloudmaster ( 10662 ) writes:
  
  I hate you and your kind. Yes, hate. :)
- Re:PDF? (Score:3, Interesting)
  
  by ImaLamer ( 260199 ) writes:
  
  "Print to PDF" seems to be the function that would solve all of these problems, but so would any others. Think you *could* print to a TIFF, PDF, virtually any image type with a *nix Word compatible program - then you can scan the image and OCR it to plain text. Antiword (mentioned by another /.er: http://www.winfield.demon.nl/ [demon.nl]) can convert DOC to plain text... there are thousands of options.
  
  However, if someone is getting the idea for another open source project to solve this dilemma then I'd suggest somethin
- Re:PDF? (Score:3, Informative)
  
  by FooAtWFU ( 699187 ) writes:
  
  Not quite what I'm looking for. Maybe I should clarify: I want to remove the nonessential formatting, while keeping certain niceties (in particular, italics for the names of papers they reference, hyperlinks for footnotes, etc) and convert the rest into something simple and plain with just-the-basics of HTML, so I can then style it to match the other pages on the site. Many of these documents go to collections: encyclopedia articles, book reviews, abstracts of papers. If they don't look consistent, then people
  - - Re:PDF? (Score:3, Insightful)
      
      by FooAtWFU ( 699187 ) writes:
      
      The people we're dealing with here are not social sciences people, specifically economics. I'd be perfectly fine with taking DocBook or TeX documents- but nobody's going to send them. It's not happening. We accept Word documents because we have ALWAYS accepted Word documents and most contributors probably aren't even aware that something like TeX or DocBook even exists, let alone how to use it. And they're not willing to learn it just to send us stuff.
Use antiword (Score:2, Informative)

by shura57 ( 727404 ) * writes:

It takes Word file and spits out plain text. It can also do some more tricks.
- Re:Use antiword (Score:2)
  
  by uberdave ( 526529 ) writes:
  
  I tried it before, but it merely spit out keystrokes and mouse gestures.
  
  Oh wait! You're serious [demon.nl].
- Re:Use antiword (Score:3, Informative)
  
  by cloudmaster ( 10662 ) writes:
  
  He's obviously got access to Word, and saving as text in Word at least preserves most of the whitespace-based formatting. IIRC, antiword is mostly useful as a last-ditch effort to read a word doc, a step above piping the doc through "strings". :)
Dreamweaver (Score:5, Informative)

by necro2607 ( 771790 ) writes: on Tuesday August 09, 2005 @05:52PM (#13282131)

I would suggest using Macromedia Dreamweaver... it's what we use where I work and essentially all of our content entry involves pasting in content from Word documents supplied by clients. Dreamweaver is pretty good for formatting and working with stylesheets.

- Re:Dreamweaver (Score:5, Informative)
  
  by fean ( 212516 ) writes: on Tuesday August 09, 2005 @05:56PM (#13282172) Homepage
  
  In Dreamweaver, there's a command "Clean up MS Word HTML". It's made to clean up Word's crappy html, and does a pretty nice job of it.
  
  - Re:Dreamweaver (Score:5, Informative)
    
    by Anonymous Coward writes: on Tuesday August 09, 2005 @07:36PM (#13282792)
    
    Also note that you have the ability to cut and paste formatted text from Word into the 'Design View' within Dreamweaver and DW will automatically reformat the incoming text appropriately. In my brief test to make sure I wasn't talking out my a** I found it even supports Word tables properly.
    If you paste text into the Code view, DW removes the formatting completely and just uses the raw text.
    
- Re:Dreamweaver (Score:4, Funny)
  
  by drmike0099 ( 625308 ) writes: on Tuesday August 09, 2005 @06:16PM (#13282315)
  
  The OP is correct: 1) Open Dreamweaver. 2) Commands > Clean Up Word HTML... 3) Rejoice
  
- - Re:Dreamweaver (Score:3, Informative)
    
    by ozbon ( 99708 ) writes:
    
    Yeah, I found that the "clean up word docs" did about 80% of the work, and then it was just a matter of a buttload of search/replace stuff in order to get it to finish the rest.
    
    Worked pretty well, once I'd got the search/replace stuff sussed out.
    
    Mind you, on a big word file you can think it's crashed when actually it's just doing lots of thinking...
- - Re:Dreamweaver (Score:3, Insightful)
    
    by necro2607 ( 771790 ) writes:
    
    We use Dreamweaver exclusively for HTML/CSS programming and it doesn't create code that doesn't work - in fact the code it creates is very very compatible: I've never had to manually tweak code to obtain proper design and layout with the sites I've built.
    
    Indeed we'd love to move to advanced CSS for page formatting but that's a big step right now - there are no professional WYSIWYG editors that have the sheer range and quality of features we need - Page templates, ability for clients to update the site later
Antiword (Score:3, Informative)

by alanp ( 179536 ) writes: on Tuesday August 09, 2005 @05:53PM (#13282135) Homepage

Try antiword, it's got a real decent HTML option.

Textism (Score:5, Informative)

by NoInfo ( 247461 ) * writes: on Tuesday August 09, 2005 @05:53PM (#13282137) Homepage Journal

Here's a tool I saw linked off of O'Reilly Radar once:

http://textism.com/wordcleaner/ [textism.com]

I used it once and it did a pretty decent job at preserving the tables. Yet if they're using anything odd like graphics or it's been incredibly tweaked, it probably won't be 100% perfect.

- Re:Textism (Score:4, Informative)
  
  by e**(i pi)-1 ( 462311 ) writes: on Tuesday August 09, 2005 @06:02PM (#13282222) Homepage Journal
  
  A standalone Perl script I use daily is demoronizer [fourmilab.ch].
  
  - Re:Textism (Score:3, Informative)
    
    by deepestblue ( 206649 ) writes:
    
    The demoronizer is b0rk3n. See http://www.unicode.org/faq/unicode_web.html#2 [unicode.org]
One suggestion (Score:4, Funny)

by Da Fokka ( 94074 ) writes: on Tuesday August 09, 2005 @05:53PM (#13282138) Homepage

You might consider a pack of monkeys and typewriters. They can ultimately reproduce Shakespeare so maybe, maybe they might be able to properly reformat the HTML gibberish Word produces.
Of course, you could also outsource to India but that's unethical to both the monkeys and the American economy.

- Re:One suggestion (Score:3, Funny)
  
  by Anonymous Coward writes:
  
  It's hard to find qualified monkeys - most of them already have jobs editing /. and cnn.com...
One Word... (Score:5, Funny)

by ScentCone ( 795499 ) writes: on Tuesday August 09, 2005 @05:53PM (#13282140)

..."Intern"

- Re:One Word... (Score:5, Funny)
  
  by Cerdic ( 904049 ) writes: on Tuesday August 09, 2005 @06:00PM (#13282198)
  
  No, no, no...
  
  Usually they are from academic users
  
  It sounds like this might be a university environment. The correct answer should be grad students.
  
  - Re:One Word... (Score:2)
    
    by ScentCone ( 795499 ) writes:
    
    The correct answer should be grad students
    
    Or better yet, exchange students. The mythical tri-lingual graduate exchange students are, of course, ideal. They can work on the HTML and translate the content while they're at it. All your footnote are belong to us, etc.
PDF? (Score:2)

by nizo ( 81281 ) * writes:

Perhaps offer every document as a pdf (there are plenty of conversion tools out there, such as ps2pdf, which you can use after printing the document to a postscript file), as well as offer it in whatever format was sent to you?
OpenOffice.Org, but not HTML export (Score:2)

by KillerBob ( 217953 ) writes:

OpenOffice.Org supports the ability to export a document as PDF. As you probably know, PDF viewers are available for all mainstream OSes, including Linux, from Adobe themselves.

Unless you're dealing with content that has to be accessed or updated frequently, then PDF is the way to go.
- - Re:OpenOffice.Org, but not HTML export (Score:2)
    
    by Maxo-Texas ( 864189 ) writes:
    
    Yes that is what the article says- but like many users the question is why is he requiring HTML? It may be that using PDF would solve his problem even tho it does not answer his question.
Change the Model? (Score:2)

by Stanistani ( 808333 ) writes:

Could you provide forms on your website for your academic users to submit the information directly?
- Re: (Score:3, Interesting)
  
  by account_deleted ( 4530225 ) writes:
  
  Comment removed based on user account deletion
HTML Export (Score:3, Informative)

by electroniceric ( 468976 ) writes: on Tuesday August 09, 2005 @05:55PM (#13282160)

If you're using Office 2000, you can find the HTML filter here:

http://www.microsoft.com/downloads/details.aspx?FamilyID=209ADBEE-3FBD-482C-83B0-96FB79B74DED&displaylang=EN [microsoft.com]

I believe this functionality is built into later versions of Word.

Per the site, this produces simpler HTML with Office-specific tags removed. With that done, you could probably use a Perl script, and you might also try writing some Word macros or COM/VBA scripts that clean up the document from within Word.

- Re:HTML Export (Score:5, Informative)
  
  by Marxist Hacker 42 ( 638312 ) * writes: <seebert42@gmail.com> on Tuesday August 09, 2005 @06:05PM (#13282244) Homepage Journal
  
  Whew- I hoped I didn't have to post this 40 comments down in the thread. Yes, Office 2000 has the above tool- and Office 2002 or 2003 has it on the Save As menu. The option you want is "Web Page (filtered)|*.html". I saw an interview once with somebody on the Word development team, and he claimed that the original Save As HTML was built for passing Word Documents over the web- and never meant to be read by human beings as a web page at all. Web Page (filtered) cuts out all the extra shyte that Save As HTML used to put in for managing version controlled updates and changing the font every bloody character- and builds a real web page.
  
- Re:HTML Export (Score:2)
  
  by MemeRot ( 80975 ) writes:
  
  I second this suggestion. It does a great job of producing comprehensible html out of the normal word mess. Now why this isn't the default option I'm at a loss to explain.
  
  Dreamweaver and the like may do a better job, but if you don't want to buy a tool like that for the occasional content paste, then this technique eliminates at least 80% of the work.
Dreamweaver (Score:5, Informative)

by SlashChick ( 544252 ) * writes: <erica@noSpam.erica.biz> on Tuesday August 09, 2005 @05:55PM (#13282162) Homepage Journal

Check out Commands -> Clean Up Word HTML in Dreamweaver. It does a nice job of getting rid of extraneous tags. While you're at it, take a look at Commands -> Apply Source Formatting as well. This can be customized to your specifications in the preferences section, and automatically tabs out, adds newlines, and converts tags to lowercase where appropriate in the HTML document. Dreamweaver is the closest thing I know of to a program that "automatically" cleans up Word HTML.

Good luck!

HTML Tidy (Score:5, Informative)

by N8F8 ( 4562 ) writes: on Tuesday August 09, 2005 @05:56PM (#13282166)

Save the Word document as filtered HTML and pipe the HTML through HTML Tidy [sourceforge.net]. Nice clean HTML.
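
For anyone scripting this step, here is a minimal Python sketch. It assumes a `tidy` executable is on the PATH and writes the four options recommended in the Tidy Flags reply in Tidy's `--option value` form; the function names and file names are invented for the example.

```python
import shutil
import subprocess


def tidy_command(source_path, output_path):
    """Build the argument list for one Tidy run over a filtered-HTML export."""
    return [
        "tidy",
        "--bare", "yes",          # strip Microsoft-specific markup
        "--word-2000", "yes",     # extra cleanup for Word 2000 exports
        "--output-xhtml", "yes",  # write XHTML
        "--indent", "yes",        # indent nested elements
        "-o", output_path,        # write to a file instead of stdout
        source_path,
    ]


def run_tidy(source_path, output_path):
    """Run Tidy and return its exit status (0 clean, 1 warnings, 2 errors)."""
    if shutil.which("tidy") is None:
        raise FileNotFoundError("tidy executable not found on PATH")
    # Word exports almost always trigger warnings, so do not treat
    # a non-zero exit status as a failure automatically.
    return subprocess.run(tidy_command(source_path, output_path)).returncode
```

Something like `run_tidy("paper_filtered.html", "paper_clean.html")` would then do one document; if you'd rather have HTML 4.01 than XHTML, drop the `--output-xhtml` pair.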

- Tidy Flags (Score:5, Informative)
  
  by N8F8 ( 4562 ) writes: on Tuesday August 09, 2005 @06:00PM (#13282206)
  
  Almost forgot. The Tidy Docs [sourceforge.net] will tell you to select "--bare" and "--word-2000" and I also recommend "--output-xhtml" and "--indent".
  
  - Re:Tidy Flags (Score:2, Informative)
    
    by Bogtha ( 906264 ) writes:
    
    I also recommend "--output-xhtml"
    
    Why? XHTML isn't any better than HTML 4.01 for almost anybody, and it's less compatible.
wvWare (Score:2)

by dougmc ( 70836 ) writes:

wvWare [wvware.sou] is what you want.
It can programmatically convert most Word documents into html documents, and does about as good a job as one could expect. And it makes better html than Word does itself.
- Re:wvWare (Score:2)
  
  by dougmc ( 70836 ) writes:
  
  I apparently buggered the link. It's [sourceforge.net] http://wvware.sourceforge.net/ [sourceforge.net].
  javester said that it's not being maintained anymore, but I do see some recent updates at the Sourceforge page [sourceforge.net]. If it's not being actively developed anymore, it would seem to be because it works pretty well now.
  In any event, I've found it to work very well for my use.
no link for you, Slashdot hordes! (Score:5, Informative)

by SeanTobin ( 138474 ) * writes: <byrdhuntr AT hotmail DOT com> on Tuesday August 09, 2005 @05:56PM (#13282174)

Hmmm... sounds like a challenge to me. Let's see what we can dig up.

Step 1: Let's look at his user page [slashdot.org]

Ahh! He put in a website with his profile. Let's all go and check out http://fennec.homedns.org/ [homedns.org]

Hmm... looks like a personal page. Not too sure what to make of the comic. Anyway, let's move on to..

Step 2: Let's look at his author [wfu.edu] page. Some interesting stuff here, including three separate e-mail addresses (which I won't post here. You're welcome :)

A-ha! There is a link to his employer! It's Economic History Services [eh.net]. And what do you know... there are a significant number of pages (especially under abstracts and book reviews) that seem to come straight out of a word processor, only with extensive cleaning. A quick look at the source reveals something interesting. It's clean. Very clean. We're talking on the level of I-use-vim-for-my-webpage-editor clean. Nice job.

Anyway, it looks like it was done by hand. I'm not saying it's not good work (quite to the contrary), but I can see your need for an automated solution.

- Re:no link for you, Slashdot hordes! (Score:5, Funny)
  
  by Dunbal ( 464142 ) writes: on Tuesday August 09, 2005 @06:04PM (#13282237)
  
  Which only goes to show:
  
  There is NO WAY the slashdot effect can be avoided. Resistance is futile...
  
  - Re:no link for you, Slashdot hordes! (Score:4, Informative)
    
    by FooAtWFU ( 699187 ) writes: on Tuesday August 09, 2005 @07:02PM (#13282604) Homepage
    
    My SSH connection to my server still lives; I think my task was accomplished well enough. :)
    
- - Re:no link for you, Slashdot hordes! (Score:3, Interesting)
    
    by FooAtWFU ( 699187 ) writes:
    
    actually, I'm quite all right. At first I was a trifle worried when I saw that my machine's load was a little high and the story relatively new, but then I realized that it was just running pisg to generate channel statistics for #wikipedia [homedns.org]. It's a beefy server on a fast line, really; I don't anticipate any issues if I can hide way down in the comments page instead of in the fine summary...
  - Re:no link for you, Slashdot hordes! (Score:5, Funny)
    
    by jalefkowit ( 101585 ) writes: <jasonNO@SPAMjasonlefkowitz.com> on Tuesday August 09, 2005 @08:51PM (#13283182) Homepage
    
    This just seems like a perfect opportunity for the application of a little common sense along with just a hint of courtesy.
    
    You must be new here.
    
  - Common... what? (Score:3, Funny)
    
    by DragonHawk ( 21256 ) writes:
    
    "... a perfect opportunity for the application of a little common sense..."
    
    What is this "common sense" of which you speak? Where may I download it from?
HTML Tidy (Score:3, Informative)

by John_Booty ( 149925 ) writes: <johnbooty@booty p r o j e c t . o rg> on Tuesday August 09, 2005 @05:59PM (#13282191) Homepage

HTML Tidy has a special mode for cleaning up Word's crappy HTML export. HTML Tidy is a free command-line tool that is also embedded in a lot of popular HTML editors.

HTML Tidy:
http://tidy.sourceforge.net/ [sourceforge.net]
HTML Kit (great integration with HTML Tidy; it includes HTML Tidy so you can just grab HTML Kit without grabbing HTML Tidy)
http://www.chami.com/html-kit/ [chami.com]

Countless other editors integrate with HTML Tidy as well. Have fun and good luck!

Get it in PDF first. (Score:3, Interesting)

by frostman ( 302143 ) writes: on Tuesday August 09, 2005 @06:01PM (#13282210) Homepage Journal

I'm assuming you have the right to republish the Word documents. I'm also assuming you have no control over how many Word-specific formatting features are used by the authors.

What I would do in your shoes is set up a (mostly) automated system to convert the Word files to PDF. You can buy Acrobat or you can go with a third-party, printer-driver-style converter, but in the end you'll probably save more headaches just using Acrobat.

Once you have a document in PDF, you can use any of the numerous (free and commercial) tools to convert that to HTML, text, whatever - all much more reliably than from Word directly. It's not perfect, but it's probably the closest you'll get.

Plus, you can post the PDFs themselves for download in case someone wants them - and at least Google will still happily index your PDFs.

Yes, you'll probably have to live with some NT variant to get that part done (though it might work with OSX) - but it's most likely your fastest path to *quality* conversions.

Homesite (Score:2)

by Hank Chinaski ( 257573 ) writes:

Homesite has a function to import and clean Word Documents.
HTML parsing (Score:2)

by ChiralSoftware ( 743411 ) writes:

If you really want to do it right, use an HTML parser to extract the content, and then re-render it. That's exactly what our mobile search [mwtj.com] engine does to convert web pages to mobile pages. It's non-trivial stuff. The advantage of doing it is that you do end up with clean, uniform HTML (or WML or XHTML in our case).
Some future version of Tomcat [apache.org] should have built-in content parsing in its filters so that filter writers could write simple filters to transform content in a meaningful way. But I haven't see
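
To give a feel for the parse-then-re-render idea, here is a small sketch using only Python's standard-library `html.parser`. It is not our engine's code; the whitelist, the `Simplifier` class and the `simplify` function are assumptions made for the example. It keeps a handful of structural tags (plus `href` on links), drops every other tag and attribute, and discards `<style>` and `<script>` content entirely.

```python
from html import escape
from html.parser import HTMLParser

# Assumed whitelist: tags worth keeping; everything else is dropped.
KEEP_TAGS = {"p", "br", "i", "em", "b", "strong", "a", "ul", "ol", "li",
             "h1", "h2", "h3", "table", "tr", "th", "td"}
KEEP_ATTRS = {"a": {"href"}}          # attributes allowed per tag
VOID_TAGS = {"br"}                    # tags that never get a closing tag
SKIP_CONTENT = {"style", "script"}    # drop these tags and their text


class Simplifier(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # greater than 0 while inside <style>/<script>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in KEEP_TAGS:
            allowed = KEEP_ATTRS.get(tag, set())
            kept = "".join(
                ' {}="{}"'.format(name, escape(value or "", quote=True))
                for name, value in attrs if name in allowed
            )
            self.out.append("<{}{}>".format(tag, kept))

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in KEEP_TAGS and tag not in VOID_TAGS:
            self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(escape(data, quote=False))


def simplify(html):
    """Re-render html keeping only whitelisted tags and attributes."""
    parser = Simplifier()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Because the output is rebuilt from parser events rather than patched with substitutions, whatever odd attributes or Office-specific tags come in, only whitelisted markup comes out.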
use kword (Score:2)

by 0xABADC0DA ( 867955 ) writes:

I had to do this recently, but to print a 20-page document as 12 pages by removing page breaks. I found that kword using the html export filter and setting it to "HTML 4.01 + Light (strict xhtml)" was the best mode. This doesn't use the style sheets and just converts to basic html.... no fancy positioning or fonts, just some headers and basic styles. This was using kword 1.4.1 / kde 3.4.2 btw.

Everything else I tried sucked, including OO.o's export.
Resign from your executive position (Score:3, Interesting)

by Fastball ( 91927 ) writes: on Tuesday August 09, 2005 @06:03PM (#13282227) Journal

What is it with executives and directors and their fixation with sending simple memos and messages via Word attachments in e-mails? Everybody else is on board with plain text (except some folks who are smitten with font coloring). Why can't the dolts at the top of the totem pole type in their mail client's editor and hit "Send?"

- Re:Resign from your executive position (Score:5, Insightful)
  
  by dougmc ( 70836 ) writes: <dougmc+slashdot@frenzied.us> on Tuesday August 09, 2005 @06:11PM (#13282281) Homepage
  
  Everybody else is on board with plain text
  
  I don't know where you live/work, but out here in the real world, not everybody is on board with plain text. Not anymore.
  I use mutt and fetchmail in a company of Exchange users. Almost every email I get at work now, from everybody, is in html. (Unless I sent it to myself.) I don't like it, but I deal with it. It's certainly easier to deal with it than to try and change everybody else.
  I could change jobs, but over something as trivial as html emails? No. I like my job, I like the people I work with, so I just bend like the reed in the wind ...
  Still, the executives are certainly worse about email etiquette than most, and it's not just in this company -- everywhere I've worked I've found this to be the case. They don't include Subjects at all, or include useless ones like `message'. Some will type up a memo and send it as a .pdf file attachment, or worse as a .bmp file. They rarely trim anything when responding to a post -- they just top post away. (But many people do that ...)
  
  - Re:Resign from your executive position (Score:3, Insightful)
    
    by khakipuce ( 625944 ) writes:
    
    I use mutt and fetchmail in a company of Exchange users
    isn't exactly "bending like a reed in the wind" -using a graphical mail client would be.
    Why do you use mutt and fetchmail? Why? Why? Why? Just about everywhere I have worked it has been easier (and often there is no choice) to just use what they use rather than trying to be clever or different. It is good to gain wide experience and it is good to have the flexibility to use the tools at hand.
- Amen (Score:5, Funny)
  
  by Quadraginta ( 902985 ) writes: on Tuesday August 09, 2005 @06:26PM (#13282357)
  
  Jesus, tell me about it. I get 30kb attachments merely saying "Got your email, thanks!" with "thanks" done up in some odd curly red font and a six-line sig, not to mention the twenty-seven 8x10 colored glossy JPG attachments with circles and arrows and a paragraph on the back of each one...
  
- Re:Resign from your executive position (Score:5, Funny)
  
  by VGR ( 467274 ) writes: on Tuesday August 09, 2005 @07:05PM (#13282623)
  
  You think that's bad?
  
  I was given 61 screenshots (blithely dubbed "program requirements"), each its own Word document. Each containing only a (weirdly scaled) picture, of course.
  
  61 Word documents.
  
- Re:Resign from your executive position (Score:3, Funny)
  
  by Detritus ( 11846 ) writes:
  
  Because it isn't a "real memo" unless it is printed on company letterhead, formatted according to the company's style guide.
- Re:Resign from your executive position (Score:3, Insightful)
  
  by advocate_one ( 662832 ) writes:
  
  because Microsoft put that "Oh, so convenient", send as email entry in the File menu of ms-word.... that's why... then the top of the totem pole dunderheads don't have to go to the trouble of firing up their email client and creating a message and then finding the file to attach it...
  and as another poster has suggested, perhaps it's the quality department to blame as a memo or whatever, isn't a real memo or whatever unless it has been created with the official approved template...
Use AbiWord on the command line (Score:2)

by dominator ( 61418 ) writes:

I'd recommend using the CVS version of AbiWord. It'll preserve almost all of your visual and semantic meaning using XHTML and CSS. This includes fairly complex things like endnotes, footnotes, tables, floating text boxes, etc.

AbiWord --to=file.html file.doc

http://www.abisource.com/ [abisource.com]
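
If you have a whole folder of submissions, the command above could be wrapped in a short Python loop. This is just a sketch: it mirrors the `--to=file.html file.doc` form shown above, and the executable name (lowercase `abiword` here) and the helper names are assumptions you may need to adjust for your install.

```python
import subprocess
from pathlib import Path


def abiword_command(doc_path, executable="abiword"):
    """Build the command line for converting one .doc file to HTML."""
    doc = Path(doc_path)
    target = doc.with_suffix(".html")  # paper.doc -> paper.html
    return [executable, "--to={}".format(target), str(doc)]


def convert_folder(folder, executable="abiword"):
    """Convert every *.doc file in folder, stopping on the first failure."""
    for doc in sorted(Path(folder).glob("*.doc")):
        subprocess.run(abiword_command(doc, executable), check=True)
```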
is HTML really necessary (Score:2)

by lakeland ( 218447 ) writes:

Most web users now seem to tolerate PDF files, and exporting from word to PDF is much more reliable than exporting from word to HTML.

As a solution goes, it is pretty crude. However, it works quickly and easily, and produces nice looking output.
- Yes! (Score:3, Insightful)
  
  by Kelson ( 129150 ) * writes:
  
  PDF is most suitable for documents that need to be printed with specific formatting.
  
  For documents that are going to be viewed online, it's infinitely preferable to use a free-form format like HTML (was designed to be) that can adjust to varying monitor and window sizes.
antiword (Score:2)

by pizza_milkshake ( 580452 ) writes:

antiword is nice
http://www.winfield.demon.nl/ [demon.nl]
Use a text editor (Score:2)

by chia_monkey ( 593501 ) writes:

That's what I did. Copy all the text in Word, paste it into a text editor (which kills all the formatting, assuming you're not using RTF), then copy that and paste it into your HTML editor (usually the same editor you use to code your HTML), or paste it into Dreamweaver or similar and go that route. Quick and easy.
avoiding the hand-edit (Score:2)

by SethJohnson ( 112166 ) writes:

I'm seeing a lot of 'use Dreamweaver' responses that are well-meaning and probably will solve this guy's dilemma. But what about those of us running CMSes with text area inputs in forms? Our content people copy-and-paste directly from Word and these crazy MS Word entities get crudely transposed into ASCII question marks.

Anyone got a good regsub routine for correctly substituting these entities for their approximate ASCII equivalents? I'm just looking for pattern matching here... Don't need a bunch of
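One possible shape for such a routine, sketched in Python. It uses a translation table rather than a regex, and it assumes the form input has already been decoded to Unicode text. The character list and the names `WORD_TO_ASCII` and `asciify` are illustrative, not taken from any poster's code, and the table is far from exhaustive.

```python
# Map common Word "smart" characters to approximate ASCII equivalents.
# This table is an illustrative assumption, not a complete list.
WORD_TO_ASCII = {
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # ellipsis
    "\u2022": "*",    # bullet
    "\u00a0": " ",    # non-breaking space
}

_TABLE = str.maketrans(WORD_TO_ASCII)


def asciify(text: str) -> str:
    """Replace Word-style punctuation with plain ASCII look-alikes."""
    return text.translate(_TABLE)
```

Characters that are not in the table pass through unchanged, so anything the table misses still has to be dealt with at the encoding step.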
Webworks Pro (Score:2)

by 99BottlesOfBeerInMyF ( 813746 ) writes:

You may want to look at the WebWorks Pro application for sanely exporting Word files as HTML/XML. I've used it in the past (a handful of years ago) and it was pretty reasonable. It is worth investigating in any case.
fckeditor (Score:3, Informative)

by mixmasterjake ( 745969 ) writes: on Tuesday August 09, 2005 @06:10PM (#13282274)

fckeditor [fckeditor.com] is an in-browser WYSIWYG editor. It has a "Paste from MS Word" button that actually strips out a lot of the unnecessary baggage. I don't know how well it handles embedded images or tricky layouts, but for the basic stuff it works well.

The interface is similar to Word - maybe if you're lucky, you could get some of your content producers to use it.

HTML Tidy program (Score:5, Informative)

by Todd Knarr ( 15451 ) writes: on Tuesday August 09, 2005 @06:10PM (#13282278) Homepage

One program I've had luck with is the HTML Tidy program at http://www.w3.org/People/Raggett/tidy/ [w3.org]. It seems to clean up code (particularly from Word) quite a bit.
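A rough sketch of driving Tidy from a script, assuming a `tidy` binary is on the PATH. The `word-2000`, `bare` and `clean` options are Tidy configuration settings aimed at Word output, but option availability varies between Tidy versions, so check `tidy -help-config` on your install. The function names here are made up for illustration.

```python
import subprocess


def tidy_word_command(src: str, dest: str) -> list:
    """Build a Tidy command line aimed at Word-generated HTML."""
    return [
        "tidy",
        "-q",                   # quiet: suppress nonessential output
        "--word-2000", "yes",   # strip surplus markup from Word's "Save as Web Page"
        "--bare", "yes",        # strip Microsoft-specific HTML, plain spaces for nbsp
        "--clean", "yes",       # replace presentational tags with style rules
        "-o", dest,             # write the cleaned document to dest
        src,
    ]


def run_tidy(src: str, dest: str) -> int:
    """Run Tidy and return its exit status (0 = ok, 1 = warnings, 2 = errors)."""
    return subprocess.run(tidy_word_command(src, dest)).returncode
```

Keeping the command builder separate from the runner makes it easy to print the command and try it by hand before pointing it at a whole folder.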

PDF - GhostScipt (Score:2)

by Embedded Geek ( 532893 ) writes:

Some have suggested using PDFs. To do this, I use Ghostscipt and Ghostword. Here [oreilly.com] is a good description from O'Reilly's Word Hacks on how to install it in Word.
- Grrr... (Score:2)
  
  by Embedded Geek ( 532893 ) writes:
  
  That should be "Ghostsc r ipt," of course.
  I could really use a speling cheker.
WordML - FO - XHTML/PDF (Score:5, Informative)

by room101 ( 236520 ) writes: on Tuesday August 09, 2005 @06:12PM (#13282293) Homepage

Using a modern version of Word, output in WordML (XML format). Use an XSL stylesheet [antennahouse.com] to convert the WordML to FO (formatting objects).

From there, do anything you want, like XHTML or PDF.

Or just go to XHTML from WordML with some stylesheet. XSL is teh cool!

Recreating formatting? (Score:2, Insightful)

by RoadWarriorX ( 522317 ) writes:

"To achieve any sort of visual consistency on the site these various formatting tags all need to be scrubbed, but even using other office suites with better HTML export (OpenOffice.Org) to do the dirty work, it's often easier to recreate the formatting by hand from a plain-text version than it is to clean up a sea of messy tags."

The problem with conversion of documents to HTML in general is the expectation that the formatting needs to be preserved. There have been times where I needed to "post" a document
Net-It is your magical tool (Score:4, Informative)

by netringer ( 319831 ) writes: <maaddr-slashdot@NospaM.yahoo.com> on Tuesday August 09, 2005 @06:16PM (#13282312) Journal

Net-It Central [netit.com] is the magical tool you were looking for. With that you can just point it at the file share with the Word documents (and Excel and PowerPoint...) on it and see them indexed and cross-linked on web pages. It'll update the content as the source docs change.

Oh, you mean non-commercial magical tools?

Does anyone have any...magical tools...? (Score:2)

by exp(pi*sqrt(163)) ( 613870 ) writes:

A computer?
Abiword (Score:2)

by forlornhope ( 688722 ) writes:

The AbiWord project has a set of command-line utilities to convert Word documents to various other formats. It's called wv in Ubuntu/Debian. The one you want is called wvHtml.
Pagify (Score:3, Informative)

by bckspc ( 172870 ) writes: on Tuesday August 09, 2005 @06:32PM (#13282390) Homepage

Pagify [backspace.com] is a Perl script I wrote to do this for another job. It's basically a series of regular expressions that:

1. purges all the proprietary XML gunk from the HTML file you save from Word
2. chops the file into smaller files wherever a Heading 1 appears
3. attaches endnotes as footnotes to the appropriate pages

It's GPL'd, so go nuts.
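For a feel of what the purge and chop steps might look like, here is a minimal Python sketch. It is not Pagify's code: the patterns, the tag-prefix list and the function names are assumptions for illustration, and the endnote step is left out.

```python
import re

# Conditional comments Word writes into saved HTML, e.g. <!--[if gte mso 9]>...<![endif]-->
COND_COMMENT = re.compile(r"<!--\[if.*?<!\[endif\]-->", re.DOTALL | re.IGNORECASE)
# Embedded <xml>...</xml> islands.
XML_ISLAND = re.compile(r"<xml>.*?</xml>", re.DOTALL | re.IGNORECASE)
# Namespaced Office tags such as <o:p> or <st1:place>; the prefix list is a guess.
OFFICE_TAG = re.compile(r"</?(?:o|w|v|st1):[^>]*>", re.IGNORECASE)
# Start of a Heading 1.
H1_OPEN = re.compile(r"<h1\b", re.IGNORECASE)


def purge_office_markup(html: str) -> str:
    """Remove Office-only comments, XML islands and namespaced tags."""
    for pattern in (COND_COMMENT, XML_ISLAND, OFFICE_TAG):
        html = pattern.sub("", html)
    return html


def split_at_h1(html: str) -> list:
    """Cut the document into chunks, each starting at an <h1> (plus any preamble)."""
    starts = [m.start() for m in H1_OPEN.finditer(html)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    bounds = starts + [len(html)]
    chunks = [html[a:b] for a, b in zip(bounds, bounds[1:])]
    return [c for c in chunks if c.strip()]
```

Each chunk could then be written out as its own page. Regexes like these are brittle against markup they were not written for, so spot-check the output on real documents.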

Try this.... (Score:5, Informative)

by mormop ( 415983 ) writes: on Tuesday August 09, 2005 @06:32PM (#13282391)

Demoroniser is, in the words of the author's own man page:

A Perl script which corrects incompatible HTML generated by Microsoft applications. [fourmilab.ch]

You can get it from the link on the same page. I must confess that I've not used it myself (I don't use Office/FrontPage), but if it does what it says on the tin it should sort you out.

Dreamweaver MX 2004 (Score:3, Informative)

by greymond ( 539980 ) writes: on Tuesday August 09, 2005 @06:45PM (#13282479) Homepage Journal

1) Open Word
2) Select All -> Copy
3) Open Dreamweaver
4) File -> New Html Doc
5) Paste
6) Commands -> Clean up Word Html
7) Commands -> Apply Source Formatting (if you take the time to set the program's preferences to what you like)
8) Done
9) Drink beer
10) Sleep

wvHtml (Score:3, Informative)

by itomato ( 91092 ) writes: on Wednesday August 10, 2005 @09:45AM (#13285657)

http://wvware.sourceforge.net/ [sourceforge.net]

From the sourceforge page:

wv Utilities

Provided with the wv distribution is an application called wvWare. wvWare is a "power-user" application with lots of command-line options, doo-dads, bells, and whistles. Less interesting, but more convenient, are the helper scripts that use wvWare. These are:

* wvHtml: convert your Word document into HTML4.0

(there are more utilities for LaTeX, etc.)

I'm using this to convert all of our internal documentation. It does a pretty good job, even converts the images and acts in a relatively reliable manner with 2003, 2000, & 97 formatted files. There's some oddball output sprinkled in, but nothing a little sed fanciness can't fix.

- Re:hi (Score:2, Funny)
  
  by Anonymous Coward writes:
  
  hello. how are you?
  - Re:hi (Score:2, Funny)
    
    by Anonymous Coward writes:
    
    I'm fine thank you
    - Re:hi (Score:2, Funny)
      
      by Anonymous Coward writes:
      
      I'm fine too.
      
      I'm glad we have these little discussions. It makes my day so much more interesting.
      
      Let's do lunch.
- Re:You need an intern (Score:2)
  
  by 0xABADC0DA ( 867955 ) writes:
  
  So basically:
  
  1. Hire intern to transcribe work .doc into html
  2. Outsource html transcription job to India
  3. ...
  4. Profit!!
- Re:More specifically: Word into MS CMS (Score:2)
  
  by Alien Conspiracy ( 43638 ) writes:
  
  Can you edit the CMS server page so that it just has a regular textarea instead of anything fancy?
- Re:More specifically: Word into MS CMS (Score:3, Informative)
  
  by sbma44 ( 694130 ) writes:
  
  Are you using the Telerik RadEditor MCMS placeholder? It's free, and has capabilities that let you automatically strip out Word formatting. In my experience it only sort of works... but it's better than nothing.
  
  You can also add an event handler for the updating event that does some regex tidying. Replacing the regex "]*>" will go a long way (better double-check that). You should be able to come up with a similar one for all the smarttag nonsense that gets inserted, too.
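The pattern quoted above did not survive the page extraction, so the following is only a guess at the general idea, written in Python rather than as a .NET event handler. Every pattern and the name `tidy_pasted_html` are stand-ins, not a reconstruction of the poster's regex.

```python
import re

# Inline wrappers Word sprays everywhere; dropping the tags keeps their text.
SPAN_OR_FONT = re.compile(r"</?(?:span|font)\b[^>]*>", re.IGNORECASE)
# Smart tags such as <st1:place>...</st1:place>.
SMART_TAG = re.compile(r"</?st1:[^>]*>", re.IGNORECASE)
# style="..." or style='...' attributes.
STYLE_ATTR = re.compile(r"""\s+style\s*=\s*(?:"[^"]*"|'[^']*')""", re.IGNORECASE)
# class=MsoNormal and friends, quoted or unquoted.
MSO_CLASS = re.compile(r"""\s+class\s*=\s*(?:"Mso[^"]*"|'Mso[^']*'|Mso\w*)""", re.IGNORECASE)


def tidy_pasted_html(html: str) -> str:
    """Strip Word's presentational wrappers and attributes from pasted HTML."""
    for pattern in (SPAN_OR_FONT, SMART_TAG, STYLE_ATTR, MSO_CLASS):
        html = pattern.sub("", html)
    return html
```

The attribute patterns run over the whole string, so they would also eat a literal ` style="..."` that happens to appear in body text. That is the "better double-check that" part.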
  
  Still, Word formatting remains
- Re:DEMORONISER (Score:3, Informative)
  
  by Kelson ( 129150 ) * writes:
  
  The Demoroniser was nice in its time, but it assumes the output should be 7-bit ASCII, or ISO Latin-1 at best.
  
  The Unmoroniser is an updated version that handles Unicode properly and will do things like convert proprietary Windows-only curly quotes to the appropriate HTML4 entities instead of dropping them back to less accurate, typographically offensive straight quotes. Same with ligatures and other characters that the Demoroniser would munge instead of convert.
  
  http://rheme.net/unmoroniser/ [rheme.net]
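The same convert-don't-flatten idea can be sketched in a few lines of Python. This is not the Unmoroniser's code: it assumes the input is raw Windows-1252 bytes, it emits numeric character references rather than named HTML4 entities, and the function name is made up.

```python
def word_bytes_to_html_text(raw: bytes) -> str:
    """Decode Windows-1252 bytes and emit ASCII with numeric character references."""
    # Bytes 0x80-0x9F hold the curly quotes and dashes in Windows-1252;
    # undefined bytes become U+FFFD instead of raising an error.
    text = raw.decode("cp1252", errors="replace")
    # Anything outside ASCII is written as &#NNNN; so no character is flattened.
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")
```

For example, the bytes `\x93` and `\x94` come out as `&#8220;` and `&#8221;`. Markup characters such as `&` and `<` are left alone, so escaping them is a separate step.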
- Re:Export it as XML and XSLT it to HTML (Score:3, Informative)
  
  by Lemuridae ( 741647 ) writes:
  
  From the AC above:
  
  1) get a copy of Word 2003
  2) "save as" an exemplar as XML
  3) write an XSLT to render it as HTML, with stylesheets etc. as appropriate to your website
  4) for every document you get, "save as" XML with the XSLT from 3) as the transformation.
  5) publish
  
  I've been wondering how long until using XSLT and XML was suggested. XML is supposed to be a common data transport format but most of the other comments talk about starting
- HTML Tidy cleans Word HTML. (Score:3, Informative)
  
  by Futurepower(R) ( 558542 ) writes:
  
  HTML Tidy cleans HTML, and has a special function for cleaning Word HTML junk.
  
  It must be terrible to work at Microsoft and always do mediocre work.
  
  --
  If you support dishonesty and violence [doonesbury.com], don't say you are Christian.
- Re:Convert to RTF first (Score:3, Interesting)
  
  by tonsofpcs ( 687961 ) writes:
  
  Yes, it is, since RTF is a text-based format where all the formatting is open and close tags, much like HTML. Save a Word doc as RTF instead and open it in Notepad, and you will see. There are many premade tools to convert from RTF to HTML, but you can build your own easily.

