
Sanely Moving from Word to the Web?

FooAtWFU asks: "I have a job for a web site (no link for you, Slashdot hordes!). A lot of it is systems administration and development, but I have to routinely post content which comes from a myriad of other sources. Usually they are from academic users, come in Word format, and ultimately need to be posted in HTML. The problem is that Word has all sorts of tricks up its sleeve to throw off the font, layout, size, and so forth. To achieve any sort of visual consistency on the site these various formatting tags all need to be scrubbed, but even using other office suites with better HTML export (OpenOffice.Org) to do the dirty work, it's often easier to recreate the formatting by hand from a plain-text version than it is to clean up a sea of messy tags. Does anyone have any advice (or magical tools) to help me deal with this sort of tedious cleanup?"

  • PDF? (Score:2, Insightful)

    by Anonymous Coward on Tuesday August 09, 2005 @05:52PM (#13282124)
    How about Word -> PDF -> HTML?

    Just a thought ... and probably a dumb one.
  • PDF? (Score:3, Insightful)

    by night_flyer ( 453866 ) on Tuesday August 09, 2005 @05:52PM (#13282128) Homepage
    You can either hot link to the .doc files themselves in a new window, or convert them to .pdf and do the same thing.
  • Re:Scrapping (Score:3, Insightful)

    by dougmc ( 70836 ) <dougmc+slashdot@frenzied.us> on Tuesday August 09, 2005 @06:05PM (#13282241) Homepage
    Use Regex (and similar functions) to strip off and/or replace specific tags/wordings (similar to web scraping technique).
    Of course, you really can't properly parse html just using regular expressions. You can get it right 90% of the time relatively quickly, and a day or so work will get you 5% more, but you could spend weeks trying to get that last 5% -- and never quite get it.

    It's really better to use things that other people have made for parsing HTML. For example, if you use Perl (and you should -- it's the ideal tool for this), HTML::Parser works pretty well, though there's a significant learning curve in using it.
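The same parser-over-regex idea can be sketched in Python with the standard library's `html.parser`, as a rough analogue of the Perl module mentioned above rather than a drop-in equivalent. The tag whitelist, the attribute whitelist, and the names `WordHtmlScrubber` and `scrub` are all assumptions made up for this illustration; a real site would tune them to its own markup.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: structural tags worth keeping. Anything else
# (font, span, o:p, div, ...) is dropped, but its text is kept.
KEEP_TAGS = {"p", "h1", "h2", "h3", "ul", "ol", "li", "em", "strong",
             "b", "i", "a", "table", "tr", "td", "th", "blockquote", "br"}
# Attributes kept per tag; style, class, lang and the rest are discarded.
KEEP_ATTRS = {"a": {"href"}}
# Tags whose text content is thrown away along with the tag itself.
DROP_CONTENT = {"style", "script", "head", "title"}
# Kept tags that never get a closing tag in the output.
VOID_TAGS = {"br"}


class WordHtmlScrubber(HTMLParser):
    """Re-emit only whitelisted tags and attributes from exported HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a DROP_CONTENT element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in KEEP_TAGS:
            return
        allowed = KEEP_ATTRS.get(tag, set())
        kept = "".join(
            f' {name}="{escape(value or "", quote=True)}"'
            for name, value in attrs
            if name in allowed
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag not in KEEP_TAGS or tag in VOID_TAGS:
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is kept (re-escaped) unless it sits inside a dropped element.
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def scrub(html_text):
    """Return html_text reduced to whitelisted tags and attributes."""
    parser = WordHtmlScrubber()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Comments are dropped because the parser's default comment handler does nothing, which also discards Word's conditional-comment blocks. The sketch does not repair badly nested markup; it only filters what the parser reports.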

  • by dougmc ( 70836 ) <dougmc+slashdot@frenzied.us> on Tuesday August 09, 2005 @06:11PM (#13282281) Homepage
    Everybody else is on board with plain text
    I don't know where you live/work, but out here in the real world, not everybody is on board with plain text. Not anymore.

    I use mutt and fetchmail in a company of Exchange users. Almost every email I get at work now, from everybody, is in html. (Unless I sent it to myself.) I don't like it, but I deal with it. It's certainly easier to deal with it than to try and change everybody else.

    I could change jobs, but over something as trivial as html emails? No. I like my job, I like the people I work with, so I just bend like the reed in the wind ...

    Still, the executives are certainly worse about email etiquette than most, and it's not just in this company -- everywhere I've worked I've found this to be the case. They don't include Subjects at all, or include useless ones like `message'. Some will type up a memo and send it as a .pdf file attachment, or worse as a .bmp file. They rarely trim anything when responding to a post -- they just top post away. (But many people do that ...)

  • by RoadWarriorX ( 522317 ) on Tuesday August 09, 2005 @06:14PM (#13282303) Homepage
    To achieve any sort of visual consistency on the site these various formatting tags all need to be scrubbed, but even using other office suites with better HTML export (OpenOffice.Org) to do the dirty work, it's often easier to recreate the formatting by hand from a plain-text version than it is to clean up a sea of messy tags.

    The problem with conversion of documents to HTML in general is the expectation that the formatting needs to be preserved. There have been times where I needed to "post" a document to a web site, and I always try to get the author(s) to not worry about formatting. Formatted documents are pure evil simply because 9 times out of 10 the formatting does not affect the relevant information that you are trying to convey to your audience. Sometimes the authors give me grief about it, but I simply show them the possibilities of separating the content and presentation during the translation. I convert their documents to generic HTML (with whatever tools are available) and use CSS to apply relevant formatting for the type of document (a report, article, thesis, or whatever). No funky font tags, or weird tables. Just let the HTML flow as it's meant to.
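One way to picture that content/presentation split is a small Python sketch that wraps an already-cleaned HTML fragment in a page shell and picks a stylesheet by document type. The stylesheet paths, the `STYLESHEETS` mapping, and the `render_page` helper are hypothetical names invented for this example; the poster does not describe a specific tool.

```python
from html import escape
from string import Template

# Hypothetical mapping from document type to a site stylesheet.
STYLESHEETS = {
    "report": "/css/report.css",
    "article": "/css/article.css",
    "thesis": "/css/thesis.css",
}

# Page shell: all presentation comes from the linked stylesheet.
PAGE = Template("""<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>$title</title>
<link rel="stylesheet" href="$stylesheet">
</head>
<body class="$doc_type">
$body
</body>
</html>
""")


def render_page(title, doc_type, body_html):
    """Wrap generic body HTML in a page styled for the given document type."""
    if doc_type not in STYLESHEETS:
        raise ValueError(f"unknown document type: {doc_type!r}")
    return PAGE.substitute(
        title=escape(title),
        stylesheet=STYLESHEETS[doc_type],
        doc_type=doc_type,
        body=body_html,  # assumed to be generic HTML with no inline styling
    )
```

Changing how every report looks then means editing one CSS file, not re-converting each document.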
  • by Anonymous Coward on Tuesday August 09, 2005 @06:49PM (#13282509)
    Since they were produced at work, the copyright on them is probably owned by the company and not by him.

    Plus the templates are probably in-house templates and thus would be useless outside of the company.
  • Yes! (Score:3, Insightful)

    by Kelson ( 129150 ) * on Tuesday August 09, 2005 @06:56PM (#13282566) Homepage Journal
    PDF is most suitable for documents that need to be printed with specific formatting.

    For documents that are going to be viewed online, it's infinitely preferable to use a free-form format like HTML (as it was designed to be) that can adjust to varying monitor and window sizes.
  • Re:Dreamweaver (Score:3, Insightful)

    by necro2607 ( 771790 ) on Tuesday August 09, 2005 @07:00PM (#13282592)
    We use Dreamweaver exclusively for HTML/CSS programming and it doesn't create code that doesn't work - in fact the code it creates is very very compatible: I've never had to manually tweak code to obtain proper design and layout with the sites I've built.

    Indeed we'd love to move to advanced CSS for page formatting but that's a big step right now - there are no professional WYSIWYG editors that have the sheer range and quality of features we need - page templates, ability for clients to update the site later in a very convenient WYSIWYG interface, high compatibility with ultra-common web media such as Flash, etc.

    Trust me, we're keeping our eyes peeled for a better solution but right now Dreamweaver is the best available. Sometimes simply sticking to "standards" isn't necessarily the best idea. In fact, sticking to proper standards creates sites that differ in appearance from browser to browser. Dreamweaver has very impressive awareness of inconsistencies and standards-deviation in many browsers.
  • Re:PDF? (Score:3, Insightful)

    by FooAtWFU ( 699187 ) on Tuesday August 09, 2005 @08:22PM (#13283042) Homepage
    The people we're dealing with here are not social sciences people, specifically economics. I'd be perfectly fine with taking DocBook or TeX documents, but nobody's going to send them. It's not happening. We accept Word documents because we have ALWAYS accepted Word documents and most contributors probably aren't even aware that something like TeX or DocBook even exists, let alone how to use it. And they're not willing to learn it just to send us stuff.
  • by advocate_one ( 662832 ) on Wednesday August 10, 2005 @01:29AM (#13284259)
    Because Microsoft put that "Oh, so convenient" send as email entry in the File menu of MS Word... that's why. Then the top of the totem pole dunderheads don't have to go to the trouble of firing up their email client, creating a message, and then finding the file to attach...

    and as another poster has suggested, perhaps it's the quality department to blame as a memo or whatever, isn't a real memo or whatever unless it has been created with the official approved template...

  • Re:PDF? (Score:2, Insightful)

    by anthony_dipierro ( 543308 ) on Wednesday August 10, 2005 @07:06AM (#13284930) Journal

    Do you really think the extra bandwidth costs would be more than the cost of his salary while coming up with a better solution?

  • by khakipuce ( 625944 ) on Wednesday August 10, 2005 @07:56AM (#13285087) Homepage Journal
    I use mutt and fetchmail in a company of Exchange users
    isn't exactly "bending like a reed in the wind"; using a graphical mail client would be.

    Why do you use mutt and fetchmail? Why? Why? Why? Just about everywhere I have worked it has been easier (and often there is no choice) to just use what they use rather than trying to be clever or different. It is good to gain wide experience and it is good to have the flexibility to use the tools at hand.

  • by Evil Grinn ( 223934 ) on Wednesday August 10, 2005 @08:02AM (#13285108)
    You realize he was converting them for the purpose of putting them on a website, right? ; )

    Did he say the website was public?
