Microsoft

Converting Word Files to Text for Archiving? 81

Unknown Relic asks: "Our company has large quantities of old, MS Word documents which we are looking to permanently archive. One of the requirements of our archiving process is that the documents be stored in plain text format. Unfortunately we also have another, conflicting requirement: the text files must retain basic formatting information from the original documents, including bullets, indentations and basic table layout. While all of this formatting is possible using plain text, I have not been able to find any tools which do a decent job of retaining the above mentioned formatting during conversion. Even Word's 'Save As' option does a horrible job, though I suppose that's not overly surprising. Has anyone undertaken a project similar to this before? If so, what tools did you find or create to make the job feasible?"
This discussion has been archived. No new comments can be posted.

  • have you tried (Score:4, Insightful)

    by gnixdep ( 629913 ) on Sunday December 08, 2002 @08:44PM (#4840842)
    Save As.. HTML? It's plain text, but it retains the formatting, as long as you don't get too exotic.
    • This is exactly what they need. No special characters, just plain ASCII text that, when viewed in a standard browser, shows up formatted the way it was in Word. The original poster will need to find some way of converting the documents other than Word, though, because Word uses strange CSS extensions, proprietary fonts, etc.
  • Dude. (Score:1, Funny)

    by Anonymous Coward
    You are seriously asking the impossible.

    If you want something automated that somehow preserves this information, you'd better find something that understands Word encoding 100%. Nothing besides Word can do that (though things like ClarisWorks come close).

    The best you can do is go through the things by hand and transcribe them with whatever concocted plaintext encoding scheme you came up with.

    The other solution is to have a deep think about why you are abandoning Microsoft Word in the first place. Is it too difficult to keep one machine around with a copy of Windows for viewing these files?
    • They aren't abandoning Word necessarily; they are archiving these particular files, and they figure plain text is much more likely to stand the test of time than a proprietary format.
      • I've been asked to do this sort of thing before and here's the advice I gave: trying to choose a format that will last the years is more dangerous than regularly maintaining your content in a modern format[1]. Keep it in XML for now, and every five years see where you think formats are going and move the data again. The idea of choosing a format now, being stuck with it, and hoping that when you need it you'll be Ok is the most dangerous and doomed scenario.

        As someone else said I would use XHTML and UTF-8 (ie, MSWord or OO.org's save to HTML, and then HTML tidy it). HTML is a well understood format, with readers on many platforms, and -unlike plain-text- maintains structure.

        [1] I don't mean bleeding edge modern, I mean some documented format such as Docbook, RTF, or XHTML.

    • Re:Dude. (Score:3, Funny)

      by MaggieL ( 10193 )
      You are seriously asking the impossible...Nothing besides Word can do that...The other solution is to have a deep think about why you are abandoning Microsoft Word in the first place.

      Mod parent up as funny, dude. Best MCSE parody I've seen in weeks.

      Seriously, dude.
  • anti-word (Score:4, Informative)

    by morgothan ( 201770 ) <`nat' `at' `morgothan.com'> on Sunday December 08, 2002 @08:49PM (#4840871) Homepage Journal
    There is a really nice application called antiword that strips a Word file into a text file. I'm not too sure if it will retain all the bullets and other crazy stuff. It should be worth looking at, though. You can check it out at http://www.winfield.demon.nl/index.html

    Well hope that helps.
  • PDF (Score:2, Informative)

    by ObviousGuy ( 578567 )
    I believe Adobe Acrobat can import Word documents and save them as PDF.
  • Try... (Score:5, Informative)

    by curunir ( 98273 ) on Sunday December 08, 2002 @08:51PM (#4840878) Homepage Journal
    this [microsoft.com] or this [ibm.com]
  • by metacosm ( 45796 ) on Sunday December 08, 2002 @08:54PM (#4840889)
    Short Answer: Good Luck! :)

    Long Answer:

    This was a few years ago, so it doesn't take into account new applications that could have totally changed the conversion landscape.

    I have tried to do something similar in the past for a company that does JUST conversions. We actually tried to get doc -> text with formatting, both with closed-source expensive applications and with open-source apps... not much success on either front.

    In the end, we basically gave up on it, and ended up making 2 versions of the document. A PDF version and a text version. The text version was easily searchable, and the PDF version looked great. Both ASCII and PDF are open standards, so that when they are phased out, we should be able to buy/write conversion tools to the next generation. We then wrapped a GUI around it so that when you searched, you searched the text, and you got results in PDF.
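    The "search the text, return the PDF" arrangement can be sketched in a few lines of Python. This is only an illustration, not the GUI described above: it assumes each document was saved as a same-named pair (report.txt next to report.pdf), and the function name `search_archive` is made up for the example.

```python
from pathlib import Path


def search_archive(root, term):
    """Return the PDFs whose sibling .txt file contains `term` (case-insensitive)."""
    needle = term.lower()
    hits = []
    for txt in sorted(Path(root).rglob("*.txt")):
        # errors="replace" keeps the search going on files with odd encodings.
        content = txt.read_text(encoding="utf-8", errors="replace").lower()
        if needle in content:
            pdf = txt.with_suffix(".pdf")
            if pdf.exists():  # only report documents that have a PDF twin
                hits.append(pdf)
    return hits
```

    A plain substring scan like this is fine for a small archive; a large one would want a real index.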

    The fact that PDF is an open standard is important. When we did this, we used the text files for searching, but nowadays you can get lots of engines to search PDFs directly, so the text conversion may not even be needed.

    Sorry I don't have the solution you are looking for... honestly good luck. :)
  • This is by no means the easiest solution, but one that could work. I don't envy the position you're in.

    1. Create a web page (eg. ASP) on a server running MS Word. In your back end code, create an instance of MS Word and automate the conversion of these files to HTML. Pain in the ass, but this can be done.
    2. MS Word HTML comes out with all sorts of xml tags and crap, so use a simple regular expression to filter out all the tags you don't need, keeping <p>, <ol> etc etc.
    And yes:

    3. ???
    4. Profit!!
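    For step 2, a rough Python sketch of the tag filter. The whitelist and the name `strip_word_tags` are assumptions for illustration, and a regular expression is only a crude filter for Word's HTML; a proper HTML parser or HTML Tidy would be more robust.

```python
import re

# Tags kept for basic structure (an assumed whitelist; adjust to taste).
KEEP = {"p", "ol", "ul", "li", "table", "tr", "td", "th", "b", "i", "br"}

# An opening or closing tag; captures the slash and the tag name, including
# namespaced names such as the <o:p> tags found in Word's HTML output.
TAG_RE = re.compile(r"<(/?)([A-Za-z][\w:.-]*)([^>]*)>")

# Comments (including conditional comments) and whole <style>/<xml> blocks.
JUNK_RE = re.compile(
    r"<!--.*?-->|<(style|xml)\b.*?</\1\s*>", re.DOTALL | re.IGNORECASE
)


def strip_word_tags(html: str) -> str:
    """Drop every tag not in KEEP and strip attributes from the ones kept."""
    html = JUNK_RE.sub("", html)

    def replace(match: re.Match) -> str:
        slash, name = match.group(1), match.group(2).lower()
        return f"<{slash}{name}>" if name in KEEP else ""

    return TAG_RE.sub(replace, html)
```

    For example, `strip_word_tags('<p class=MsoNormal>Hi<o:p></o:p></p>')` leaves just the bare paragraph tags around the text.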
    • Dreamweaver has a nifty "strip MS-word type tags from document" you might find handy when converting to html. I would recommend converting to XHTML format actually, that way your files can be processed using an xml parser as well.
    • And you need to do this in ASP because? You can just as easily automate it from a VB app. One thing I have done in the past, which is a little off the wall, but works OK, is convert the Word doc to HTML and then use CDO and VB (or C# or VB.Net) to create an email message out of it. Assign the generated HTML to the HTMLBody, then pull the text version off the Body of the message and discard the email message. It does a nice job of converting the HTML to text. Bullets become asterisks with tabs, tables get tabbed out too I think. Try it out, it might give you what you want.
      • HTML is already text??? I understand there are reasons to turn html into regular old plain text but certainly not for archiving... The two issues are a format that will be around in 10yrs, html should do the trick, or xhtml and compression, either is text and should get an EXTREMELY high compression ratio.
  • http://www.winfield.demon.nl/index.html
  • Write a macro for Word that saves all open documents as RTF. Open all documents, run the macro. If that is possible, of course. Don't use MS office myself.
  • I don't know of a direct solution, but I would decompose this into two problems.
    1. Transform your documents into a reasonable format, XML or very simple HTML: no page layout, only tables and lists.
    2. Transform those files into text.
    The first part is difficult, as you have to filter the data and remove a lot of unneeded information. One possible way to do this would be to convert the Word documents into RTF, and then the RTF into HTML (there are tools to do this, like RTFTOHTML [slashdot.org]).

    The second part is quite easy. If your data is XML, you can convert it using simple scripts (tables might be an issue); if the data is simple HTML, you could use Lynx to convert it into text with some layout.

    In fact, I would keep both versions of the data handy. Having a somewhat structured version of the data never hurts, and text files do not take up much space.
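    The second step (simple HTML to text with some layout) could look roughly like this in Python, using only the standard library. It is a sketch under assumptions: the input is already reduced to paragraphs, lists and tables, every list item gets a "* " bullet (ordered lists are not numbered), and table cells are separated by tabs. `TextLayout` and `html_to_text` are names invented for the example.

```python
from html.parser import HTMLParser


class TextLayout(HTMLParser):
    """Render simple HTML (paragraphs, lists, tables) as indented plain text."""

    def __init__(self):
        super().__init__()
        self.lines = []       # finished output lines
        self.buf = []         # text fragments of the line being built
        self.row = []         # cells of the table row being built
        self.depth = 0        # current list nesting level
        self.prefix = ""      # indent (and bullet) for the next line
        self.in_cell = False  # True while inside <td> or <th>

    def _take(self):
        """Return buffered text with whitespace collapsed, and reset the buffer."""
        text = " ".join("".join(self.buf).split())
        self.buf = []
        return text

    def _flush(self):
        if self.in_cell:
            return  # cell text is emitted when the cell closes
        text = self._take()
        if text:
            self.lines.append(self.prefix + text)
        self.prefix = "  " * self.depth  # continuation lines align under the text

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self._flush()
            self.depth += 1
        elif tag == "li":
            self._flush()
            self.prefix = "  " * max(self.depth - 1, 0) + "* "
        elif tag in ("td", "th"):
            self._flush()
            self.in_cell = True
        elif tag in ("p", "br", "tr", "table"):
            self._flush()

    def handle_endtag(self, tag):
        if tag in ("ul", "ol"):
            self._flush()
            self.depth = max(self.depth - 1, 0)
            self.prefix = "  " * self.depth
        elif tag in ("td", "th"):
            self.in_cell = False
            self.row.append(self._take())
        elif tag == "tr":
            if self.row:
                self.lines.append(self.prefix + "\t".join(self.row))
            self.row = []
        elif tag in ("p", "li", "table"):
            self._flush()

    def handle_data(self, data):
        self.buf.append(data)


def html_to_text(html):
    parser = TextLayout()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any trailing text
    return "\n".join(parser.lines)
```

    Tab-separated cells will not line up in every viewer; padding each column to a fixed width would be the next refinement.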

  • wvWare (Score:5, Informative)

    by Alethes ( 533985 ) on Sunday December 08, 2002 @09:06PM (#4840943)
    wvWare [wvware.com] has a library and a set of utilities. I use this all the time to convert Word attachments to HTML so I can read them.

    wvHtml: convert your Word document into HTML4.0

    wvLatex: convert your Word document into visually (pretty) correct LaTeX

    wvCleanLatex: convert into 'cleaner' LaTeX containing less visual mark-up, more suitable for further use and LyX import. Work in progress

    wvDVI: converts word to DVI. Requires 'latex'

    wvPS: converts word to PostScript. Requires 'dvips'

    wvPDF: converts word to Adobe PDF. Requires 'distill' from Adobe [Someone do a pdflatex or pdfhtml version :-)]

    wvText: converts word to plain text. Textually correct output requires 'lynx.' For poor output, this doesn't require anything special.

    wvAbw: converts word to Abiword format. (Far better just to use Abiword.)

    wvWml: converts word to WML for viewing on portable devices like WebPhones and Palm Pilots.

    wvRtf: a basic version exists

    wvMime: can be plugged as a MIME helper application into your browser/mail client; presents the document on-screen inside GhostView, while all intermediate files generated go into the /tmp directory.
    • Re:wvWare (Score:3, Informative)

      by sohp ( 22984 )
      I'll second wv, formerly known as MSWordView. I've used it since before the name change and have been satisfied. According to the blurb at freshmeat [freshmeat.net],
      wv (formerly known as MSWordView) is a library that understands the Microsoft Word 2000, 97, 95 and 6 file formats (".doc"), and is able to convert Word documents into HTML, which can then be read with a browser. It also allows other programs access to Word documents for the purpose of converting them to other formats (like RTF, PostScript, and PDF), and is currently being used by Abiword as its word importer.


      If by chance you have any Java around, the POI HDF APIs [apache.org] are great for manipulating that Horrible Document Format.
  • by SN74S181 ( 581549 ) on Sunday December 08, 2002 @09:25PM (#4841037)
    One of the requirements of our archiving process is that the documents be stored in plain text format.


    There's one simple answer: uuencode.

    *ducks*
  • by joe094287523459087 ( 564414 ) <joeNO@SPAMjoe.to> on Sunday December 08, 2002 @09:32PM (#4841066) Homepage
    i think all these answers are missing an essential part of your requirement, which is to keep in place the "advanced" formatting like indentations, italicization, bullets

    i worked at a company that created all their docs (1100 of them) in word and wanted them to be ported to pagemaker or the like. however, at that time there was no way to do the conversion so we needed to convert them to text first. we basically had the same problem you have.

    we ended up using a Word macro to convert all the word styles (font style, indents, bullets) to plaintext equivalents (tabs, _underline_, xbullets) and then saved as text. that worked nicely and i think it's the only way to accurately preserve word-specific styles and formatting.
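    the mapping itself is simple once the styles have been read out of the document. here is a rough python sketch of the idea, not the macro we used: the `Run` and `Paragraph` records are a made-up stand-in for whatever the macro reads from word, and the markers (tabs for indent, "- " for bullets, _underline_, *italic*) are just one possible convention.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """A stretch of text with uniform character formatting (hypothetical model)."""
    text: str
    italic: bool = False
    underline: bool = False


@dataclass
class Paragraph:
    """A paragraph with its indent level and bullet flag (hypothetical model)."""
    runs: list = field(default_factory=list)
    indent: int = 0
    bullet: bool = False


def render_run(run):
    """Wrap the run's text in plain-text markers for its character styles."""
    text = run.text
    if run.underline:
        text = f"_{text}_"
    if run.italic:
        text = f"*{text}*"
    return text


def render_paragraph(paragraph):
    """One tab per indent level, then an optional bullet, then the runs."""
    body = "".join(render_run(run) for run in paragraph.runs)
    return "\t" * paragraph.indent + ("- " if paragraph.bullet else "") + body


def render_document(paragraphs):
    return "\n".join(render_paragraph(p) for p in paragraphs)
```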
  • Stellent Outside In (Score:4, Informative)

    by divbyzero ( 23176 ) on Sunday December 08, 2002 @09:40PM (#4841107) Journal
    Stellent (formerly INSO) Outside In [stellent.com] is a popular commercial solution to the problem. It is an SDK for conversion of lots of document types (including Word) into plaintext, HTML, or XML. It allows you to control how much of the formatting is preserved, and in what manner. It's not perfect, not crash-proof, and not free, but it might do the trick in a corporate situation, especially when wrapped with a watchdog process. My company has had a lot of success with it.
  • The thing I don't understand is why you even need to do this in the first place? How many documents are we talking about here? I know .doc is quite inefficient compared to ASCII text, but unless you have so many documents that storage is a problem why not just leave them the way they are?

    Even if storage is a problem, compare the cost of your time to the cost of a few CDRs (or tapes, or whatever)
    • Re:Why? (Score:4, Informative)

      by Unknown Relic ( 544714 ) on Sunday December 08, 2002 @10:50PM (#4841410) Homepage
      I wasn't going to get into the details of specifically why this was needed, but since a large number of the posts have been asking the question, I will explain. An AC actually indirectly hit the nail on the head, though he was accusing me of fraud in the process: "If you are seriously interested in archiving, you should save off the files in a non-updatable format" For legal reasons, our company is required to maintain a permanent archive of some specific documents. In this case, permanent does not mean ten or even twenty years, but one hundred. This being the case, the archiving is actually being done to an analogue medium. If you are curious, we are using a type of microfilm which is specifically designed to be readable for up to 500 years. Microfilm also has the added benefit of being relatively unalterable once developed, unlike a digital file. The need for doc to text conversion comes from the hardware which performs the actual archiving. It only accepts text files or specially formatted TIFF images as input. The text files are not the actual format in which the documents will be archived, but they are a necessary intermediate step in the archiving process.
      • What I would do is use Word to "print" the files using a PostScript printer driver. Use the "save to file" option in the printer driver, then convert the .ps files to TIFFs. Details left as an "exercise for the reader".
        Word scripts and macros also left as an "exercise for the reader"
      • If a TIFF file will work, then why not some kind of 'print to fax' setup?

        Basic idea is:
        Word: print->fax program
        fax program saves as TIFF file (perhaps choosing 'fine' resolution etc.)

        Same idea with a different implementation would be to print to a 'postscript' printer, then use a postscript to tiff conversion program. This could be fairly well automated with a print queue on a *nix box accepting the postscript 'print' jobs, then simply archiving the resulting tiff files off someplace handy.
        • Sounds good. One possible snag is if he cannot use multi-page TIFFs, although doubtless there exists some little open source snippet to convert a multi-page TIFF into lots of single pages.
        • Even better:

          There's an inexpensive shareware program called "FinePrint" which works as a fake printer driver between your applications and your actual printer. This program was originally created to save you paper, by printing up to eight pages on one side of one sheet, but over time it has gained a lot of nice features.

          One of the features that I've used for my online transaction receipts is the option to save the print job(s) to a file or a bunch of files. It will save in the following formats:

          • fp (FinePrint)
          • bmp
          • emf
          • jpg
          • tif
          • txt

          When I save as a TIF, I am given the option of monochrome, 4-bit, 8-bit, or 24-bit, and a resolution of 72-1200 or "custom". There is a checkbox for "Create a separate file for each page", so that solves the multi-page problem as well.

          I really like this program, as I've been a registered user since nearly the beginning, and I've used nearly all of its features at one time or another. And I don't get anything for mentioning it either, so =P

          FinePrint [fineprint.com]

          • There's an inexpensive shareware program called "FinePrint" which works as a fake printer driver between your applications and your actual printer.
            I meant to have the word "driver" at the end of this sentence.
      • In that case, there seem to be two possible approaches, depending on how much formatting is to be retained:

        1 - Print to Postscript from MS Word, and use Ghostscript to convert the PS pages into TIFF files for imaging and archival.

        2 - Print to Text Only from MS Word, and send the resulting text files for imaging and archival.

        I would go with (1), simply because it takes less skilled review of the work (simple match/no match, instead of "how close is it").

        Ratboy
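        For approach (1), the Ghostscript step might be scripted along these lines in Python. Treat it as a sketch: the `tiffg4` device (monochrome, fax-style G4 compression), the 300 dpi default and the one-file-per-page naming are guesses at what the archiving hardware's "specially formatted TIFF" would need, and the function names are made up. On Windows the executable is usually `gswin32c` or `gswin64c` rather than `gs`.

```python
import shutil
import subprocess
from pathlib import Path


def ghostscript_tiff_command(ps_file, out_dir, dpi=300, gs="gs"):
    """Build the Ghostscript command line that renders one TIFF per page."""
    # %03d is expanded by Ghostscript to the page number: name-001.tif, ...
    pattern = Path(out_dir) / (Path(ps_file).stem + "-%03d.tif")
    return [
        gs,
        "-dBATCH",            # exit when the input has been processed
        "-dNOPAUSE",          # do not wait for a keypress between pages
        "-dSAFER",            # restrict file access from within the PostScript
        "-sDEVICE=tiffg4",    # assumed output device; check the hardware's spec
        f"-r{dpi}",
        f"-sOutputFile={pattern}",
        str(ps_file),
    ]


def convert(ps_file, out_dir, gs="gs"):
    """Run the conversion, failing loudly if Ghostscript is not installed."""
    if shutil.which(gs) is None:
        raise RuntimeError(f"Ghostscript executable {gs!r} not found on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(ghostscript_tiff_command(ps_file, out_dir, gs=gs), check=True)
```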
  • Word & Lynx (Score:3, Interesting)

    by wdr1 ( 31310 ) <wdr1@p[ ]x.com ['obo' in gap]> on Sunday December 08, 2002 @09:47PM (#4841147) Homepage Journal
    Use MS Word to save it as HTML, then run it through lynx -dump to save it as text.

    Although, you may want to give strong consideration to another poster's recommendation of using PDF. (Particularly since you care about formatting.)

    -Bill
  • by Anonymous Coward
    This has got to be one of the dumbest ideas I've heard. This sounds like my former boss, "I want to save this as TEXT FORMAT but I want to keep all the FONTS AND FORMATTING!!!"

    I think the best solution would be to tell them to piss off, and quit.

    Otherwise, save everything as HTML. That's text, and it'll retain enough shit to make your loser bosses happy.

  • I know it's not .txt, but there's Rich Text Format (RTF). It's basically .txt but supports most features of a .doc in Word, so all your formatting remains the same. Find out if you can use it; it will make your life a whole lot easier.
  • by ez76 ( 322080 )
    % man strings
  • by wfrp01 ( 82831 ) on Sunday December 08, 2002 @10:54PM (#4841428) Journal
    use any suitable postscript printer driver, and print to a file.

    you don't need to buy any proprietary software. you will retain all formatting. you can view using free software on virtually any platform you like.

    this will easily work for any other document type you may have also, so you can have one standard archival format for anything and everything. big CADD file? no problem.

    bonus tip. archive or no archive, make sure your documents include text (in the footer, say) that indicates the location and filename of the document. so when you want to work backwards from paper to file, you know where to look. the print date is important, too.

    i don't know why i'm not using caps today...

    • Please tell me you aren't using the whole liter of soda, thats way too light of a mix for just a fifth, 1:1 is a bare minimum, I'd suggest 1.25:1.
      • Good point. I'm a beer man, myself. But I had a few scotch and sodas the other week, and they were pretty damn good. Haven't made them myself, though.

        I don't think you can buy liters of scotch. What are two common sizes of each that would be proportionaly correct, and consumable in a single sitting?
  • Setup a text printer (Score:5, Informative)

    by mhesseltine ( 541806 ) on Sunday December 08, 2002 @10:57PM (#4841445) Homepage Journal
    1. Add a Generic/Text only printer attached to a file port.
    2. Open file in Word
    3. Select print
    4. Select a file location and name
    5. Enjoy your new text file.

    For a variation on this, pick a good color Postscript printer, do the same print to file, then use Ghostscript to convert the PS to PDF

    • by jsse ( 254124 )
      That's not exactly what the poster is asking for, because he doesn't want to click through the menus for every document. However, it's the best solution, because the other converters suggested above could not transform the document so perfectly, as the transformation is part of Word itself.

      Maybe he could write a macro to automate the process?
      • From jsse:

        That's not exactly what the poster is asking for, because he doesn't want to click through the menus for every document. However, it's the best solution, because the other converters suggested above could not transform the document so perfectly, as the transformation is part of Word itself. Maybe he could write a macro to automate the process?

        I couldn't agree more. I'd hate to open each file, select print, and enter a filename. However, I also agree that VBA is probably geared toward this exact problem.

    • The above solution worked at my company - except we needed to convert them to PDF so we sent them to a PDF-converting printer.

      The trick is to set up a macro to print all the documents in a folder/whatever. The easiest way is the following:

      1) Create a new word document. You have to store the Macro somewhere, and a word document is simplest. Call the document "ConvertDocumentsToText".

      2) Add some instructions in the actual text of the document (Notice how you can combine the documentation with the code! ) Generally, the instructions include:
      How do you start the macro
      Where do the files need to be?

      3) Start writing the macro. (Alas! I don't have my Word macro handy, so the best I can do is pseudo-code.):

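      ' Requires a reference to Microsoft Scripting Runtime (Tools > References).
      Dim ofsFileSys As New FileSystemObject
      Dim theFolder As Folder
      Dim theFile As File
      Dim theDocument As Word.Document

      Set theFolder = ofsFileSys.GetFolder("c:\mypath")
      For Each theFile In theFolder.Files
          ' ConfirmConversions:=False suppresses the conversion dialog.
          Set theDocument = Documents.Open(FileName:=theFile.Path, ConfirmConversions:=False, ReadOnly:=True)
          ' Background:=False forces foreground printing.
          theDocument.PrintOut Background:=False
          theDocument.Close SaveChanges:=wdDoNotSaveChanges
      Next

      4) That's pretty much it.

      5) Be careful with TWO issues: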
      Don't use background printing - use foreground printing. Otherwise the print function kicks off a background thread and you'll get thousands of documents open and the machine will grind to a halt.

      Also, be careful with the parameters to the open command. We forced the open even if there were conversion questions (otherwise you get a dialog box that you have to click ).

      -Peter
  • If your intention is to keep your documents archived and searchable (FTR'd), then I would recommend an archiving system (DMS). You don't have to go through the hassle of converting your documents, and at the same time they would be indexed and thereby searchable.
    Most archiving systems support MS Office documents, PDFs, HTML, etc.

    I guess this would save you the time and effort of converting your documents (it might cost you more though :), but hell, money is always available, isn't it, lol)
  • Mariner Software [marinersoftware.com] released DropDoc, which is based on the GPL'ed wvWare [wvware.com] libraries. It converts Word documents to .rtf, which maintains most of the basic formatting of a Word document.

    I have used it and it works fairly well.
  • I can absolutely understand the urge to archive documents in a simple, open, searchable format. I can also understand wanting to preserve basic formatting. But there's no need to stick entirely to plain old ASCII. Yes, you'll be able to read it on any system for the conceivable future, but you're missing out on a lot of benefit which can be gained from adding a small amount of complexity.

    RTF is what I'd recommend. It's an open format, like PDF. Word supports RTF very well and other posters have mentioned automated, third-party apps which can convert all of your Word files into RTFs. Have you ever opened an RTF document in 'vi' or (god forbid) Notepad? It's all there in plain ASCII text. It's just yet another mark-up convention (HTML might do just as well).

    My only reservation about going all RTF is that it may actually retain *too much* detail (such as embedded images and fancy fonts). Now, if you aren't opposed to that, then RTF is definitely for you. But, if you really just want plain old ASCII, I'd recommend listening to the other posters who mentioned using both ASCII text and PDF/PostScript or using a multi-step conversion process such as Word to HTML to ASCII text (although I've found Word's HTML export to be lacking).

    Good luck!
    Jason
  • ^ | Up there somewhere the original poster gave the reason he needs the conversion. He is using proprietary hardware that requires text (or a TIFF) as an intermediate format, which it then puts on microfilm. Enough of the "PDF, RTF, etc. would be a better solution" replies. Text and TIFF are the only options for this guy.
  • Using the Word Object Model [microsoft.com] you can programmatically manipulate Word documents as you choose.
  • OpenOffice (Score:3, Insightful)

    by aminorex ( 141494 ) on Monday December 09, 2002 @01:32AM (#4842101) Homepage Journal
    I've had very good luck using OpenOffice's Save As.

    There are a lot of options, however. I wonder what Google uses?

    One option would be saving as PostScript and using a PostScript-to-text extractor.

    But the best thing would be to relax the POT requirement to allow XML. XML is so trivial to parse that even if XML itself falls from grace with technological advance, the few simple tags you need will certainly be supportable with 10-15 minutes of scripting (if the computers aren't smart enough to respond to your voice command, retrieve the old XML specs, and interpret them well enough to perform the transformation automatically at that future point -- which seems unlikely, given that you use only the most simple and basic of XML idioms).

  • I recently discovered Lotus Notes Word format (.SAM) is text based just like XML. You could use this, or better go for StarOffice/OpenOffice native formats or SGML.
  • Worth a look, jakarta project called: POI [apache.org]: Poor obfuscation Implementation, referring to the quality of file archiving by Msft Office.
    Take a look at HDF "horrible document format" which handles MS Word files.
  • Simple, convert them into openoffice format. Now you have XML files that retain formatting, but may be searched and stored like text files.
  • You could try DOC2TXT [mbn.or.jp].

    HTH

  • We had a similar problem back in 1998 at a client site. Basically, we had to strip out all of the Word macros embedded in a document. And there were tons of them! What we ended up doing is creating a Visual Basic program that:

    1. Opens a directory
    2. Gets a list of files
    3. Instantiates Word within VB and opens one of the documents
    4. Runs a macro to reformat
    5. Uses a Save As
    6. Closes the Word Document with a
    7. Goes to the next file

    Almost any command (maybe all, I never checked ;)) that you can perform in Word can be performed through a VB program remotely. Don't present dialog boxes... They'll want keystroke input. And hide the Word window... it will require O/S overhead that really doesn't help you.

    Note: After running the conversion, take a sample set of documents and validate the conversion. I can't remember where we got the math, but it was like sqrt(n + 1), where n was the number of documents. We converted thousands of documents automatically between Word 95 and Word 97 this way.
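    The spot-check at the end can be sketched in Python. This takes the half-remembered sqrt(n + 1) rule at face value, rounds up, and picks that many converted documents at random for manual review; the function names are made up for the example.

```python
import math
import random


def sample_size(n):
    """Number of documents to review out of n, using the sqrt(n + 1) rule."""
    return min(n, math.ceil(math.sqrt(n + 1)))


def pick_sample(paths, seed=None):
    """Choose the documents to validate by hand; pass a seed to make it repeatable."""
    paths = list(paths)
    rng = random.Random(seed)
    return rng.sample(paths, sample_size(len(paths)))
```

    For a batch of 1000 documents this asks for 32 to be checked.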
