Converting Word Files to Text for Archiving?
Unknown Relic asks: "Our company has large quantities of old, MS Word documents which we are looking to permanently archive. One of the requirements of our archiving process is that the documents be stored in plain text format. Unfortunately we also have another, conflicting requirement: the text files must retain basic formatting information from the original documents, including bullets, indentations and basic table layout. While all of this formatting is possible using plain text, I have not been able to find any tools which do a decent job of retaining the above mentioned formatting during conversion. Even Word's 'Save As' option does a horrible job, though I suppose that's not overly surprising. Has anyone undertaken a project similar to this before? If so, what tools did you find or create to make the job feasible?"
have you tried (Score:4, Insightful)
Re:have you tried (Score:1)
Re:have you tried (Score:2)
The format isn't, but Times New Roman and friends are.
Re:have you tried (Score:2)
fonts -- and in fact, fonts can't be copyrighted [typeright.org]!
While this may change in the future, due to intense
lobbying efforts by special interests and the general
prevailing culture of intellectual property grants --
I will not say "rights" because it is a devaluing abuse
of the word -- the constitutional proviso precluding
ex post facto [cornell.edu] law ensures that all currently existing
fonts will never be copyrighted (within the
U.S.).
Re:have you tried (Score:2)
And few cloned fonts have the same spacing/characters as the fonts they're trying to clone. Actually, most of MS's fonts are "clone fonts", but they differ from the Adobe originals significantly.
Re:have you tried (Score:2)
Even limiting yourself to MS Fonts, there are still a lot
in circulation -- including all the important ones -- from
distributions that occurred before the font EULAs were
introduced.
Re:have you tried ... HTML Tidy (Score:4, Insightful)
Re:have you tried ... HTML Tidy (Score:2)
Good Luck
Re:have you tried ... HTML Tidy (Score:2)
Dude. (Score:1, Funny)
If you want something automated that somehow preserves this information, you'd better find something that understands Word encoding 100%. Nothing besides Word can do that (though things like ClarisWorks come close).
The best you can do is go through the things by hand and transcribe them with whatever concocted plaintext encoding scheme you came up with.
The other solution is to have a deep think about why you are abandoning Microsoft Word in the first place. Is it too difficult to keep one machine around with a copy of Windows for viewing these files?
Re:Dude. (Score:1)
Re:Dude. (Score:1)
I've been asked to do this sort of thing before, and here's the advice I gave: trying to choose a format that will last the years is more dangerous than regularly maintaining your content in a modern format[1]. Keep it in XML for now, and every five years see where you think formats are going and move the data again. The idea of choosing a format now, being stuck with it, and hoping that it will still be readable when you need it is the most dangerous and doomed scenario.
As someone else said, I would use XHTML and UTF-8 (i.e., MS Word's or OO.org's save to HTML, then run HTML Tidy on the result). HTML is a well-understood format with readers on many platforms, and, unlike plain text, it maintains structure.
[1] I don't mean bleeding-edge modern, I mean some documented format such as DocBook, RTF, or XHTML.
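A rough sketch of the "save to HTML, then HTML Tidy it" step, assuming HTML Tidy is installed as `tidy` on the PATH and that the Word documents have already been exported to HTML in UTF-8. The helper names `tidy_command` and `tidy_file` are made up for illustration; the flags are Tidy's own command-line options.

```python
import subprocess
from pathlib import Path


def tidy_command(html_path: Path, out_path: Path) -> list[str]:
    """Build an HTML Tidy command line that rewrites Word's HTML as XHTML."""
    return [
        "tidy",
        "-quiet",               # suppress the informational banner
        "-asxhtml",             # emit well-formed XHTML
        "-utf8",                # assumes the HTML export is already UTF-8
        "--word-2000", "yes",   # strip Word-specific markup
        "-output", str(out_path),
        str(html_path),
    ]


def tidy_file(html_path: Path, out_path: Path) -> int:
    """Run Tidy and return its exit code (0 clean, 1 warnings, 2 errors)."""
    return subprocess.run(tidy_command(html_path, out_path)).returncode
```

Word's HTML export tends to produce plenty of warnings, so an exit code of 1 is probably normal here; only 2 should mean the file needs a look by hand.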
Re:Dude. (Score:3, Funny)
Mod parent up as funny, dude. Best MCSE parody I've seen in weeks.
Seriously, dude.
anti-word (Score:4, Informative)
Well hope that helps.
PDF (Score:2, Informative)
Re:PDF (Score:2)
Re:PDF (Score:1)
Try... (Score:5, Informative)
Having worked a similar problem... (Score:5, Informative)
Long Answer:
This was a few years ago, so it doesn't take into account new applications that could have totally changed the conversion landscape.
I have tried to do something similar in the past for a company that does JUST conversions. We actually tried to get doc -> text with formatting, both with closed-source expensive applications and with open-source apps... not much success on either front.
In the end, we basically gave up on it, and ended up making 2 versions of the document. A PDF version and a text version. The text version was easily searchable, and the PDF version looked great. Both ASCII and PDF are open standards, so that when they are phased out, we should be able to buy/write conversion tools to the next generation. We then wrapped a GUI around it so that when you searched, you searched the text, and you got results in PDF.
The fact that PDF is an open standard is important. When we did this, we used the text files for searching, but nowadays you can get lots of engines to search PDFs directly, so the text conversion may not even be needed.
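For what it's worth, the "search the text, hand back the PDF" idea can be sketched in a few lines. This is not the GUI described above, just an illustration that assumes each document is stored twice in one folder under the same base name (`memo.txt` next to `memo.pdf`); the function name `search_archive` is hypothetical.

```python
from pathlib import Path


def search_archive(archive_dir: Path, term: str) -> list[Path]:
    """Search the .txt copies for `term` and return the matching .pdf twins."""
    needle = term.lower()
    hits = []
    for txt_path in sorted(archive_dir.glob("*.txt")):
        text = txt_path.read_text(encoding="utf-8", errors="replace")
        pdf_path = txt_path.with_suffix(".pdf")
        # Only report documents that have both halves of the pair on disk.
        if needle in text.lower() and pdf_path.exists():
            hits.append(pdf_path)
    return hits
```

A plain substring scan like this is fine for a few thousand files; a larger archive would probably want a real index.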
Sorry I don't have the solution you are looking for... honestly good luck.
Re:Having worked a similar problem... (Score:1)
Re:Archiving? Are you sure? (Score:3, Insightful)
Convoluted Suggestion #1 (Score:2)
3. ???
4. Profit!!
Re:Convoluted Suggestion #1 (Score:1)
Re:Convoluted Suggestion #1 (Score:1)
Re:Convoluted Suggestion #1 (Score:2)
Antiword is a free MS Word reader (Score:1)
RTF (Score:1)
First HTML (Score:2)
The second part is quite easy. If your data is XML, you can convert it using simple scripts (tables might be an issue); if the data is simple HTML, you could use Lynx to convert it into text with some layout.
In fact, I would keep both versions of the data handy. Having a somewhat structured version of the data never hurts, and text files do not take much space.
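As an illustration of the "simple scripts" route, here is a minimal sketch that turns simple HTML into text while keeping bullets and list indentation. It uses only Python's standard `html.parser`; the class and function names are made up, the `* ` bullet and two-space indent are arbitrary choices, and tables are deliberately not handled (they are the hard part, as noted above).

```python
from html.parser import HTMLParser

BLOCK_TAGS = ("p", "br", "div", "h1", "h2", "h3", "h4", "h5", "h6")


class TextLayout(HTMLParser):
    """Render simple HTML as plain text, keeping bullets and list nesting."""

    def __init__(self):
        super().__init__()
        self.lines = []    # finished output lines
        self.prefix = ""   # indent and bullet for the line in progress
        self.words = []    # words collected for the line in progress
        self.depth = 0     # current nesting level of <ul>/<ol>

    def flush(self):
        """Finish the line in progress, if it has any text."""
        if self.words:
            self.lines.append(self.prefix + " ".join(self.words))
        self.prefix = ""
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self.flush()
            self.depth += 1
        elif tag == "li":
            self.flush()
            # Two spaces of indent per nesting level, then a bullet.
            self.prefix = "  " * max(self.depth - 1, 0) + "* "
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in ("ul", "ol"):
            self.flush()
            self.depth = max(self.depth - 1, 0)
        elif tag == "li" or tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        # Whitespace is collapsed; inline tags always end up space-separated.
        self.words.extend(data.split())


def html_to_text(html: str) -> str:
    parser = TextLayout()
    parser.feed(html)
    parser.close()
    parser.flush()
    return "\n".join(parser.lines)
```

Feeding it `<p>Agenda</p><ul><li>Budget<ul><li>Travel</li></ul></li><li>Hiring</li></ul>` gives four lines: `Agenda`, `* Budget`, an indented `* Travel`, and `* Hiring`.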
Re:First HTML (Score:1)
wvWare (Score:5, Informative)
wvHtml: convert your Word document into HTML4.0
wvLatex: convert your Word document into visually (pretty) correct LaTeX
wvCleanLatex: convert into 'cleaner' LaTeX containing less visual mark-up, more suitable for further use and LyX import. Work in progress
wvDVI: converts word to DVI. Requires 'latex'
wvPS: converts word to PostScript. Requires 'dvips'
wvPDF: converts word to Adobe PDF. Requires 'distill' from Adobe [Someone do a pdflatex or pdfhtml version]
wvText: converts word to plain text. Textually correct output requires 'lynx.' For poor output, this doesn't require anything special.
wvAbw: converts word to Abiword format. (Far better just to use Abiword.)
wvWml: converts word to WML for viewing on portable devices like WebPhones and Palm Pilots.
wvRtf: a basic version exists
wvMime: can be plugged as a MIME helper application into your browser/mail client; presents the document on-screen inside GhostView, while all intermediate files generated go into the
Re:wvWare (Score:3, Informative)
If by chance you have any Java around, the POI HDF APIs [apache.org] are great for manipulating that Horrible Document Format.
simple answer. (Score:5, Funny)
There's one simple answer: uuencode.
*ducks*
Re:simple answer. (Score:2)
use a word macro first (Score:4, Informative)
i worked at a company that created all their docs (1100 of them) in word and wanted them to be ported to pagemaker or the like. however, at that time there was no way to do the conversion so we needed to convert them to text first. we basically had the same problem you have.
we ended up using a Word macro to convert all the word styles (font style, indents, bullets) to plaintext equivalents (tabs, _underline_, xbullets) and then saved as text. that worked nicely and i think it's the only way to accurately preserve word-specific styles and formatting.
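The original macro isn't shown, so the following is only a sketch of the mapping idea, written in Python rather than as a Word macro. The `Para` record is a hypothetical stand-in for what a macro would read from each paragraph's style, and the `* ` bullet marker is an assumption (the post doesn't say exactly what its bullets looked like).

```python
from dataclasses import dataclass


@dataclass
class Para:
    """Hypothetical stand-in for one Word paragraph and its style settings."""
    text: str
    indent: int = 0          # indent level, rendered as leading tabs
    bullet: bool = False
    underline: bool = False


def to_plaintext(paras: list[Para]) -> str:
    """Replace each style with a plain-text equivalent, one line per paragraph."""
    lines = []
    for p in paras:
        text = f"_{p.text}_" if p.underline else p.text
        marker = "* " if p.bullet else ""
        lines.append("\t" * p.indent + marker + text)
    return "\n".join(lines)
```

The point is that the conventions are chosen once, up front, and applied mechanically, which is what makes the result consistent across a thousand-plus documents.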
Stellent Outside In (Score:4, Informative)
Re:Stellent Outside In (Score:2)
Re:Stellent Outside In (Score:2)
Why? (Score:1)
Even if storage is a problem, compare the cost of your time to the cost of a few CDRs (or tapes, or whatever)
Re:Why? (Score:4, Informative)
Re:Why? (Score:2)
Word scripts and macros also left as an "exercise for the reader"
Re:Why? (Score:2)
Basic idea is:
Word: print->fax program
fax program saves as TIFF file (perhaps choosing 'fine' resolution etc.)
Same idea with a different implementation would be to print to a 'postscript' printer, then use a postscript to tiff conversion program. This could be fairly well automated with a print queue on a *nix box accepting the postscript 'print' jobs, then simply archiving the resulting tiff files off someplace handy.
Re:Why? (Score:2)
Re:Why? (Score:2)
There's an inexpensive shareware program called "FinePrint" which works as a fake printer driver between your applications and your actual printer. This program was originally created to save you paper, by printing up to eight pages on one side of one sheet, but over time it has gained a lot of nice features.
One of the features that I've used for my online transaction receipts is the option to save the print job(s) to a file or a bunch of files. It will save in the following formats:
When I save as a TIF, I am given the option of monochrome, 4-bit, 8-bit, or 24-bit, and a resolution of 72-1200 or "custom". There is a checkbox for "Create a separate file for each page", so that solves the multi-page problem as well.
I really like this program; I've been a registered user since nearly the beginning, and I've used nearly all of its features at one time or another. And I don't get anything for mentioning it either, so =P
FinePrint [fineprint.com]
Re:Why? (Score:2)
Re:Why? (Score:1)
1 - Print to Postscript from MS Word, and use Ghostscript to convert the PS pages into TIFF files for imaging and archival.
2 - Print to Text Only from MS Word, and send the resulting text files for imaging and archival.
I would go with (1), simply because it takes less skilled review of the work (simple match/no match, instead of "how close is it").
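A possible shape for option (1), assuming Ghostscript is installed and callable as `gs` (on Windows the console executable has a different name) and that Word has already printed each document to a .ps file. The helper names and file names are hypothetical; the switches are standard Ghostscript ones.

```python
import subprocess
from pathlib import Path


def ghostscript_command(ps_path: Path, out_pattern: str,
                        device: str = "tiffg4", dpi: int = 300) -> list[str]:
    """Build a Ghostscript command that renders a PostScript file with `device`."""
    return [
        "gs",
        "-dNOPAUSE", "-dBATCH", "-dSAFER",   # run unattended, then exit
        f"-sDEVICE={device}",                # tiffg4 = monochrome Group 4 TIFF
        f"-r{dpi}",                          # resolution in dots per inch
        f"-sOutputFile={out_pattern}",       # %03d expands to the page number
        str(ps_path),
    ]


def convert(ps_path: Path, out_pattern: str, **options) -> None:
    """Run the conversion and raise if Ghostscript reports a failure."""
    subprocess.run(ghostscript_command(ps_path, out_pattern, **options), check=True)
```

Calling `convert(Path("memo.ps"), "memo-%03d.tif")` should write one TIFF per page; passing `device="pdfwrite"` with a single output name such as `memo.pdf` gives a PDF instead.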
Ratboy
Word & Lynx (Score:3, Interesting)
Although, you may want to give strong consideration to another poster's recommendation of using PDF. (Particularly since you care about formatting.)
-Bill
Don't change the files, Change your Managers (Score:1, Funny)
I think the best solution would be to tell them to piss off, and quit.
Otherwise, save everything as HTML. That's text, and it'll retain enough shit to make your loser bosses happy.
.rtf the way to go (Score:1)
hmm (Score:1)
print to postscript (Score:3, Insightful)
you don't need to buy any proprietary software. you will retain all formatting. you can view using free software on virtually any platform you like.
this will easily work for any other document type you may have also, so you can have one standard archival format for anything and everything. big CADD file? no problem.
bonus tip. archive or no archive, make sure your documents include text (in the footer, say) that indicates the location and filename of the document. so when you want to work backwards from paper to file, you know where to look. the print date is important, too.
i don't know why i'm not using caps today...
OT: Your Sig (Score:1)
Re:OT: Your Sig (Score:1)
I don't think you can buy liters of scotch. What are two common sizes of each that would be proportionally correct, and consumable in a single sitting?
Re:OT: Your Sig (Score:1)
Setup a text printer (Score:5, Informative)
For a variation on this, pick a good color Postscript printer, do the same print to file, then use Ghostscript to convert the PS to PDF
Re:Setup a text printer (Score:2, Informative)
Maybe he could write a macro to automate the process?
Re:Setup a text printer (Score:2)
From jsse:
I couldn't agree more. I'd hate to open each file, select print, and enter a filename. However, I also agree that VBA is probably geared toward this exact problem.
Re: The Macro to help out the printing (Score:2)
The trick is to set up a macro to print all the documents in a folder/whatever. The easiest way is the following:
1) Create a new word document. You have to store the Macro somewhere, and a word document is simplest. Call the document "ConvertDocumentsToText".
2) Add some instructions in the actual text of the document (notice how you can combine the documentation with the code!). Generally, the instructions include:
How do you start the macro?
Where do the files need to be?
3) Start writing the macro. (Alas! I don't have my Word macro handy, so the best I can do is pseudo-code.):
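' Requires a reference to "Microsoft Scripting Runtime" (Tools > References).
Sub ConvertDocumentsToText()
    Dim ofsFileSys As New FileSystemObject
    Dim theFolder As Folder
    Dim theFile As File
    Dim theDocument As Word.Document
    Set theFolder = ofsFileSys.GetFolder("c:\mypath")
    For Each theFile In theFolder.Files
        ' ConfirmConversions:=False forces the open without a conversion dialog.
        Set theDocument = Documents.Open(FileName:=theFile.Path, _
            ConfirmConversions:=False, ReadOnly:=True)
        ' Background:=False prints in the foreground, one document at a time.
        theDocument.PrintOut Background:=False
        theDocument.Close SaveChanges:=wdDoNotSaveChanges
    Next theFile
End Sub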
4) That's pretty much it.
5) Be careful with TWO issues:
Don't use background printing; use foreground printing. Otherwise the print function kicks off a background thread, and you'll get thousands of documents open and the machine will grind to a halt.
Also, be careful with the parameters to the open command. We forced the open even if there were conversion questions (otherwise you get a dialog box that you have to click).
-Peter
Archiving systems - Document Management Systems (DMS) (Score:1)
Most archiving systems support MS Office documents, PDFs, HTML, etc.
I guess this would save you the time and effort of converting your documents (it might cost you more though :), but hell, money is always available, isn't it, lol)
On the Mac... (Score:2)
I have used it and it works fairly well.
RTF *is* text (Score:1)
RTF is what I'd recommend. It's an open format, like PDF. Word supports RTF very well and other posters have mentioned automated, third-party apps which can convert all of your Word files into RTFs. Have you ever opened an RTF document in 'vi' or (god forbid) Notepad? It's all there in plain ASCII text. It's just yet another mark-up convention (HTML might do just as well).
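To make the "it's all plain ASCII" point concrete, here is a tiny sketch that builds a minimal RTF document as an ordinary string. The function name is made up, and it assumes the inputs are ASCII and contain no backslashes or braces, which would need escaping in real RTF.

```python
def minimal_rtf(plain: str, bold: str) -> str:
    """Return a tiny RTF document: a sentence followed by one bold phrase."""
    return (
        r"{\rtf1\ansi\deff0"                   # RTF version 1, ANSI character set
        r"{\fonttbl{\f0 Times New Roman;}}"    # font table with a single font
        r"\f0\fs24 " + plain                   # font 0 at 12 pt (size is in half-points)
        + r" \b " + bold + r"\b0\par}"         # bold on, bold off, end of paragraph
    )
```

Printing `minimal_rtf("Archive copy:", "do not edit")` shows nothing but readable control words and text, and saving it with a .rtf extension should give a file that Word can open.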
My only reservation about going all RTF is that it may actually retain *too much* detail (such as embedded images and fancy fonts). Now, if you aren't opposed to that, then RTF is definitely for you. But, if you really just want plain old ASCII, I'd recommend listening to the other posters who mentioned using both ASCII text and PDF/PostScript or using a multi-step conversion process such as Word to HTML to ASCII text (although I've found Word's HTML export to be lacking).
Good luck!
Jason
Stop with the other formats!!! (Score:2)
Word Object Model... (Score:1)
OpenOffice (Score:3, Insightful)
There are a lot of options, however. I wonder what
Google uses?
One option would be saving as postscript and using
a postscript-to-text extractor.
But the best thing would be to relax the POT requirement
to allow XML. XML is so trivial to parse that
even if XML itself falls from grace with technological
advance, the few simple tags you need will certainly
be supportable with 10-15 minutes of scripting
(if the computers aren't smart enough to respond
to your voice command and retrieve the old XML
specs and interpret them well enough to perform
the transformation automatically at that future point
-- which seems unlikely, given that you use only
the most simple and basic of XML idioms.)
The choices are obvious (Score:2)
I recently discovered that the Lotus Ami Pro word processor format (.SAM) is text based, just like XML. You could use this, or better, go for the StarOffice/OpenOffice native formats or SGML.
POI (Score:1)
Take a look at HDF "horrible document format" which handles MS Word files.
OpenOffice is your friend (Score:1)
Re:OpenOffice is your friend (Score:2)
Perhaps this will do: DOC2TXT (Score:1)
HTH
Use VB automation... (Score:1)