Migration from PDF, MS Word and Frontpage?

l337hx0r asks: "I've just begun work at a government department which, with 24 branches and five years at its disposal, has managed to create 85,000 proprietary documents. The current IT manager doesn't see a problem with this, and the government recommendations of open formats came as quite a surprise. I want to move this intranet away from WYSIWYG to a logical document structure (DocBook, TeX, XHTML), though the migration tools have been quite disappointing. Can it be done?"

  • Yes. (Score:3, Insightful)

    by RGRistroph ( 86936 ) <rgristroph@gmail.com> on Tuesday September 11, 2001 @04:55AM (#2276754) Homepage
    Yes, it can be done.

    The easy part is converting and indexing all the docs. Not that that is easy. What I would do is something along the lines of scripts to convert them to html and put them in a database with a web browsable front end, building indexes of keywords, accompanied by A LOT of manual labour inserting meta-information about each document.
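
    The conversion-plus-keyword-index idea above can be sketched roughly as follows. This is only a minimal illustration, assuming the documents have already been converted to HTML by some other tool; the function names (`html_to_text`, `build_index`, `lookup`), the table layout, and the sample documents are all made up for the example, and a real index would also need the manually entered meta-information mentioned above.

```python
import re
import sqlite3
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Note: this also picks up text inside <script>/<style>, if any.
        self.chunks.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def build_index(docs):
    """Build a keyword -> document table.

    `docs` is a hypothetical mapping of document name to converted HTML.
    """
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE keyword (word TEXT, doc TEXT, PRIMARY KEY (word, doc))"
    )
    for name, html in docs.items():
        words = set(re.findall(r"[a-z0-9]+", html_to_text(html).lower()))
        db.executemany(
            "INSERT OR IGNORE INTO keyword (word, doc) VALUES (?, ?)",
            [(word, name) for word in words],
        )
    db.commit()
    return db


def lookup(db, word):
    """Return the names of all documents containing `word`, sorted."""
    rows = db.execute(
        "SELECT doc FROM keyword WHERE word = ? ORDER BY doc", (word.lower(),)
    )
    return [doc for (doc,) in rows]


# Made-up sample documents standing in for converted files.
docs = {
    "leave-policy.html": "<html><body><h1>Leave policy</h1>"
    "<p>Annual leave rules.</p></body></html>",
    "travel.html": "<html><body><h1>Travel</h1>"
    "<p>Travel policy and claims.</p></body></html>",
}
db = build_index(docs)
print(lookup(db, "policy"))  # ['leave-policy.html', 'travel.html']
```

    A web front end would sit on top of a table like this one; the in-memory database is only there to keep the sketch self-contained.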

    Almost any document editor these days can import and export html.

    The hard part is getting people to start using it. They won't insert their new documents, they won't use it to efficiently look up stuff instead of poking around in their hard drives and email archives, they will just keep doing what got them in trouble in the first place.

    And the only thing you can do about it is get a new job.

    Hope that helps.
  • Which government is recommending open standards? Time for me to move there...
  • by Confused ( 34234 ) on Tuesday September 11, 2001 @10:59AM (#2277581) Homepage
    As a previous poster said, it can be done. Unfortunately, the biggest challenges aren't of a technical nature. Getting conversion filters from all those formats to a new one isn't a problem; in the worst case it's just a matter of money.

    What becomes the real stumbling block in such initiatives is the user. Unless the users are begged, tempted and forced to produce all new documents in the new format, all the work you do will be in vain.

    To get acceptance by all those civil servants, you'll have to

    - either accept MS-Word documents with excel objects and power point presentations added in them,

    - or transparently convert those kinds of documents into a format you like, all while trying to figure out what relevant data is in there,

    - or force the user to use a system that produces properly formatted documents.

    The first solution will have you deal with atrocious documents, sometimes with every line terminated by a paragraph break and tables formatted with spaces. To add insult to injury, you'll find PowerPoint objects in them, which include Excel graphs, just because the user likes to add a little arrow pointing to a column. But at least the users will not complain about the new guidelines, because they can happily go on producing their horrors.

    With the second solution, converting the documents to a storage format, you'll end up with something like PDF, which just renders the document into its own format and so gets rid of all the crap. PDF is well documented and it's quite easy to produce a document reader. This solution is usually acceptable, although you lose the document structure (for the few documents that had one in the first place).

    Forcing the user to produce well-formed documents is the most challenging solution. You will basically alienate most of your customers and, if you don't have decades of experience in interdepartmental infighting and enough power to enforce the new guidelines, you'll lose. On the other hand, if you had that experience and power, you wouldn't think about such a project.

    What works best is to provide a reasonable framework for the documents and strip everything else. That way you take most of the creative potential away from the clients and lock them into a rigid frame. If you have the luck to get that framework sanctioned by a very high authority (which means tons of unproductive meetings with uninterested and clueless people who have their own axes to grind), the users will bitch and moan, but they won't be able to escape. If that official approval is missing, the framework will be ignored.

    You have a very stony road ahead, and depending on your capacity for frustration, you will give up sooner or later. And, when you give up, taking PDF as standard format isn't such a terrible choice. Been there, got burned. Now I'm older.

    Confused (failed crusader for structured documents)

    PS: I love LaTeX and TeX, but trying to introduce them in a Microsoft-oriented place is asking for disaster. You will just exchange unusable Word documents for unusable LaTeX documents. At least the Word documents don't have syntax errors.

    • I love LaTeX and TeX, but trying to introduce them in a Microsoft-oriented place is asking for disaster. You will just exchange unusable Word documents for unusable LaTeX documents. At least the Word documents don't have syntax errors.

      Did you try getting people to use LyX [lyx.org]? Why (not)? Did it work? Why (not)?

      Cheers //Johan

      • No, I didn't try LyX, for the simple reason that at that time LyX didn't exist. This dates back quite some time, and we tried to replace MS Word for Windows 2.0.

        Confused
  • We are having similar problems. Most of the (unnamed) agency's documents are in Word and WordPerfect format, neither of which is terribly portable or searchable. Lots of documentation is rotting on random hard drives, or being lost as machines are surplused.

    The biggest problem is the dual government requirements to post most of these documents on the web, and at the same time make those webpages ADA compliant (which requires separating content from presentation, and simplifying documents so they can be read by a screen reader). Neither Word nor WordPerfect is capable of creating standards-compliant HTML, so the exporters are useless.
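
    As a rough illustration of what "separating content from presentation" can mean for exported HTML, here is a small sketch that drops purely presentational tags and attributes while keeping the structural markup and text. The tag and attribute lists, the class name, and the sample input are assumptions chosen for the example, not a description of what any particular exporter emits, and stripping markup like this does not by itself make a page accessible. Comments and doctype declarations are silently dropped by this sketch.

```python
from html import escape
from html.parser import HTMLParser

# Assumed lists of markup that carries only visual styling.
PRESENTATIONAL_TAGS = {"font", "center", "span", "u"}
PRESENTATIONAL_ATTRS = {
    "style", "class", "align", "bgcolor", "color", "face", "size",
}


class PresentationStripper(HTMLParser):
    """Re-emits HTML without presentational tags and attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in PRESENTATIONAL_TAGS:
            return  # drop the tag itself, keep whatever it contains
        kept = "".join(
            f' {name}="{escape(value or "", quote=True)}"'
            for name, value in attrs
            if name not in PRESENTATIONAL_ATTRS
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag not in PRESENTATIONAL_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text arrives unescaped, so escape it again on the way out.
        self.out.append(escape(data, quote=False))


def strip_presentation(html):
    parser = PresentationStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


# Made-up input in the style of word-processor HTML export.
sample = (
    '<p class="MsoNormal" align="center">'
    '<font face="Arial" size="2">Office hours &amp; contacts</font></p>'
)
print(strip_presentation(sample))  # <p>Office hours &amp; contacts</p>
```

    Something along these lines could replace part of the manual cut-and-paste step, though the output would still need checking by hand.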

    Currently our "documentation management" system appears to consist of a few web developers who cut and paste ASCII text from documents and HTML format it.

    I offered to start an Open Source project that would be an XML-based document database (possibly using the StarOffice DTD as a base), figuring that more than a few organizations could use a tool like this. The responses I received were "but if you leave we won't have a vendor to turn to! Where can we buy a support contract?" and "We don't do internal development, we outsource programming tasks" and "We can't afford to retrain everyone." (Ignoring the fact that they are retraining everyone from WordPerfect to Word as we speak.)

    I gave up.

  • by candot ( 513284 )
    Format migration is required by almost every commercial content- or knowledge-management system, and the more structured and metadata-laden your content, the better as far as these products are concerned. But as others have pointed out, getting content into new formats is only part of the battle; then you have to put an interface on top of it, as well as reorient users to create content in the new system.

    It's a fight, but I haven't given up. I do a lot of consulting around structuring new approaches to information management and migrating content from old formats to new formats. Rather than look for another job, I've been looking for ways to change this one. One approach we've hit on lately is to stop beating ourselves silly migrating format A into highly structured format B. Instead, we leave things the way they are (if open format isn't important) or migrate to a minimally structured format, such as html or very loose xml. Then, instead of relying on a CMS to dictate a production system that produces well-formed documents, we develop systems that provide high-precision navigation across highly unstructured document sets. After all, for most companies, the point of all this work is to improve document access and navigation.

    If you've got a lot of content (and it sounds like you do), it's often better to put less work into fixing past content than into developing systems to handle future content. It's a lot cheaper, at least. The minimal conversion approach, combined with a good navigational overlay, can save a lot of time and money without compromising document access. Done properly, you can start creating new docs in the open format of choice, leave the old stuff alone, and actually improve document access in the process.
