Open Source Document Management and Revision Control?

Ramon M. Felciano asks: "I'm trying to scrape together our development intranet and want to be able to post specs, change reports, etc. that could be in Word, HTML, PDF, etc. Are there any intranet "shells" out there that include at least rudimentary document management facilities, like the ability to post files to a directory structure, receive notifications of new postings, and allow single-click-to-open-inline functionality (rather than download and then open)? I've played with the idea of converting everything to PDF, but there doesn't seem to be a way to automate this from the source format. Also, integration with CVS for version controlling these docs would be a nice bonus. I checked Freshmeat but didn't see any good matches. The closest was Phorum and the other thread/discussion servers, but we don't really need discussion support. Any suggestions?"

  • Hi,

    I've been thinking about this for a long time now, and have found some things, but no real integrated solution as of yet.
    I will list the ideas I had here, in the hope someone will take them and build an open source application out of them, or maybe I'll get some feedback and will start myself:

    -Posting documents:
    -- Can be done using a WebDAV-enabled server, but I don't know whether it has version control and update messages.
    -- CVS could serve as a tool, but from things like MS Office it just doesn't seem too handy.
    -- Have someone update/maintain a database and keep the files on a file server, with their file locations in the database. This is up to now the best idea, if you can afford the person to handle the document archiving. It is also limited to finished documents.

    - Document formats:
    -- PDF seems good, as it has a lot of open source tools to create it and to search it (see Searching below). There are tools to convert everything to PDF, and it has things like copyright protection. IMHO it is the document format to use.
    -- XML could be cool: an open and easy format, just not many WYSIWYG tools to create it. Office 2000 is said to create XML docs, but I haven't checked the format out.

    - Searching (2 options as I see it):
    -- In the cataloguing plan one can easily search the database for the documents.
    -- In other cases, look at umbsearch or ht://Dig, which both search files and can be extended with parsers for PDF and Word documents.
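
    The catalogue idea above (files on a file server, their locations in a database, searching done against the database) could look roughly like the sketch below. It is only an illustration: the SQLite backend, the table layout, and the names open_catalog, add_document and search are assumptions, not something any existing tool provides.

```python
import sqlite3


def open_catalog(path=":memory:"):
    """Create (or open) a catalogue that maps documents to file-server paths."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        " id INTEGER PRIMARY KEY,"
        " title TEXT NOT NULL,"
        " location TEXT NOT NULL,"  # hypothetical column: path on the file server
        " keywords TEXT NOT NULL DEFAULT '')"
    )
    return db


def add_document(db, title, location, keywords=""):
    """Register a finished document; the archivist would call this by hand."""
    cur = db.execute(
        "INSERT INTO documents (title, location, keywords) VALUES (?, ?, ?)",
        (title, location, keywords),
    )
    db.commit()
    return cur.lastrowid


def search(db, term):
    """Return (title, location) pairs whose title or keywords contain term."""
    pattern = f"%{term}%"
    rows = db.execute(
        "SELECT title, location FROM documents"
        " WHERE title LIKE ? OR keywords LIKE ? ORDER BY title",
        (pattern, pattern),
    )
    return rows.fetchall()
```

    This only searches what the archivist typed in; searching inside the files themselves is still a job for a separate indexer.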

    comments at the above email address


  • I have similar needs and have been thinking about this problem for quite some time. I even submitted it to 'Ask Slashdot' but it wasn't lucky enough to make it.

    OK. I have no intranet shells yet, but here is my take on the overall infrastructure.

    Version control

    It can be either CVS or ClearCase or any other decent version control software. I prefer CVS due to its free nature.

    Content

    All our legacy documents are in M$Word format and switching to something new is out of the question. That said, I would like to phase in an XML/XSL-based solution. Take DocBook for example and store (almost) all new documents in XML/DocBook.

    Presentation

    A small CGI based (or a Zope) frontend can easily serve files from CVS through the web server. If the doc is in binary format (Word, Excel, ...) there isn't much the CGI program can do. But if it's in XML it can be presented in HTML or served in PDF, etc.
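
    As a rough sketch of the front-end idea, the function below builds the cvs command that writes one file, optionally at a given revision, to standard output, which a CGI script could then pass on to the browser. The function names, the repository path and the module path are made up for illustration; "checkout -p" and "-r" are standard cvs options, and -Q only silences cvs's own chatter.

```python
import subprocess


def cvs_export_command(cvsroot, path, revision=None):
    """Build the argv for piping one file from the repository to stdout."""
    cmd = ["cvs", "-Q", "-d", cvsroot, "checkout", "-p"]
    if revision:
        cmd += ["-r", revision]  # tag or revision number, e.g. "1.3"
    cmd.append(path)
    return cmd


def fetch_document(cvsroot, path, revision=None):
    """Run cvs and return the file's raw bytes (needs a reachable repository)."""
    result = subprocess.run(
        cvs_export_command(cvsroot, path, revision),
        check=True,
        capture_output=True,
    )
    return result.stdout
```

    A real CGI wrapper would have to validate the requested path and revision before handing them to cvs, and pick a Content-Type from the file extension.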

    Searching

    The free indexing tools can help here, although I am not aware of tools capable of indexing Word docs. But Word can be converted to HTML (search Freshmeat for it), and while the output is quite ugly, that does not matter for indexing purposes.
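
    To illustrate why ugly HTML is good enough for indexing: an indexer only needs the words, so the markup can simply be thrown away. The sketch below does that with Python's standard html.parser; the class and function names are made up, and the Word-to-HTML conversion itself is assumed to have happened already.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible text of an HTML page, ignoring markup."""

    SKIP = {"script", "style"}  # element contents that are never indexable text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Collapse all runs of whitespace so the indexer sees plain words.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the whitespace-normalised text content of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```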

    Problems

    On the server side we need a rock solid XSLT engine and an XSL formatter. The latter is a problem at the moment.

    On the client side a nifty, preferably cross-platform XML editor is needed for the users to create and edit XML docs. This editor should be buzzword compatible, validating, maybe WYSIWYG.

    These are the missing links. I decided to wait for a while with the implementation until XML and its friends (XSL, XSLT) mature a bit. But even now one can start coding the front end. I am thinking about creating a Zope product for this.

  • Searching can be done with the two search engines I mentioned, ht://Dig and umbsearch, which can both use catdoc to read and index Word documents.
    The same could be used in a CGI script to read Word docs, and there are some free Word document readers out there as well, their capabilities differing from Word version to Word version.
    I was thinking of a PHP front end, but Zope sounds OK too.
  • by GleekOid ( 6959 )
    BSCW is a web-based collaboration tool written in Python. It's free for educational/non-commercial use, and very inexpensive for companies. My startup has been using BSCW successfully for exactly what you want, including posting Word docs and other things, change notifications, versioning of documents, etc.

    See http://bscw.gmd.de/

  • Can you expound on this a bit? I've looked at BSCW a few times now and couldn't figure out how it handled versioning of documents... Thanks. Ramon

  • Just as a point of reference, MSOffice 2000 does generate XML, but the only viewers that I know of are ActiveX applets that require a MS Office licence to run...

    I've looked at the XML Office kicks out, and while I have no idea if it's compliant, it seems simple enough to be almost human readable. Does anyone know of a good open source converter for MS XML (to HTML or maybe a Java viewer)?
  • MS Office's XML output is mostly just HTML with CSS. This allows it to be displayed in a normal browser. Layered on top of that, in different XML namespaces, is further structuring information that allows you to read it back into Word without losing anything, kind of like an RTF for XML. Things like structured graphics actually go into the file using MS VML tags.

    So converting to HTML would just be a matter of exporting just the HTML namespace bits. Dead easy with perl's XML::Parser:

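    use XML::Parser;

    # Placeholder: substitute the namespace URI Office assigns to its HTML elements.
    my $HTML_NS = 'html-ns-here';
    my @in_html;    # one flag per open element: is it in the HTML namespace?

    XML::Parser->new(Namespaces => 1, Style => 'Stream')->parsefile($ARGV[0]);

    # In Stream style, $_ holds the tag or text that is about to be emitted.
    sub StartTag {
        my ($expat, $name) = @_;
        push @in_html, ($expat->namespace($name) // '') eq $HTML_NS;
        print if $in_html[-1];
    }
    sub EndTag {
        print if pop @in_html;
    }
    sub Text {
        print if @in_html && $in_html[-1];    # text belongs to the innermost open element
    }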

    Have fun!
  • Granted I haven't checked in a long time, but last time I looked at http://elib.cs.berkeley.edu/ I thought it was on the right track for super-collaborative stuff across the net...
    The formats are probably whack, but the idea is incredible.

    Java interface (last time I checked) and I think the source is downloadable.

    Probably not what you want, but cool none the less.

    willis.

  • MS Office's XML output is mostly just HTML with CSS.

    That's true if you publish static HTML from Office, but if you save an "interactive" version, you get something that looks like this (from Excel):

    <td class=xl15 width=64 style='width:48pt'>One</td> <td class=xl15 width=64 style='width:48pt' x:num="1"></td> <td class=xl15 width=64 style='width:48pt' x:num="2" x:fmla="=B1+1">

    As you can see, it looks dead easy to interpret, but right now the only interpreter I know of is the Office ActiveX controls.
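
    For what it's worth, pulling the values and formulas out of cells like the ones above takes very little code. The sketch below uses Python's standard html.parser rather than an XML parser, because the sample has unquoted attributes and an unclosed td. The attribute names x:num and x:fmla are taken from the sample; everything else (class name, dictionary keys) is made up, and this only reads the cells, it does not evaluate the formulas.

```python
from html.parser import HTMLParser


class CellParser(HTMLParser):
    """Collect text, x:num and x:fmla for every <td> in Office-style HTML."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._current = None  # the cell whose text is being collected, if any

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            attributes = dict(attrs)
            self._current = {
                "text": "",
                "num": attributes.get("x:num"),       # stored numeric value
                "formula": attributes.get("x:fmla"),  # spreadsheet formula
            }
            self.cells.append(self._current)

    def handle_endtag(self, tag):
        if tag == "td":
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data


def parse_cells(html):
    """Return one dict per table cell found in the given HTML fragment."""
    parser = CellParser()
    parser.feed(html)
    parser.close()
    return parser.cells
```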
  • I will have to solve a problem similar to this in the near future. I plan to write all of the documents in LaTeX, and write Makefiles in the various directories that will use latex2html to generate a web readable version, or by a different make command, generate pdf or ps. I will have everything under cvs. I intend that access for posting will be through ssh and cvs -- you can set the CVS_RSH environment variable to tell cvs to use ssh to access a remote repository.

    On the server I will occasionally do a "cvs up; make html" in the web directory. I could also have a CGI that would do the same, but anyone who can post will have an ssh-reachable account and can do it themselves, so it would be a possibly insecure convenience.
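
    The "cvs up; make html" step could be wrapped in a small script, for instance one started from cron. The sketch below is only that: the function names and the web directory are made up, and it assumes the Makefile really has an "html" target as described. The one fixed point is CVS_RSH=ssh, which is how cvs is told to reach a remote repository over ssh.

```python
import os
import subprocess

# The two steps described above, run in order inside the web directory.
REFRESH_COMMANDS = [["cvs", "update"], ["make", "html"]]


def ssh_cvs_env(base_env=None):
    """Return a copy of the environment with CVS_RSH set so cvs uses ssh."""
    env = dict(os.environ if base_env is None else base_env)
    env["CVS_RSH"] = "ssh"
    return env


def refresh_site(web_dir):
    """Update the checked-out copy and rebuild the HTML; stop on first failure."""
    env = ssh_cvs_env()
    for command in REFRESH_COMMANDS:
        subprocess.run(command, cwd=web_dir, env=env, check=True)
```

    Calling refresh_site("/var/www/docs") (a hypothetical path) from a cron job would do the same thing as typing the two commands by hand.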

    People who use Word will be SOL; there will be helpful links to Mac and Windows versions of LaTeX and emacs. It won't make a difference in the long run; they won't ever write anything in LaTeX, and they'll start using Word and at first email, later other servers, to exchange documents. But I believe that you have to fight the good fight; also, given the amount of stuff I have to write, I think that the only way to keep my sanity and any social life at all is to have a good LaTeX template doc and use it for everything.

    Let's review how this would meet your needs:

    --ability to post files to a directory structure: if you are thinking of something point-and-clicky like FrontPage, then you can look at some of the front ends or wrappers to cvs. I use pcl-cvs.el. There are some girly java and tcl/tk things also.

    --receive notifications of new postings: in cvs you can set a "cvs watch". Or the makefiles can blast a message if something is remade when they run. I don't intend to have any sort of automatic notification, because it seems like spam.

    --single-click-to-open rather than download and open: Well, I would browse the document library by checking out a copy from cvs, but I suppose others could look at the web and click on the HTML versions.

    The main problem is that cvs doesn't handle binary format files very well, so Word is kind of out. Oh well. I think that the other people on this project will never use LaTeX and eventually they will put the FrontPage extensions on the Apache web server, but I'll always be able to say I tried to do the right thing from the start.
  • Why XML/XSL? Is it purely a matter of not wanting Word due to the binary format, and looking for another format that has idiot-style editors or can be exported from Word? Or does this offer something I'm missing?

    Under "presentation", you mention a frontend to serve files from CVS through the web server. Doesn't it make more sense to just have the static web directory, and do a cvs update on a regular basis, say with a cron job?
  • I want XML because it is pure content. It can be processed by applications to produce something else. And it can be searched more intelligently than most other formats. For instance, I can say 'Gimme all the docs that reference this other doc' or 'Gimme all docs that have a section with a title "Overview" in it and that section has a subsection called "Python"'.

    Yes, it can be done and probably makes a lot of sense for performance reasons. But I would like to allow the users to retrieve old versions.
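
    To make the 'section "Overview" with a subsection "Python"' query concrete, here is a minimal sketch using Python's standard ElementTree. It assumes DocBook-like markup where sections are nested section elements, each with a title child; the function names are made up, and real DocBook documents may use other element names (sect1, sect2 and so on).

```python
import xml.etree.ElementTree as ET


def has_nested_section(xml_text, outer_title, inner_title):
    """True if a section titled outer_title directly contains one titled inner_title."""
    root = ET.fromstring(xml_text)
    for outer in root.iter("section"):
        if (outer.findtext("title") or "").strip() != outer_title:
            continue
        # Only direct child sections count as subsections here.
        for inner in outer.findall("section"):
            if (inner.findtext("title") or "").strip() == inner_title:
                return True
    return False


def matching_docs(docs, outer_title, inner_title):
    """Given {name: xml_text}, return the sorted names of the matching documents."""
    return sorted(
        name
        for name, xml_text in docs.items()
        if has_nested_section(xml_text, outer_title, inner_title)
    )
```

    The same question asked of a pile of Word files would need a full-text search plus guesswork about what counts as a heading.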
  • Err, that's still just plain XHTML with an extra namespace (marked x: there) for some additional information so that a round trip out to XML and back in won't lose any information. Not a good DTD design for XML output (i.e. it's really just HTML but they get to use the buzzwords without lying!), but it is what it is :)
