Data Storage Networking IT

How To Manage Hundreds of Thousands of Documents? 438

ajmcello78 writes "We're a mid-sized aerospace company with over a hundred thousand documents stored out on our Samba servers that also need to be accessed from our satellite offices. We have a VPN set up for the remote sites and use the Samba net use command to map the remote shares. It's becoming quite a mess, sometimes quite slow, and there is really no naming or numbering convention in place for the files and directories. We end up with mixed casing, all uppercase, all lowercase, dashes and ampersands in the file names, and there are literally hundreds of directories to sort through before you can find the document you are looking for. Does anybody know of a good system or method to manage all these documents, and also make them available to our satellite offices?"
  • by vondo ( 303621 ) on Wednesday June 10, 2009 @05:35PM (#28285707)

    I happen to have written one:

    http://sourceforge.net/projects/docdb-v/ [sourceforge.net]

    could be what you are looking for. Of course, it'll take effort to catalog the documents.

  • Documentum (Score:2, Interesting)

    by trondwn ( 1574085 ) on Wednesday June 10, 2009 @05:37PM (#28285749)
    Use EMC's document solution, where all documents sit in a central database with metadata that describes their content and can be accessed through a caching server from the different sites.
  • Simple answer... (Score:1, Interesting)

    by Anonymous Coward on Wednesday June 10, 2009 @05:38PM (#28285763)

    Hire human beings to sift through it and label each file with a numbering/labeling system devised by your engineers. The human mind is a relatively inexpensive and already well-designed piece of machinery. A few dozen of them, given enough time, can work through those hundreds of thousands of documents and get them sorted correctly. The problem you have is that your material is unsorted and improperly labeled. It is cheaper to hire sufficiently (or even insufficiently) evolved groups of people than to invent a machine capable of sorting it. And, with the economy the way it is, you'll be doing everyone a favor by giving them years of employment. When the Manhattan Project needed to create a large excess of fissile material for the war with Japan, and with all the men away at war, they hired dozens of women to sit at machines: turning knobs, checking meter levels, verifying output. The scientists themselves did not even need to be there; they designed a process and the women were trained in it and followed it.

  • FileNet (Score:5, Interesting)

    by Ohio Calvinist ( 895750 ) on Wednesday June 10, 2009 @05:50PM (#28285985)
    I worked at a place that used FileNet [filenet.com], which is now an IBM product, to do this sort of thing. We had millions of scanned documents in the system. I wasn't personally very impressed with it: whenever anything "bad" happened, you had to call IBM because finding support online was impossible, and even then the support wasn't very good. It was also a very picky system, though it seemed to handle the load well. If you go with it, I strongly encourage running it on UNIX/Oracle, because it screamed "poorly ported" when we used it on Windows/MSSQL. It has an API for integration, but it is also poorly documented and would take some time to integrate into your existing business systems.

    This is more of a rant at this point, but it is a stop-gap solution that allows people to continue to use outdated business processes, storing important data in image formats or in documents scattered about with minimal indexing/search capabilities, rather than as analyzable "data" that can lead to "information." I always take the position that if the goal is something on paper, or the goal is to store something that "was" on paper, it is time to rethink the business process to see if we can automate it, or store/present the data electronically in the first place. The old school fights against it, but no one has ever been able to say it wasn't more efficient in the end, and it enabled IT to say "yes we can" when the next great idea came along versus "here is a stack of papers, figure out $trend."
  • by tlambert ( 566799 ) on Wednesday June 10, 2009 @05:56PM (#28286091)

    Who else read this and thought... working in a satellite office for an aerospace company would involve a lot of cool travel perks?

    -- Terry

  • by flydpnkrtn ( 114575 ) on Wednesday June 10, 2009 @06:01PM (#28286161)
    ...and I found an article [networkworld.com] backing up Alfresco pretty well:

    "You can now stand up an Alfresco Labs server next to a SharePoint Server, and Office will not be able to tell the difference between the two," said John Newton, CTO of Alfresco. "But we are offering considerably more scale than SharePoint can deliver," he said.
  • Alfresco of course! (Score:3, Interesting)

    by thule ( 9041 ) on Wednesday June 10, 2009 @06:14PM (#28286297) Homepage

    It can scale extremely well. It is the backend to Adobe's acrobat.com website! So you know it can handle millions of documents if you need it to. Sharepoint requires MS SQL Server for searching documents. With Alfresco, that feature is built in.

    Sharepoint is teaming software and not really designed for large document repositories. Alfresco has a teaming interface (Alfresco Share) and a more generic document repository interface.

    Alfresco can expose the repository via FTP, SMB, WebDAV, and a web client interface.

  • Re:SharePoint? (Score:5, Interesting)

    by moosesocks ( 264553 ) on Wednesday June 10, 2009 @06:15PM (#28286305) Homepage

    Why should you give sharepoint a chance? Even if it works well, it is proprietary and you are locked in.

    No less proprietary than other similar systems. Getting files in/out of Sharepoint is a fairly trivial process, and the API is open enough to craft your own migration plan if you ever decide to move away from it, given that everything else is equally (or even more) proprietary than Sharepoint.

    MS Office might be proprietary, but is so widespread that it's a 'standard' in its own right -- Sharepoint integrates excellently with Office, and keeps your users happy.

    I'm typically not one to advocate the use of Microsoft products. However, Sharepoint worked just fine when I was using it, and is definitely a huge step up from any of the competing products at the same price-level.

  • WIKI (Score:3, Interesting)

    by unum15 ( 695402 ) on Wednesday June 10, 2009 @06:15PM (#28286313) Homepage
    Maybe not the best solution for this particular job, but man am I glad we started using Dokuwiki for all our scattered documents.
  • Re:SharePoint? (Score:3, Interesting)

    by pete-classic ( 75983 ) <hutnick@gmail.com> on Wednesday June 10, 2009 @06:36PM (#28286513) Homepage Journal

    How does Sharepoint address his problem? It uses the exact same folder/file paradigm that is failing in his existing solution.

    -Peter

  • by jd ( 1658 ) <imipak@yahoGINSBERGo.com minus poet> on Wednesday June 10, 2009 @07:22PM (#28287007) Homepage Journal

    Very true. I'd take a look at DSpace [dspace.org] or Open Library [openlibrary.org] for examples of software designed to handle gigantic numbers of documents and maintain sensible indexes for them.

  • by Anonymous Coward on Wednesday June 10, 2009 @08:09PM (#28287383)

    I personally dealt with an issue like this at the Australian arm of a large international mining equipment manufacturer. I wrote the software solutions mentioned and went on to do my engineering honors project in the area. My first recommendation is: stay away from document management systems. They are bulky and inefficient, and they tend to lock you into "their way" of doing things. As soon as you want something different, you will find yourself stuck. This is a simple problem; don't make it too hard for yourself.

    My solution was multi-layered:
    1) Place exactly 1 person in charge.
    2) Enforce a naming convention. - Our CAD Drafters and Engineers (of which I did both) were notoriously bad at naming their documents correctly. Most of this was ignorance. Document your naming convention and make it well known.
    3) Write or come up with a standardized way of generating document numbers. In my current job as a software engineer I would recommend a simple, incremental numbered approach. Every document, every revision, simply gets a new number. Our engineers did not like this. So we went for a middle ground. Something like XXX-YYY-ZZ.eee, where XXX is the equipment type, YYY is the sub type, ZZ is the revision no, and eee is the extension/file type.
    4) Standardize the way you store your documents. For instance, make a folder structure such as C:\xxx\yyy\XXX-YYY-ZZ.eee
    5) Register ALL documents in a database with location, comments, purpose, revision, author name, etc.
    6) Take the Draftsperson or the Engineer out of the archiving process. I wrote a utility that checks a single "to be archived" folder, fixes obvious mistakes such as using "_" or "." instead of "-" and so on, checks the database to make sure that the document has been registered, and then drops the file into the file system. Make the archive read-only for everyone except the person in charge (and any utilities, of course).
    7) Clean up your existing archive. This can be a semi-automated process. I wrote a utility to do this partially, but it just takes a lot of painstaking effort. With 70,000 documents this was a slow and painful process but it can be done.
    8) STICK TO IT. Any exception will erode the system over time making it useless.
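
    A minimal Python sketch of steps 3, 4 and 6 above. It is not the poster's actual utility. The three-character codes, the two-digit revision, the helper names normalize_name and archive_path, and the in-memory set standing in for the database lookup are all assumptions made for illustration.

```python
import re
from pathlib import PureWindowsPath

# Assumed shape of XXX-YYY-ZZ.eee: three-character equipment type,
# three-character sub type, two-digit revision, then a file extension.
NAME_RE = re.compile(r"^([A-Z0-9]{3})-([A-Z0-9]{3})-(\d{2})\.([a-z0-9]+)$")


def normalize_name(raw: str) -> str:
    """Fix obvious separator mistakes ("_" or "." used instead of "-")."""
    stem, dot, ext = raw.strip().rpartition(".")
    if not dot:
        raise ValueError(f"no file extension: {raw!r}")
    # Collapse runs of "_", "." or whitespace in the stem into a single "-".
    stem = re.sub(r"[_.\s]+", "-", stem).upper()
    return f"{stem}.{ext.lower()}"


def archive_path(raw: str, registered: set[str], root: str = "C:\\") -> PureWindowsPath:
    """Return the archive destination for a file from the "to be archived" folder.

    `registered` is a stand-in for the database check in step 6.
    """
    name = normalize_name(raw)
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"does not follow the convention: {raw!r}")
    if name not in registered:
        raise LookupError(f"not registered in the database: {name}")
    equipment, subtype = match.group(1), match.group(2)
    # Mirrors the C:\xxx\yyy\XXX-YYY-ZZ.eee layout from step 4.
    return PureWindowsPath(root) / equipment.lower() / subtype.lower() / name
```

    With the made-up codes PMP and VAL, archive_path("pmp_val.03.DWG", {"PMP-VAL-03.dwg"}) gives C:\pmp\val\PMP-VAL-03.dwg, while a malformed or unregistered name raises an error instead of being filed.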

  • by vrmlguy ( 120854 ) <samwyse&gmail,com> on Wednesday June 10, 2009 @09:54PM (#28288301) Homepage Journal

    Now I actually LOL'd on that one!

    Getting our userbase to actually give a flying fart about a naming protocol and then getting them to follow it!?

    I won't be holding my breath for either of those two things to happen...

    You obviously don't know how to motivate people. Tell your boss you can get everything renamed for $100/week. Then post a leader board showing who has renamed the most documents each week, and give each week's winner a gift certificate to a local restaurant. Don't let anyone win more than once a month, to prevent too much disruption of normal job duties, and set up some sort of meta-moderation to prevent gaming the system. (You could probably use slashcode out of the box: just make each document a story and suggest better names in the comments.)
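
    The "no more than one win a month" rule can be sketched in a few lines of Python. The function name weekly_winner, the shape of its inputs, and the alphabetical tie-break are assumptions for illustration. Meta-moderation is left out.

```python
from collections import Counter


def weekly_winner(renames, recent_winners):
    """Pick this week's leader-board winner.

    renames: iterable of user names, one entry per accepted rename this week.
    recent_winners: users who already won within the past month (ineligible).
    Returns the eligible user with the most renames, or None if nobody qualifies.
    """
    counts = Counter(renames)
    eligible = [(n, user) for user, n in counts.items() if user not in recent_winners]
    if not eligible:
        return None
    # Highest count wins; ties go to the alphabetically first name.
    best = max(n for n, _ in eligible)
    return min(user for n, user in eligible if n == best)
```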

  • Contract a librarian (Score:1, Interesting)

    by Anonymous Coward on Wednesday June 10, 2009 @11:58PM (#28289099)

    I don't at all mean to be pat or facetious with such a short answer. But, seriously, you're asking the wrong crowd. Librarians have master's degrees in answering just the question you're asking, and it goes far beyond just books. A couple of dozen hours on a consulting contract with a good librarian can set you straight, whether you keep the Samba store or you pony up for document management software. Because if you have a strategy for organizing your information and execute on it, you will reap benefits that don't show up on any productivity spreadsheet. And a good librarian will tailor the system to how the people in your organization actually use the information. Get an internship program going with a library school to have someone remotely do the cleaning and maintenance every once in a while. Whole thing should be doable for a few grand.

  • by SlashWombat ( 1227578 ) on Thursday June 11, 2009 @01:14AM (#28289573)

    I agree - although you might want to eventually implement a systematic method of naming/storing your documents.

    While this seems like a good idea on the surface, it never seems to work very well. Even verbose file names seem to fail miserably, as the first 100 or so letters are always the same (e.g. "Project Tiger Sausage rocket module assembly - Ion injector hardware part 1 ...").

    Then there is the problem of getting all the employees to fully understand directory structures. Just look at your workmates' screens to see how many people save everything on their desktop. (Yes, really a Windows problem ... but so what.)

    I used to get a local WAN search engine, and let it index the entire site. Much more useful as it would find documents most people thought had disappeared years ago.
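
    To show why an index finds files no matter where they were saved, here is a toy inverted index in Python. It is a sketch of the general technique, not of any particular search product. The names build_index and search are made up, and extracting text from the documents is assumed to have happened already.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each lowercased word to the set of paths whose name or text contains it.

    docs: dict of path -> already-extracted text.
    """
    index = defaultdict(set)
    for path, text in docs.items():
        # Index words from both the path and the contents.
        for word in re.findall(r"[a-z0-9]+", f"{path} {text}".lower()):
            index[word].add(path)
    return index


def search(index, query):
    """Return the sorted paths that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)
```

    A query such as "ion injector" then matches a report saved as Doc1.DOC in some forgotten directory, because the match is on the words inside it rather than on its name or location.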

    Another approach would be to have a database that assigns file names for various projects and/or functions, and to mandate that this be the only way files are named for storage on the WAN. This, however, does not get around the thousands of files already stored in weird places under weird names, which is why an already-indexed search engine works so well: it not only extracts the file names but also picks up searches on random (but significant) phrases within the scanned documents. (I used to use "MAMMA"; it worked a treat!) http://www.mamma.com/
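
    A database that hands out file names could look roughly like the Python sketch below, which uses the standard sqlite3 module. The table layout, the column names and the project-function-number naming scheme are all hypothetical. A real deployment would need a shared database server rather than a local file.

```python
import sqlite3


def open_registry(path=":memory:"):
    """Open (or create) the registry; table and column names are hypothetical."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " project TEXT NOT NULL,"
        " function TEXT NOT NULL,"
        " title TEXT NOT NULL,"
        " filename TEXT UNIQUE)"
    )
    return db


def assign_name(db, project, function, title, ext):
    """Register a document and return the file name the database assigns to it."""
    with db:  # commit on success, roll back on error
        cur = db.execute(
            "INSERT INTO documents (project, function, title) VALUES (?, ?, ?)",
            (project, function, title),
        )
        doc_id = cur.lastrowid
        # The sequence number keeps names unique even when titles collide.
        filename = f"{project}-{function}-{doc_id:06d}.{ext}".lower()
        db.execute("UPDATE documents SET filename = ? WHERE id = ?", (filename, doc_id))
    return filename
```

    The descriptive title lives in the database, where it can be searched, while the assigned name stays short and uniform.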

  • by Anonymous Coward on Thursday June 11, 2009 @01:22AM (#28289631)

    Why don't you tell us why EMC feels the need to spew out an entirely new platform every 6-12 months. Oh, that's right, cause their product is so wonderful.

    I used to be a Documentum developer. I've worked in everything from Workspace/Smartspace, through Smartspace Intranet, and into their Webtop/WDK platform.

    IMHO, their product is crap. And expensive crap at that. It solves problems very poorly without extensive customization, and customization is a painful exercise in learning their horribly written and poorly documented development "framework".

    Even worse, once you have things working the way you want, you can almost be guaranteed that your customizations won't work in the next release...which is probably going to happen in the next 2-3 quarters.
