Data Storage Networking IT

How To Manage Hundreds of Thousands of Documents? 438

ajmcello78 writes "We're a mid-sized aerospace company with over a hundred thousand documents stored out on our Samba servers that also need to be accessed from our satellite offices. We have a VPN set up for the remote sites and use the Samba net use command to map the remote shares. It's becoming quite a mess, sometimes quite slow, and there is really no naming or numbering convention in place for the files and directories. We end up with mixed casing, all uppercase, all lowercase, dashes and ampersands in the file names, and there are literally hundreds of directories to sort through before you can find the document you are looking for. Does anybody know of a good system or method to manage all these documents, and also make them available to our satellite offices?"
This discussion has been archived. No new comments can be posted.

  • by Anonymous Coward on Wednesday June 10, 2009 @05:30PM (#28285613)

    http://en.wikipedia.org/wiki/Hummingbird_Ltd

    and

    http://connectivity.hummingbird.com/home/connectivity.html?cks=y

  • Google Appliance (Score:5, Informative)

    by TornCityVenz ( 1123185 ) on Wednesday June 10, 2009 @05:31PM (#28285635) Homepage Journal
  • by Hognoxious ( 631665 ) on Wednesday June 10, 2009 @05:34PM (#28285663) Homepage Journal

    The lack of a naming convention for the filenames and directories is neither here nor there. What matters is how well it's indexed.

    Now I use naming conventions for my files (photos, mp3s, etc.). Am I contradicting myself? No, it's because I don't have enough of them to need a separate index.
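To make the indexing point concrete, here is a minimal sketch of an inverted index in Python. The file names and contents are made up for illustration, and a real deployment would use a proper search engine rather than this toy; the point is only that lookup works however ugly the names are.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def search(index, query):
    """Return the names of documents containing every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical documents with deliberately messy names.
docs = {
    "FINAL_rev2 & notes.DOC": "Wing spar stress analysis, revision 2",
    "wing-SPAR-test.doc": "Wing spar fatigue test results",
    "misc/untitled7.doc": "Landing gear stress report",
}
index = build_index(docs)
```

Calling `search(index, "wing spar")` finds both spar documents even though their names share no convention.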

  • by Swampash ( 1131503 ) on Wednesday June 10, 2009 @05:35PM (#28285691)
  • SharePoint? (Score:5, Informative)

    by tekiegreg ( 674773 ) * <tekieg1-slashdot@yahoo.com> on Wednesday June 10, 2009 @05:36PM (#28285711) Homepage Journal
    I know I'm gonna get hit for blurting out the Microsoft Solution but...give SharePoint a shot...
  • by Anonymous Coward on Wednesday June 10, 2009 @05:38PM (#28285759)

    Sounds like you need a real document management system.

    Depending on your requirements, you could go with something open source like Alfresco or one of the big boys like EMC Documentum or IBM/FileNet P8. Either way, you will end up with an indexed repository of documents that makes it easy to find old documents, add new ones, etc. (assuming you and/or your integrator do the project correctly). It will also provide a web front-end so you don't have as much killer WAN traffic as you do now.

    With a good document management system in place, you are also on your way to having a workflow and other benefits as well. e.g. When Bob submits a document with XYZ as an index value, automatically tell Joe that it is in and ask Joe to approve it. When Joe approves it, tag it "Approved", and let Jim know.

    Depending on your requirements for document retention, archiving, e-discovery, etc. the document management system can help you fulfill all of those automatically.
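The Bob/Joe/Jim workflow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Document` and `Workflow` classes, the routing tables, and the in-memory `outbox` are all hypothetical, and none of it reflects how Alfresco, Documentum or FileNet actually implement workflow.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    name: str
    index_value: str
    submitter: str
    tags: set = field(default_factory=set)


class Workflow:
    """Toy approval workflow: route by index value, tag on approval."""

    def __init__(self, approvers, watchers):
        self.approvers = approvers  # index value -> person who approves
        self.watchers = watchers    # index value -> person told afterwards
        self.outbox = []            # (recipient, message) pairs "sent"

    def notify(self, recipient, message):
        # Stand-in for email; a real system would send a message here.
        self.outbox.append((recipient, message))

    def submit(self, doc):
        approver = self.approvers.get(doc.index_value)
        if approver is not None:
            self.notify(approver, f"{doc.submitter} submitted {doc.name}; please approve")

    def approve(self, doc, approver):
        if self.approvers.get(doc.index_value) != approver:
            raise PermissionError(f"{approver} cannot approve {doc.name}")
        doc.tags.add("Approved")
        watcher = self.watchers.get(doc.index_value)
        if watcher is not None:
            self.notify(watcher, f"{doc.name} was approved by {approver}")


wf = Workflow(approvers={"XYZ": "Joe"}, watchers={"XYZ": "Jim"})
doc = Document(name="spec.doc", index_value="XYZ", submitter="Bob")
wf.submit(doc)
wf.approve(doc, "Joe")
```

After these calls the document carries the "Approved" tag and the outbox holds one message for Joe and one for Jim, in that order.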
     

  • Knowledge Tree (Score:3, Informative)

    by crackervoodoo ( 663384 ) on Wednesday June 10, 2009 @05:39PM (#28285773) Homepage
    http://www.knowledgetree.com/ [knowledgetree.com] If you're looking for a no-cost (read as no license fee) option then Knowledge Tree Community Edition is a decent Document Management tool. We've been using it for a couple of years.
  • by Anonymous Coward on Wednesday June 10, 2009 @05:42PM (#28285829)

    While this may be an odd suggestion, here are two things:
    1) Get yourself a damn good document or content management system. Get it set up on the baddest machines you can afford. Overshoot the capability you need, so that you have room to grow.
    2) Get a librarian to look at the kinds of documents you create, and develop a system to catalog documents while maintaining reasonable standards for file names. As the super simplest system, maybe document names that indicate (at a minimum) what project or what overhead department they belong to, a broad category of subject matter, and if it's versioned, a version number.
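As a rough sketch of the "super simplest" convention in point 2, the Python below builds and parses names of the form `PROJECT_category_short-title_vNN.ext`. That exact layout is an assumption made up for this example, not something from the original post; a librarian would likely pick different fields and separators.

```python
import re

# Hypothetical convention: PROJECT_category_short-title_vNN.ext
NAME_PATTERN = re.compile(
    r"^(?P<project>[A-Z0-9]+)_"
    r"(?P<category>[a-z]+)_"
    r"(?P<title>[a-z0-9]+(?:-[a-z0-9]+)*)_"
    r"v(?P<version>\d{2})"
    r"\.(?P<ext>[a-z0-9]+)$"
)


def build_name(project, category, title, version, ext):
    """Assemble a file name that follows the convention."""
    slug = "-".join(re.findall(r"[a-z0-9]+", title.lower()))
    name = f"{project.upper()}_{category.lower()}_{slug}_v{version:02d}.{ext.lower()}"
    if NAME_PATTERN.match(name) is None:
        raise ValueError(f"cannot build a conforming name from {name!r}")
    return name


def parse_name(name):
    """Return the fields encoded in a conforming name, or None."""
    match = NAME_PATTERN.match(name)
    if match is None:
        return None
    fields = match.groupdict()
    fields["version"] = int(fields["version"])
    return fields
```

A checker like `parse_name` is the useful half: run it over the share and anything that comes back as `None` is a file nobody has catalogued yet.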

    I tried to bludgeon a small company I worked for (around 40 engineers, one overworked Q&A person, and one system administrator) into moving towards a storage system for Word documents that was not "Create a new folder for each version of the document set, place them all in the right folder, and if you don't Ray will eat your head." We wound up using (of all things) Perforce SCM to house fifty thousand Word documents, and were starting on putting actual code revisions for automated test sets into the system when our avionics testing focus became a serious liability, and overhead workers were drastically cut. (Why have one Q&A guy and one system admin guy? We can get an intern to do BOTH!)

  • by jwilkins13 ( 661548 ) on Wednesday June 10, 2009 @05:43PM (#28285861) Homepage
    Sure, with any number of ECM solutions. At the simplest end many of them simply enforce naming conventions; at the more robust end, they support many different file types for viewing, indexing, etc. and can also provide rich metadata on a document-by-document basis. Some of them have been named in the comments, including but certainly not limited to SharePoint 2007, Cygnet, Documentum, Open Text, FileNet, etc. Any system worth looking at has a web-based interface, at least for searching, and many of them offer more meaningful interaction as well. Alfresco, Hyland, and SpringCM all have web-based ECM solutions, and more comprehensive web-based offerings become available all the time.

    Oh - and if you're aerospace there are a number of regulatory requirements for information management you'll need to comply with. That does complicate the situation, but spending the ducats for software and/or consulting help is probably cheaper than whatever your litigation and regulatory audit support processes cost today.

    Hope this helps,
    Jesse Wilkins
    ECM and other stuff consultant
    jwilkins13 at gmail dot com
  • by kiwimate ( 458274 ) on Wednesday June 10, 2009 @05:43PM (#28285869) Journal

    Yes, but it's not that hard to find someone. Buy Hummingbird (now owned by Open Text) or any other Document Management System. You've got a bunch of documents. You need to manage them. Ergo, a document management system.

    Parent makes an excellent point, however: the single most critical component of a successful implementation is to get a skilled* consultant who can work with you to properly define the taxonomy. Everything else flows from there.

    * If you go with Hummingbird DM, "skilled" means "not one of their overpriced professional services people". They're dreadful.

  • Switch to Apple... (Score:4, Informative)

    by Tibor the Hun ( 143056 ) on Wednesday June 10, 2009 @05:46PM (#28285911)

    I only partly jest, I know such a thing is damn near impossible to actually do, but in our Mac shop, such things are trivial. With one click of the mouse we enable spotlight searching on our Leopard AFP server and bam... all the clients have almost instantaneous search access to their docs.

  • WebDav (Score:4, Informative)

    by SplashMyBandit ( 1543257 ) on Wednesday June 10, 2009 @05:48PM (#28285949)
    There are a few options:
    • For relatively unstructured data without versioning you could serve them over HTTP with WebDAV (Apache) and use your existing HTTP security mechanisms. You wouldn't believe how relieved I've often been when I can get my (secured) resources from home-base while located at a client's site.
    • My outfit uses KnowledgeTree for versioned stuff (http://www.knowledgetree.com/)
    • Or you could embrace your dark-side and use Microsoft SharePoint (plus, with all the Microsoft bugs you'd have a job for life until your employer goes bust). If you are a friend to your company you won't do this, plus your outfit has engineers and the good ones can spot trash solutions.

    If your users are naming their files with strange characters in them (assuming it's not due to Samba) then they will just have to live with it; you won't have time to sort out all the weird names that (mostly MS-Word) users give to their files. The primary objective should be to give your users access to the files. Making the directory listing pretty ought to be a secondary concern.

  • Alfresco (Score:2, Informative)

    by SplashMyBandit ( 1543257 ) on Wednesday June 10, 2009 @05:53PM (#28286033)
    I forgot to mention Alfresco as well, although I've never personally tried it.
    http://www.alfresco.com/index-b2.html [alfresco.com]
  • Laserfiche (Score:2, Informative)

    by wguy00 ( 985922 ) on Wednesday June 10, 2009 @05:53PM (#28286045)
    Laserfiche (or LF) is just what this is for. It is DOD, DOJ certified and crap, and is used by all branches of the military and several other areas of the government as their document management system. With several different software offerings, just about any situation can be taken care of. Its features include the ability to search based on document name, template information, or OCR'd text (which the software also takes care of). With add-on features such as Quick Fields, it may be able to automatically sort, add template information, OCR, name and then store the documents. It really is a nice way to go. Satellite offices can access it as either full or read-only users. It has the ability and modules to connect to just about any other type of data/information system (GIS, financial software, etc.) and is very scalable.

    I was a tech for 5 years with a LF VAR. I'm not there anymore. We were constantly cleaning up messes left by other document management systems. Take your time with this thing and really plan your naming convention, folder hierarchy and user setup. It's easier to get it right (or as close to it as possible) than to go back and fix it later. A good LF VAR should help you with this. Definitely check references of competing companies. Some VARs are A LOT better than others.
  • by ak_hepcat ( 468765 ) <slashdot&akhepcat,com> on Wednesday June 10, 2009 @05:58PM (#28286115) Homepage Journal

    Odd that the next story has a great idea for document management right in the summary...

    Hadoop!

  • Re:SharePoint? (Score:3, Informative)

    by moosesocks ( 264553 ) on Wednesday June 10, 2009 @06:08PM (#28286227) Homepage

    Mod parent up. I helped create a tag-based document retrieval system for my former employer using SharePoint. It actually worked quite well.

    Use the right tool for the job. It's got a nice interface (that's also very familiar-looking to most users), scales well, and integrates well with MS Office, which (like it or not) is used by 99.99% of the corporate world. It also handles non-office files just fine.

    That's not to say that Unix-based solutions don't have their place. During the migration, I actually employed a series of shell/python scripts to assist with several of the more mundane aspects of the process. These probably saved us a couple thousand man-hours that would have otherwise been spent categorizing the files.

  • SharePoint (Score:4, Informative)

    by PIPBoy3000 ( 619296 ) on Wednesday June 10, 2009 @06:12PM (#28286265)
    NASA is a big user of SharePoint, strangely enough. My coworkers run into their folks at conferences from time to time.

    I personally am ambivalent about SharePoint. Its roots are in document management, so it seems to do that relatively well. The publishing features are fairly nice as well. I don't think it's the best system for making web sites, but it may some day get there. Currently it feels like a 2.0 product (the magic rule is to never buy anything from Microsoft before 3.0).

    There are gotchas. SharePoint is tightly coupled with your clients. If everyone accessing the documents is using the latest version of Office, you'll be okay. If not, you'll run into problems. You may also need to throw a lot of hardware into SharePoint, as storing files inside of SQL has some built-in inefficiencies.

    Still, some of our users seem to love SharePoint, so it might be a good option for you.
  • by mrmeval ( 662166 ) <.moc.oohay. .ta. .lavemcj.> on Wednesday June 10, 2009 @06:16PM (#28286315) Journal

    http://en.wikipedia.org/wiki/Document_management_system [wikipedia.org]

    For that level of documentation you need to have a staff and get it properly indexed. You need a high level librarian. This would be someone with a master's degree at minimum in library science and at least a bachelor's in information technology. They will not come cheap and they are a long term investment. The software is available, but it is not trivial. Hiring a large number of people to recategorize and tag all the documents for the length of time that takes is also an expense, but worth it. Once it's all in place maintaining it gets much easier.

    I've seen a system developed for Raytheon. They took all the old compartmentalized data Hughes had and put every scrap of paper through a scanner. It was exceptionally well done. The system would display electronic files and would have the location of hard copy. Classified documents were in some cases indexed but were hard copy only afaik. There were some documents that were hard copy only; those were usually ones with an NDA or other restriction on making electronic copies. It had everything mentioned wrt versioning and such. Documents spanned decades with hundreds of revisions and you could pull up and view any revision. Depending on how recent and what type of document, you could view a change log. Older scanned ones did not have that unless they'd been important enough to reenter as modern documents, which meant OCR or manual transcription. Some schematics were reentered into the system in a modern format. The effort was worth it. Having that data is the only way some devices or parts could be made or repaired.


  • by Kadin2048 ( 468275 ) <slashdot.kadin@xox y . net> on Wednesday June 10, 2009 @06:23PM (#28286377) Homepage Journal

    I have a personal bias, but I think IBM's FileNet [ibm.com] would solve this quite neatly. I've done implementations of it that are pretty much exactly what the OP describes.

    Customer has a share that's gotten totally out of control, just stuffed full of files. They want to make them available across multiple offices, generally without getting into complex VPN crap, and also want to simplify management, add more security / compartmentalization, or integrate it with corporate SSI. All doable. Runs on your choice of platforms, too. (Linux, Unix/AIX, Windows all OK as servers.)

    There are even tools that basically take a share drive and walk the directory structure, importing documents at extremely high volume and using the folder structure to categorize and tag the documents within FileNet. It's quite slick and can either be used as a one-shot migration from a traditional fileserver to FileNet, or as an ongoing thing (take all files in a particular directory or set of directories and commit them).
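To show the general idea of walking a share and deriving tags from the folder structure, here is a small Python sketch. It is not FileNet's import tool or API; `walk_share` and `folder_tags` are hypothetical names, and the tagging rule (one lowercase tag per folder level) is an assumption chosen for illustration.

```python
import os
import re
from pathlib import Path


def folder_tags(relative_path):
    """Turn the folders above a file into lowercase tags."""
    tags = []
    for part in Path(relative_path).parent.parts:
        tag = "-".join(re.findall(r"[a-z0-9]+", part.lower()))
        if tag and tag not in tags:
            tags.append(tag)
    return tags


def walk_share(root):
    """Yield (relative path, tags) for every file below root."""
    root = Path(root)
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit folders in a stable order
        for filename in sorted(filenames):
            relative = (Path(dirpath) / filename).relative_to(root)
            yield relative.as_posix(), folder_tags(relative)
```

Each yielded pair could then be handed to whatever import routine the chosen system provides; a file at `Projects/AV-12 Wing/spar.doc` would come out tagged `projects` and `av-12-wing`.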

    Once you have the documents into FileNet you can access them over a web interface or via various desktop clients, and there is a nice API for integrating it with custom in-house applications if that's a requirement. Also, IBM makes some add-ons for Word and Excel (and maybe PowerPoint) that allow you to work directly with items stored in a FileNet repository. Plus, if down the road you want to get into "workflow" (basically building your document management system around your business process), that can be easily bolted on.

    Email is in profile if you want specific case studies or whitepapers, or if you want me to put you in touch with people who do these sorts of things regularly.

  • Re:SharePoint? (Score:3, Informative)

    by glitch23 ( 557124 ) on Wednesday June 10, 2009 @06:42PM (#28286575)

    That's true of any other solution in the same manner as SharePoint. But at least the data is stored in a SQL database and not something proprietary like the MS Exchange information store.

    Although the files are in a database you can change the view in the browser to be "explorer" and access the files using Windows File Sharing-like features (copy/paste) through the browser. This method of access though is an end-run around SharePoint's versioning system. New files can be uploaded in this manner as well. I presume that when you modify an existing document in this way that SharePoint just makes that version the newest one in the actual database. SharePoint is still no substitute for a properly standardized naming convention and folder structure. Yeah you can always do a SharePoint search for what you want but at work I never do searches because we have specific folders where we place stuff and I know that as long as people follow the standard then I can find what I'm looking for and so can everyone else. We don't have thousands of documents though so maybe with documents counted using 6 digits a standard naming convention is asking too much.

  • by sexconker ( 1179573 ) on Wednesday June 10, 2009 @06:52PM (#28286705)

    It's becoming quite a mess, sometimes quite slow, and there is really no naming or numbering convention in place for the files and directories. We end up with mixed casing, all uppercase, all lowercase, dashes and ampersands in the file names, and there are literally hundreds of directories to sort through before you can find the document you are looking for.

    Slow. Upgrade your network and VPN. You know that VPN layer is just killing your performance.

    No naming or numbering convention. Get one.

    Mixed casing. Learn How to Properly Case Folders (and documents).

    Dashes and ampersands. Are they a problem? Aesthetically unpleasant? I personally restrict punctuation in a filesystem to dashes, periods, and parentheses (unless the punctuation is a replicable part of the name of the file/folder).

    Examples:
    01 - The First Track (vocal)
    02 - $lashhvertisements Attack!
    03 - Where Have All the A.C.'s Gone

    Develop your own method that works and be obsessed about it to the point where you would reburn a disc if one of the filenames was "01-Name" instead of "01 - Name".
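A convention like this is easier to keep if a script enforces it. The Python below is a minimal sketch of one possible rule set: only dashes, periods and parentheses survive, and a leading track or sequence number is always followed by " - ". The function name `clean_name` is made up, and the sketch deliberately ignores the exception above for punctuation that is a genuine part of a name, so titles like "$lashhvertisements Attack!" would need a manual override.

```python
import re


def clean_name(name):
    """Normalize a file name to dashes, periods and parentheses only."""
    # Spell out ampersands and treat underscores as spaces.
    name = name.replace("&", " and ").replace("_", " ")
    # Drop anything that is not a letter, digit, space or allowed mark.
    name = re.sub(r"[^A-Za-z0-9 .()\-]", "", name)
    # Force "NN - Title" spacing after a leading number.
    name = re.sub(r"^(\d+)\s*-\s*", r"\1 - ", name)
    # Collapse runs of spaces and trim the ends.
    return re.sub(r" {2,}", " ", name).strip()
```

Running it in report-only mode first (print the old and new names, rename nothing) is probably the safer way to roll it out on a share other people depend on.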

    Hundreds of directories.
    Each file should have its own folder.
    "That's insane!" you say. Start out with this mentality. If there is no reason at all to separate two files (they are part of the same thing) then place them in one folder, and make sure the folder is named all-encompassingly. Repeat for all files. If you get into an AB, BC, but not ABC situation, the solution is to have A and B and C, with A and C linking to B with your choice of shortcut/link/symlink/etc.
    Do this until all files are in folders. Then repeat with folders.

    There is NO substitute for organization and getting people on the same page. Develop some conventions. Task people to fix as they go. Check up to make sure people accessing documents are fixing as they go, and doing so according to convention. Once people are used to the convention, and once things are relatively organized, they won't ever need to search again. They'll instantly know where 99% of things are, and will be able to dig around and find anything else within seconds.

    The main problem you face is getting organized after already being unorganized. It isn't easy, but at least you're not dealing with millions of paper documents.

  • Re:SharePoint? (Score:3, Informative)

    by nighty5 ( 615965 ) on Wednesday June 10, 2009 @07:21PM (#28286987)
    We use SharePoint in a large enterprise. It's pretty good at mashing together websites, but unfortunately it's really poor at search. I think Search 4.0 may improve the situation, but it's nowhere near Yahoo, Google or other search technology. Technology doesn't solve all problems; I'd say this company needs to focus on strengthening business processes and implementing some user awareness programs.
  • by afidel ( 530433 ) on Wednesday June 10, 2009 @07:26PM (#28287049)
    I'd suggest Livelink by OpenText. I know the Air Force uses it, since our Livelink guy worked on their systems before coming to work for us, and they obviously work with large volumes of aerospace related documents! =) That probably means OpenText can find consultants who have already designed and worked with an aerospace taxonomy.
  • by Anonymous Coward on Wednesday June 10, 2009 @08:14PM (#28287443)

    Wow - Slashdot users must all be on their meds this week. Judging by the number of responses that say to buy a google appliance, I judge the paranoia level to be closer to blue than red. Where the hell is the open source insight?

    I guess it's right here from good old Anonymous - ever hear of SOLR http://lucene.apache.org/solr/ ? It's free, it's opensource and even if you hire a consulting company to set up an index of everything you have, you'll pay pennies on the dollar compared to a google appliance! Plus - your soul will remain intact!

    Just google for SOLR consultants and you'll find them, no problem :-)

  • Re:SharePoint? (Score:4, Informative)

    by preystalker ( 591393 ) on Wednesday June 10, 2009 @08:35PM (#28287615)
    I would recommend using Alfresco. Correctly configured and deployed, you could access files via Windows Explorer, WebDAV, web interface, etc. and data is stored in a SQL database. Alfresco uses open standards and should be considered instead of SharePoint.
  • by oldspewey ( 1303305 ) on Wednesday June 10, 2009 @08:36PM (#28287625)

    LMAO ... Documentum is almost dead ... that's why they released V6.5 several months back and are on track to release V7 in H1'10.

    Why don't you tell us all how your "competitor" company scales to tens of millions of documents with high availability, disaster recovery, and content caching across 5 continents?

  • Re:Google Appliance (Score:1, Informative)

    by VTBlue ( 600055 ) on Wednesday June 10, 2009 @08:44PM (#28287693)

    Try Search Server 2008 Express from Microsoft. Although it has no hard limits, it can index up to 1 million documents before you have to scale out. Best of all it is free!

    If you need high availability, redundancy, fail-over or more document support, look at the standard version of the product or consider SharePoint 2007/2010 or FAST.

    http://www.microsoft.com/enterprisesearch/en/us/search-server-express.aspx#none [microsoft.com]

    msg me, if you have questions, I work at Microsoft.

  • by shri ( 17709 ) <shriramc.gmail@com> on Wednesday June 10, 2009 @09:43PM (#28288177) Homepage
    May I also suggest Yahoo/IBM's OmniFind [yahoo.net] as a free as beer alternative?
  • by NacMacFeegle ( 762567 ) on Thursday June 11, 2009 @03:29AM (#28290313)
    Some of the suggestions above say that you should just chuck everything haphazardly into a big pile and then use search engines to trawl the whole mess. I don't buy that. Instead, (like some others) I'd suggest a proper content management system such as the ones from http://www.alfresco.com/ [alfresco.com], http://www.interwoven.com/ [interwoven.com] or http://www.hummingbird.com/ [hummingbird.com].

    The reason for this suggestion is that I know that these systems are being used by organisations which handle, as OP said, hundreds of thousands of documents and which have satellite offices (e.g. large multinational law firms). They provide several benefits: you can structure projects, have both project-related documents and e-mails saved and indexed in the project folders, search across them, and keep proper document version chains (meaning that you can revert to older versions of documents if some klutz breaks a newer version).

    Of course, this means quite an investment, a learning curve for everyone at your company and, most likely, the hiring of an individual with experience of the chosen system.
