Data Storage Networking IT

How To Manage Hundreds of Thousands of Documents?

ajmcello78 writes "We're a mid-sized aerospace company with over a hundred thousand documents stored out on our Samba servers that also need to be accessed from our satellite offices. We have a VPN set up for the remote sites and use the Samba net use command to map the remote shares. It's becoming quite a mess, sometimes quite slow, and there is really no naming or numbering convention in place for the files and directories. We end up with mixed casing, all uppercase, all lowercase, dashes and ampersands in the file names, and there are literally hundreds of directories to sort through before you can find the document you are looking for. Does anybody know of a good system or method to manage all these documents, and also make them available to our satellite offices?"

  • by Shatrat (855151) on Wednesday June 10, 2009 @05:29PM (#28285597)
    Isn't this the sort of thing that a google search appliance would be helpful for? Then you don't need to know the exact filename, just some specific information that can identify the file. This certainly solved my problem with having thousands of emails.
  • by Sir_Lewk (967686) on Wednesday June 10, 2009 @05:33PM (#28285657)

    and there is really no naming or numbering convention in place for the files and directories.

    I think you already know the answer.

  • by HikingStick (878216) <{z01riemer} {at} {}> on Wednesday June 10, 2009 @05:34PM (#28285667)
    If they're going to consider Hummingbird, they need to be ready to cough up the dollars to get an *EXPERIENCED* Hummingbird administrator. If not, the product will be set up, but basic search functionality will be hosed because of some of the same issues in the original problem description (arising from differences in how the documents' properties sheets are populated). If done well, it can be fantastic. If not, users will hate it and do everything possible to avoid it (including installing their own NAS devices).
  • by flydpnkrtn (114575) on Wednesday June 10, 2009 @05:35PM (#28285697)
    Or some other corporate content management system
  • by Wrexs0ul (515885) on Wednesday June 10, 2009 @05:39PM (#28285771) Homepage

    Most print companies like Xerox have their own proprietary Document management tools you can buy, and a bunch of CRM and ERP solutions (like OpenERP - it's free AND Open Source) provide some good simple document searching and indexing tools.

    Really it comes down to how complex you want searching to be? Are there specific keys in the document you could index by? Do you require the full-text search capabilities of a Google search appliance?

    A really good solution I've come across for some clients in Edmonton is called MetalTrace by Trace Applications. Don't let the name fool you about the specificity: software like this can scan, index, and even read barcodes on all sorts of documents, then let people search for them via the web. Their "killer app" has multiple user-defined document types with multiple search fields; combined with some back-filing (digital and scanning), it really saved the day.

    Do your research though on "Document management" and see what product best fits your needs. It's a really well established field so reinventing the wheel is a little masochistic... not that there's anything wrong with that. ;)


  • Re:SharePoint? (Score:4, Insightful)

    by goffster (1104287) on Wednesday June 10, 2009 @05:39PM (#28285777)

    Why should you give sharepoint a chance? Even if it works well, it is proprietary and you are locked in.

  • by fxdgear (1574093) on Wednesday June 10, 2009 @05:47PM (#28285923)
    I'm gonna say nothing beats a proper folder structure and naming convention. I'd also recommend using svn. Also spend some time to develop some macros to assist in the creation/saving/retrieval of said documents from the repository. Maybe create some standard templates too... just my 2cents!
  • by Vroom_Vroom (29347) on Wednesday June 10, 2009 @05:50PM (#28285987)

    Hire a document manager / clerk person who will create order. Your engineers won't.

  • Re:Google wave (Score:3, Insightful)

    by Gerzel (240421) * <.brollyferret. .at.> on Wednesday June 10, 2009 @05:51PM (#28285995) Journal

    Or better yet, talk to people who've done it before. I mean, seriously, there have been organizations managing hundreds of thousands of documents since the Roman Era; it's nothing new.

  • by nine-times (778537) on Wednesday June 10, 2009 @06:05PM (#28286193) Homepage

    Yeah, some people mentioned Google appliances, which I suppose is a sort-of solution. I've never used one of those internally, but I wouldn't trust that to be the end-all solution to your organizational problems. What if there's a file that Google can't read or gather good metadata for? What if you're searching for common terms, and the file you're looking for is on the 75th page? What if you're not remembering the correct search parameters and so your file just isn't turning up in your searches?

    There's really no substitute yet for real organization and discipline. The first thing you should do is define your needs/parameters. Does everyone from every site need read access to all files? Do they all need write access? Most likely, the answer to both of these questions is "no", so narrow it down to specifically "who needs access to what". That will help you figure out the rest of these things. Also ask, who needs to be able to find which documents under which circumstances? What information will they have? You're going to want to use those pieces of information in your organization so that people can intuitively find the files that they need, without necessarily needing to see everyone else's files.

    Come up with a hierarchical organization for your files, requesting user input if appropriate. Then create a directory structure that matches it. Make sure you've communicated the organization clearly to your users, and try to get them to use it.

    If necessary, use directory permissions to try to restrict writing files to appropriate places. For example, if you break down the file structure by particular engineering groups or departments, then only provide write access to members of that group or department. Designate the head of that department as the person responsible for organization within that folder. If need be, restrict write access in a particular folder to only one person, and make that person responsible for checking files in and maintaining the organization for the group or department. Do the same sort of control with individual satellite sites, if appropriate.

    Be a little tiny bit of a control freak, but you might want to give people a particular folder share where they can transfer files in a more freeform manner in a pinch. Someone might want to share one particular file, back something up for a minute, or whatever, but make it clear that this share is completely insecure and temporary. Let people know that everyone has access to that share, anyone can delete any file, you won't be backing it up, and in fact you might be clearing it out (deleting it) on a regular basis. Make a habit of deleting it all on a regular basis, or people will start dumping everything there to sidestep the organization. To be careful, you might want to actually move everything into a non-shared folder for a week, and then delete it later, so if someone shows up and says, "Oh crap! You deleted business-critical information!" you can sigh, and say, "I'll see what I can do, but you really shouldn't store business-critical data there."

    So, to go back and summarize: Come up with an organization, stick to it, enforce it, and retrain your users to use it properly.
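
The scratch-share routine described above (move everything into a non-shared holding folder, then delete it about a week later) could be scripted roughly as follows. This is only a sketch under assumptions: the function names, the dated batch-folder layout, and the one-week retention constant are illustrative and not something the poster specified, and the batch age is taken from the holding folder's modification time.

```python
import os
import shutil
import time

# Assumption: quarantined files are kept for one week before deletion.
HOLD_SECONDS = 7 * 24 * 60 * 60


def sweep_scratch(scratch_dir, holding_dir, now=None):
    """Move everything out of the scratch share into a dated holding batch.

    Returns the batch folder path, or None if the share was already empty.
    """
    now = time.time() if now is None else now
    entries = os.listdir(scratch_dir)
    if not entries:
        return None
    stamp = time.strftime("%Y%m%d-%H%M%S", time.gmtime(now))
    batch = os.path.join(holding_dir, stamp)
    os.makedirs(batch, exist_ok=True)
    for entry in entries:
        shutil.move(os.path.join(scratch_dir, entry), os.path.join(batch, entry))
    return batch


def purge_holding(holding_dir, now=None, max_age=HOLD_SECONDS):
    """Delete holding batches older than max_age; return the names removed."""
    now = time.time() if now is None else now
    removed = []
    for name in sorted(os.listdir(holding_dir)):
        path = os.path.join(holding_dir, name)
        # A batch folder's mtime is set when files were last moved into it.
        if os.path.isdir(path) and now - os.path.getmtime(path) > max_age:
            shutil.rmtree(path)
            removed.append(name)
    return removed
```

Running `sweep_scratch` on a schedule and `purge_holding` after it gives the grace period the comment recommends: a file that someone "lost" can still be fished out of the holding folder until its batch ages out.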

  • by gwait (179005) on Wednesday June 10, 2009 @06:27PM (#28286429)

    I agree - although you might want to eventually implement a systematic method of naming/storing your documents.
    The Google appliance (or some other reasonably fast "WAN" search tool) would let you find files in the current rat's nest "as is", making it easier to organize them to the new "standard".

  • SharePoint wiki (Score:1, Insightful)

    by Anonymous Coward on Wednesday June 10, 2009 @06:34PM (#28286499)

    I know I'm gonna get hit for blurting out the Microsoft Solution but...give SharePoint a shot...

    Just avoid the wiki functionality like the plague. It completely sucks.

  • by Gary W. Longsine (124661) on Wednesday June 10, 2009 @06:35PM (#28286505) Homepage Journal
    Since your organization probably has Windows clients, you can only long for something as nice as Mac OS X Spotlight Server.

    Google Search Appliance is definitely what you want.

    If you have a mid-sized company you definitely don't have a surplus of highly talented systems administrators lying about to run one of the document management systems that others here are likely to suggest. Be very careful going down the document management server path. It's far, far more work than you think it will be, and more than the vendor will tell you it is. Not simply more work for you, but for your IT staff and your users, too.

    The Google Search Appliance, by contrast, is "fire and forget". Plug it in. Turn it on. Patch it when Google suggests you do so. That's about it.
  • Re:SharePoint? (Score:3, Insightful)

    by Anonymous Coward on Wednesday June 10, 2009 @06:43PM (#28286589)

    What you say:

    Why should you give sharepoint a chance? Even if it works well, it is proprietary and you are locked in.

    What you mean:

    Regardless of how perfect a solution might be for you, if it doesn't conform to MY personal ideological viewpoint, it shouldn't be given a chance.

    God I hate people like you.


  • by MyDixieWrecked (548719) on Wednesday June 10, 2009 @06:47PM (#28286623) Homepage Journal

    Most print companies like Xerox have their own proprietary Document management tools you can buy

    Document management software is great, but when you have enormous numbers of documents (100s of thousands like in the summary), it becomes necessary to have a content management system in place. Something that's intelligent enough to break the documents up into pieces and allow searches, but something more robust than full-text search.

    We've been using this software called MarkLogic Server. It's an XML database and has a content processing framework for document ingestion. So, basically, assuming that documents are structured similarly, they can be converted into XML so they can be queried with custom weights being applied to content in different portions of the document. The software has built-in Word support, so it'll automatically convert .doc files with proper formatting, and it lets you add custom handlers for other formats, including plaintext.

    We're currently managing a couple million documents and generating dynamic documents on the fly for some processes. Since on-the-fly documents may take time to generate, we have a system in place that saves the result in the database which can also be queried at a later date. It's all really cool.

    Of course, there's a bit of a learning curve to writing your own software for it since it uses XQuery, but it's not much harder to learn than SQL, and so far, it seems to be far more powerful.

    Disclaimer: I'm not a shill nor am I being paid in any way by MarkLogic... I'm just seriously blown away by what their technology has enabled us to do.

  • by dimeglio (456244) on Wednesday June 10, 2009 @06:52PM (#28286701)

    Skilled consultants are great but without training employees you'll keep on paying big $ for consultants whenever there's a change to make. Let the consultant show how and let the employees do the work. BTW: We have 3000+ users (all happy) on their system and no consultant.

  • by liquidsin (398151) on Wednesday June 10, 2009 @07:04PM (#28286831) Homepage

    use your users, if you can. i'm just talking out my ass here, but i'd think it a not-too-difficult matter to add some sort of user input form along the lines of "hey, now that you've found the document you need, does the name fit the new naming scheme? if not, why not rename it so it fits!". this is assuming you can trust your userbase not to be asshats and to be able to follow the naming protocol.
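
A "does the name fit the new naming scheme?" check like the one suggested above could be as small as one regular expression. The scheme below (`<dept>_<6-digit number>_<title>_r<2-digit revision>.<ext>`, all lowercase) is purely hypothetical, as are the names `NAME_RULE` and `check_name`; the thread never defines an actual convention, so treat this as a sketch to adapt.

```python
import re

# Hypothetical convention: dept_number_title_rNN.ext, all lowercase,
# e.g. "eng_004512_wing_spar_r03.pdf". Nothing in the thread mandates this.
NAME_RULE = re.compile(
    r"^(?P<dept>[a-z]{2,5})_"                  # short department code
    r"(?P<number>\d{6})_"                      # six-digit document number
    r"(?P<title>[a-z0-9]+(?:_[a-z0-9]+)*)_"    # underscore-separated title words
    r"r(?P<rev>\d{2})"                         # two-digit revision
    r"\.(?P<ext>[a-z0-9]+)$"                   # lowercase extension
)


def check_name(filename):
    """Return the parsed fields if filename fits the scheme, else None."""
    match = NAME_RULE.match(filename)
    return match.groupdict() if match else None
```

A prompt shown to the user after a search could call `check_name` on the file they opened and only nag when it returns `None`, which keeps the interruption limited to documents that actually need renaming.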

  • by CozmicCharlie (1471823) on Wednesday June 10, 2009 @07:12PM (#28286897)
    Now I actually LOL'd on that one! Getting our userbase to actually give a flying fart about a naming protocol and then getting them to follow it!? I won't be holding my breath for either of those two things to happen...
  • Re:WebDav (Score:1, Insightful)

    by Anonymous Coward on Wednesday June 10, 2009 @07:44PM (#28287195)

    The weird characters could easily be taken care of by something like Ant Renamer (even supports RegEx). Just replace the weird ones with an underscore or some other suitable character.
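
For anyone who would rather script the cleanup than use a GUI renamer, the same regex idea can be sketched in a few lines. The set of "safe" characters, the lowercase rule, and the function names here are assumptions for illustration; `plan_renames` only reports what would change and renames nothing, so the result can be reviewed first.

```python
import os
import re
import unicodedata

# Assumption: letters, digits, dot, underscore, and hyphen count as "safe".
_UNSAFE = re.compile(r"[^A-Za-z0-9._-]+")


def sanitize_name(name):
    """Return a cleaned file name: lowercase, unsafe runs replaced by '_'."""
    # Split off the extension so it can be lowercased separately.
    stem, dot, ext = name.rpartition(".")
    if not dot or not stem:  # no extension, or a dotfile such as ".profile"
        stem, ext = name, ""
    # Drop accents and other non-ASCII marks before applying the regex.
    stem = unicodedata.normalize("NFKD", stem).encode("ascii", "ignore").decode("ascii")
    stem = _UNSAFE.sub("_", stem).strip("_").lower()
    return f"{stem}.{ext.lower()}" if ext else stem


def plan_renames(root):
    """Yield (old_path, new_path) for files whose names would change."""
    for dirpath, _dirs, files in os.walk(root):
        for fname in files:
            new = sanitize_name(fname)
            if new != fname:
                yield os.path.join(dirpath, fname), os.path.join(dirpath, new)
```

Two different originals can sanitize to the same name, so a real run would want to check the planned targets for collisions before calling `os.rename`.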

  • Re:Google wave (Score:5, Insightful)

    by Anarchduke (1551707) on Wednesday June 10, 2009 @08:51PM (#28287745)
    There is a whole profession dedicated to this, and there is a major in college specifically designed to assist in organizing documents into meaningful collections.

    I suggest your company look at hiring a library sciences major, since this is what they do.
  • by Anonymous Coward on Thursday June 11, 2009 @06:09PM (#28301301)

    Now I actually LOL'd on that one!

    Getting our userbase to actually give a flying fart about a naming protocol and then getting them to follow it!?

    I won't be holding my breath for either of those two things to happen...

    You obviously don't know how to motivate people. Tell your boss you can get everything renamed for $100/week. Then post a leader board showing who has renamed the most documents each week, and give each week's winner a gift certificate to a local restaurant. Don't let anyone win more than once a month, to prevent too much disruption of normal job duties, and set up some sort of meta-moderation to prevent gaming the system. (You could probably use slashcode out of the box, just make each document a story and suggest better names in the comments.)

    Laughing my ass off, here...

    What do you think an enterprise working environment is? College dorm?

    "Document renamer of the week"?

    You are either still in high school or in Dilbert-like upper management to think that this has any remote chance of working (or makes any sense).
