Data Storage Networking IT

How To Manage Hundreds of Thousands of Documents?

ajmcello78 writes "We're a mid-sized aerospace company with over a hundred thousand documents stored out on our Samba servers that also need to be accessed from our satellite offices. We have a VPN set up for the remote sites and use the Samba net use command to map the remote shares. It's becoming quite a mess, sometimes quite slow, and there is really no naming or numbering convention in place for the files and directories. We end up with mixed casing, all uppercase, all lowercase, dashes and ampersands in the file names, and there are literally hundreds of directories to sort through before you can find the document you are looking for. Does anybody know of a good system or method to manage all these documents, and also make them available to our satellite offices?"
This discussion has been archived. No new comments can be posted.

  • by Shatrat ( 855151 ) on Wednesday June 10, 2009 @05:29PM (#28285597)
    Isn't this the sort of thing that a google search appliance would be helpful for? Then you don't need to know the exact filename, just some specific information that can identify the file. This certainly solved my problem with having thousands of emails.
    • Re: (Score:3, Funny)

      by mikael ( 484 )

      Just put everything up on a P2P server - then everyone can look for the documents at the same time as they are looking for their favourite Linux distro.

    • Re: (Score:3, Insightful)

      by gwait ( 179005 )

      I agree - although you might want to eventually implement a systematic method of naming/storing your documents.
      The google appliance (or some other reasonably fast "WAN" search tool) would let you find files in the current rat's nest "as is", making it easier to organize them to the new "standard".

      • by liquidsin ( 398151 ) on Wednesday June 10, 2009 @07:04PM (#28286831) Homepage

        use your users, if you can. i'm just talking out my ass here, but i'd think it a not-too-difficult matter to add some sort of user input form along the lines of "hey, now that you've found the document you need, does the name fit the new naming scheme? if not, why not rename it so it fits!". this is assuming you can trust your userbase not to be asshats and to be able to follow the naming protocol.

        • by CozmicCharlie ( 1471823 ) on Wednesday June 10, 2009 @07:12PM (#28286897)
          Now I actually LOL'd on that one! Getting our userbase to actually give a flying fart about a naming protocol and then getting them to follow it!? I won't be holding my breath for either of those two things to happen...
          • by vrmlguy ( 120854 ) <samwyse AT gmail DOT com> on Wednesday June 10, 2009 @09:54PM (#28288301) Homepage Journal

            Now I actually LOL'd on that one!

            Getting our userbase to actually give a flying fart about a naming protocol and then getting them to follow it!?

            I won't be holding my breath for either of those two things to happen...

            You obviously don't know how to motivate people. Tell your boss you can get everything renamed for $100/week. Then post a leader board showing who has renamed the most documents each week, and give each week's winner a gift certificate to a local restaurant. Don't let anyone win more than once a month, to prevent too much disruption of normal job duties, and set up some sort of meta-moderation to prevent gaming the system. (You could probably use slashcode out-of-the-box, just make each document a story and suggest better names in the comments.)

      • Re: (Score:3, Interesting)

        I agree - although you might want to eventually implement a systematic method of naming/storing your documents.

        While this seems like a good idea on the surface, it never seems to work very well. Even verbose file names seem to fail miserably, as the first 100 or so letters are always the same (e.g. "Project Tiger Sausage rocket module assembly - Ion injector hardware part 1 ...>

        Then there is the problem of getting all the employees to fully understand directory structures. Just look at your workmates' screens to see how many people save everything on their desktop. (Yes, really a Windows problem ... but so what.)

    • by shri ( 17709 ) <.moc.liamg. .ta. .cmarirhs.> on Wednesday June 10, 2009 @09:43PM (#28288177) Homepage
      May I also suggest Yahoo/IBM's OmniFind [yahoo.net] as a free as beer alternative?
  • Google Appliance (Score:5, Informative)

    by TornCityVenz ( 1123185 ) on Wednesday June 10, 2009 @05:31PM (#28285635) Homepage Journal
    • by Gary W. Longsine ( 124661 ) on Wednesday June 10, 2009 @06:35PM (#28286505) Homepage Journal
      Since your organization probably has Windows clients, you can only long for something as nice as Mac OS X Spotlight Server [apple.com].

      Google Search Appliance is definitely what you want.

      If you have a mid-sized company you definitely don't have a surplus of talented systems administrators lying about to run one of the document management systems that others here are likely to suggest. Be very careful going down the document management server path. It's far, far more work than you think it will be, and more than the vendor will tell you it is. Not simply more work for you, but for your IT staff and your users, too.

      The Google Search Appliance, by contrast, is "fire and forget". Plug it in. Turn it on. Patch it when Google suggests you do so. That's about it.
  • by Daimanta ( 1140543 ) on Wednesday June 10, 2009 @05:33PM (#28285655) Journal

    Store it on a single FAT32 partition and hope for the best. Only meant for people with guts or really really nice bosses.

  • by Sir_Lewk ( 967686 ) <sirlewkNO@SPAMgmail.com> on Wednesday June 10, 2009 @05:33PM (#28285657)

    and there is really no naming or numbering convention in place for the files and directories.

    I think you already know the answer.

    • Comment removed (Score:5, Informative)

      by account_deleted ( 4530225 ) on Wednesday June 10, 2009 @05:53PM (#28286037)
      Comment removed based on user account deletion
      • You can look also at OpenDMS [sourceforge.net]. It's not very active lately but might have a good core that you can expand on.

    • In all honesty, I tend to agree with what you're implying. A database solution is great, if you put it into place immediately, otherwise you have to spend a lot of time getting all of the items into the database and properly tagged and sorted.

      One way or another the work is going to have to be done. The relevant questions are how easily it will be maintained, how it will handle increases in size, and how easily it can be backed up.

      I'm doing this sort of thing right now with my digital images. Thankfully,
      • otherwise you have to spend a lot of time getting all of the items into the database and properly tagged and sorted

        Document clustering software can make that less painful by giving an overview of what you have (possibly hierarchical), and allowing you to categorize dozens (or even thousands) of related documents with a single mouse click. Blatant plug: Clustify [cluster-text.com].

    • Re: (Score:3, Insightful)

      by nine-times ( 778537 )

      Yeah, some people mentioned Google appliances, which I suppose is a sort-of solution. I've never used one of those internally, but I wouldn't trust that to be the end-all solution to your organizational problems. What if there's a file that Google can't read or gather good metadata for? What if you're searching for common terms, and the file you're looking for is on the 75th page? What if you're not remembering the correct search parameters and so your file just isn't turning up in your searches?

      There'

    • by CorporateSuit ( 1319461 ) on Wednesday June 10, 2009 @06:59PM (#28286777)
      No kidding, men are practically born with this instinct.

      The most basic is dividing the images up according to hair color or the number of girls appearing in each photo. Then you usually divide them up between hardcore and softcore, type of performance, fetish, etc. For your favorites, you can keep a folder in the home directory, of course. I know this guy works for an aerospace company, but keeping track of 500,000+ files isn't rocket science! We've all been able to do that since the advent of the 200GB harddrive.
  • by Hognoxious ( 631665 ) on Wednesday June 10, 2009 @05:34PM (#28285663) Homepage Journal

    The lack of a naming convention for the filenames and directories is neither here nor there. What matters is how well it's indexed.

    Now I use naming conventions for my files (photos ,mp3s etc). Am i contradicting myself? No, it's because I don't have enough of them that I need a separate index.
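    To make the "index, not names" point concrete, here is a minimal sketch of an inverted index in Python. It assumes the text of each document has already been extracted somehow; the sample file names and contents are made up for illustration, and a real search tool does far more (ranking, stemming, metadata).

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of document names that contain it.

    `documents` is a dict of {name: extracted_text}. How the text is
    extracted from Word, PDF, etc. is outside this sketch.
    """
    index = defaultdict(set)
    for name, text in documents.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def search(index, query):
    """Return the names of documents that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical, badly named files: the index does not care what they are called.
docs = {
    "FINAL-rev2 & notes.DOC": "Ion injector hardware assembly notes",
    "misc_0042.txt": "Travel policy for satellite offices",
}
index = build_index(docs)
print(search(index, "ion injector"))
```

    The file names never enter the lookup, which is the whole argument: once the contents are indexed, an ugly name costs you nothing at search time.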

  • by Swampash ( 1131503 ) on Wednesday June 10, 2009 @05:35PM (#28285691)
  • by flydpnkrtn ( 114575 ) on Wednesday June 10, 2009 @05:35PM (#28285697)
    Or some other corporate content management system
    • Re: (Score:2, Informative)

      Comment removed based on user account deletion
    • by flydpnkrtn ( 114575 ) on Wednesday June 10, 2009 @06:01PM (#28286161)
      ...and I found an article [networkworld.com] backing up Alfresco pretty well:

      "You can now stand up an Alfresco Labs server next to a SharePoint Server, and Office will not be able to tell the difference between the two," said John Newton, CTO of Alfresco. "But we are offering considerably more scale than SharePoint can deliver," he said.
    • Re: (Score:3, Informative)

      by Kadin2048 ( 468275 )

      I have a personal bias, but I think IBM's FileNet [ibm.com] would solve this quite neatly. I've done implementations of it that are pretty much exactly what the OP describes.

      Customer has a share that's gotten totally out of control, just stuffed full of files. They want to make them available across multiple offices, generally without getting into complex VPN crap, and also want to simplify management, add more security / compartmentalization, or integrate it with corporate SSI. All doable. Runs on your choice of

    • Re: (Score:3, Informative)

      by afidel ( 530433 )
      I'd suggest Livelink by OpenText. I know the Air Force uses it, since our Livelink guy worked on their systems before coming to work for us, and they obviously work with large volumes of aerospace-related documents! =) That probably means OpenText can find consultants who have already designed and worked with an aerospace taxonomy.
  • ...Setting up a standard naming convention and making sure bosses and managers enforce it. It won't help older files but will stop it getting worse!

    Then, if you can be bothered, you can start going through older files and updating them to the naming convention, or entering them into the document management system of your choice...
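    For the "going through older files" part, a rough sketch of how a rename plan could be generated in Python. The convention used here (lowercase, "&" spelled out, everything else collapsed to underscores) is only an assumed example, not something from the original question, and nothing is actually renamed: the output is a list to review first.

```python
import re
from pathlib import PurePosixPath


def normalize_name(filename):
    """Rewrite `filename` to a hypothetical house convention.

    Assumed rules: lowercase, '&' spelled out as 'and', and every other run
    of non-alphanumeric characters collapsed to a single underscore.
    The extension is lowercased but otherwise kept.
    """
    path = PurePosixPath(filename)
    stem = path.stem.lower().replace("&", " and ")
    stem = re.sub(r"[^a-z0-9]+", "_", stem).strip("_")
    return stem + path.suffix.lower()


def rename_plan(filenames):
    """Pair each name that breaks the convention with its proposed new name.

    Nothing is renamed here; the plan is meant to be reviewed by a person.
    Name collisions (two files mapping to the same new name) are not handled.
    """
    plan = []
    for name in filenames:
        new_name = normalize_name(name)
        if new_name != name:
            plan.append((name, new_name))
    return plan


for old, new in rename_plan(["Wing-Spar & Rib DRAWINGS.PDF", "budget_2009.xls"]):
    print(f"{old!r} -> {new!r}")
```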

    • Hire a temp employee for $10/hr to go through and rename everything, or do any other clerical grunt work.
      • Only problem is that this is an aerospace company, they might get lucky finding somebody that's capable and willing to work for peanuts, but I wouldn't count on it. Realistically they may require somebody with technical know how of what the files actually are in order to properly categorize them. A temp might be able to handle reformatting the file names based upon information in the name, but probably not much more than that.
        • Some training and the temp should be able to recognize different file types and where/how to classify them. The temp doesn't have to understand the information, just has to know that "Hey this looks like X, I was told X goes here" or "Oh, it says Y in spot B in the file, it must go to Place Z". The person with technical know how is going to have to look in the file for a clue as to where it should be filed; they can impart that small bit of information to a temp without having to teach them everythin
  • by vondo ( 303621 ) on Wednesday June 10, 2009 @05:35PM (#28285707)

    I happen to have written one:

    http://sourceforge.net/projects/docdb-v/ [sourceforge.net]

    could be what you are looking for. Of course, it'll take effort to catalog the documents.

  • SharePoint? (Score:5, Informative)

    by tekiegreg ( 674773 ) * <tekieg1-slashdot@yahoo.com> on Wednesday June 10, 2009 @05:36PM (#28285711) Homepage Journal
    I know I'm gonna get hit for blurting out the Microsoft Solution but...give SharePoint a shot...
    • Re:SharePoint? (Score:4, Insightful)

      by goffster ( 1104287 ) on Wednesday June 10, 2009 @05:39PM (#28285777)

      Why should you give sharepoint a chance? Even if it works well, it is proprietary and you are locked in.

      • Re:SharePoint? (Score:5, Interesting)

        by moosesocks ( 264553 ) on Wednesday June 10, 2009 @06:15PM (#28286305) Homepage

        Why should you give sharepoint a chance? Even if it works well, it is proprietary and you are locked in.

        No less proprietary than other similar systems. Getting files in/out of Sharepoint is a fairly trivial process, and the API is open enough to craft your own migration plan if you ever decide to move away from it, given that everything else is equally (or even more) proprietary than Sharepoint.

        MS Office might be proprietary, but is so widespread that it's a 'standard' in its own right -- Sharepoint integrates excellently with Office, and keeps your users happy.

        I'm typically not one to advocate the use of Microsoft products. However, Sharepoint worked just fine when I was using it, and is definitely a huge step up from any of the competing products at the same price-level.

        • Re:SharePoint? (Score:4, Informative)

          by preystalker ( 591393 ) on Wednesday June 10, 2009 @08:35PM (#28287615)
          I would recommend using Alfresco. Correctly configured and deployed, it lets you access files via Windows Explorer, WebDAV, a web interface, etc., and data is stored in a SQL database. Alfresco uses open standards and should be considered instead of SharePoint.
      • Comment removed based on user account deletion
        • Re: (Score:3, Informative)

          by glitch23 ( 557124 )

          That's true of any other solution in the same manner as SharePoint. But at least the data is stored in a SQL database and not something proprietary like the MS Exchange information store.

          Although the files are in a database you can change the view in the browser to be "explorer" and access the files using Windows File Sharing-like features (copy/paste) through the browser. This method of access though is an end-run around SharePoint's versioning system. New files can be uploaded in this manner as well. I presume that when you modify an existing document in this way that SharePoint just makes that version the newest one in the actual database. SharePoint is still no substitute for a properly

      • Re: (Score:3, Insightful)

        by Anonymous Coward

        What you say:

        Why should you give sharepoint a chance? Even if it works well, it is proprietary and you are locked in.

        What you mean:

        Regardless of how perfect a solution might be for you, if it doesn't conform to MY personal ideological viewpoint, it shouldn't be given a chance.

        God I hate people like you.

        --AC

    • Re: (Score:3, Informative)

      by moosesocks ( 264553 )

      Mod parent up. I helped create a tag-based document retrieval system for my former employer using SharePoint. It actually worked quite well.

      Use the right tool for the job. It's got a nice interface (that's also very familiar-looking to most users), scales well, and integrates well with MS Office, which (like it or not) is used by 99.99% of the corporate world. It also handles non-office files just fine.

      That's not to say that Unix-based solutions don't have their place. During the migration, I actually

    • SharePoint (Score:4, Informative)

      by PIPBoy3000 ( 619296 ) on Wednesday June 10, 2009 @06:12PM (#28286265)
      NASA is a big user of SharePoint, strangely enough. My coworkers run into their folks at conferences from time to time.

      I personally am ambivalent about SharePoint. Its roots are in document management, so it seems to do that relatively well. The publishing features are fairly nice as well. I don't think it's the best system for making web sites, but it may some day get there. Currently it feels like a 2.0 product (the magic rule is to never buy anything from Microsoft before 3.0).

      There are gotchas. SharePoint is tightly coupled with your clients. If everyone accessing the documents is using the latest version of Office, you'll be okay. If not, you'll run into problems. You may also need to throw a lot of hardware at SharePoint, as storing files inside of SQL has some built-in inefficiencies.

      Still, some of our users seem to love SharePoint, so it might be a good option for you.
    • by jockeys ( 753885 )
      +1.

      I'm no MS fanboy, but Sharepoint is great. I work for a large engineering company and we use it to organize blueprints, as well as pretty much all of our non-code documents. Even the most clueless HR-types can use it, and it's really not hard to set up.
    • by Itninja ( 937614 )
      Totally agree. SharePoint is one of the few recent products that Microsoft actually got right. Of course, they will probably find a way to screw it up down the road, but currently it rocks as an enterprise level document repository.
    • Re: (Score:3, Interesting)

      by pete-classic ( 75983 )

      How does Sharepoint address his problem? It uses the exact same folder/file paradigm that is failing in his existing solution.

      -Peter

    • Add another me too for Sharepoint.
      From the initial question, I'd guess that just WSS3 will get the job done and it's free. One important piece of this though is: plan your deployment. Figure out what type of site structure you plan to use before you implement anything. Sharepoint can be a wonderful tool, but if you just jump into it and let it grow organically you will end up hating it and yourself. And trying to monkey around with the site structure after the fact can be trouble. Oh and, get famil
    • Re: (Score:3, Informative)

      by nighty5 ( 615965 )
      We use SharePoint in a large enterprise. Although it's pretty good at mashing together websites, unfortunately it's really poor at search. I think Search 4.0 may improve the situation, but it's nowhere near Yahoo, Google or other search technology. Technology doesn't solve all problems; I'd say this company needs to focus on strengthening business processes and implementing some user awareness programs.
  • Cygnet ECM might work for you.

  • Documentum (Score:2, Interesting)

    by trondwn ( 1574085 )
    Use EMC's document solution, where you have all documents in a central database with metadata that can describe the content. They can also be accessed through a cached server from the different sites.
  • If you need to use just plain documents, store them in one big directory and update the meta information.
    Let people move links onto their systems and organize the links how they like, but don't let them move the documents.

    Think iTunes for documents. I loathe that example, since I set up this sort of thing long before iTunes came around.

    If you plan on collaborative use of your documents, get something like this:
    Jive.com
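
    The "one big directory plus links" idea can be sketched with symbolic links in Python. This is only an illustration under assumptions: the store and view directory names are invented, and it relies on a filesystem where symlinks are permitted (on Windows clients you would more likely use shortcuts or a tool that manages the links for you).

```python
import os
import tempfile


def link_document(store_dir, doc_name, view_dir, link_name=None):
    """Create a symbolic link in `view_dir` pointing at a document in `store_dir`.

    The document itself never moves; users only create and rearrange links.
    """
    target = os.path.join(store_dir, doc_name)
    if not os.path.isfile(target):
        raise FileNotFoundError(target)
    os.makedirs(view_dir, exist_ok=True)
    link_path = os.path.join(view_dir, link_name or doc_name)
    os.symlink(target, link_path)
    return link_path


# Demo in a throwaway directory; all names below are hypothetical.
with tempfile.TemporaryDirectory() as root:
    store = os.path.join(root, "store")
    os.makedirs(store)
    with open(os.path.join(store, "doc-000123.txt"), "w") as f:
        f.write("injector test plan")

    # One user's personal view: their own folder layout, their own link name.
    link = link_document(store, "doc-000123.txt",
                         os.path.join(root, "alice", "tiger"), "test-plan.txt")
    with open(link) as f:
        print(f.read())
```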

  • by Wrexs0ul ( 515885 ) <mmeier AT racknine DOT com> on Wednesday June 10, 2009 @05:39PM (#28285771) Homepage

    Most print companies like Xerox have their own proprietary Document management [wikipedia.org] tools you can buy, and a bunch of CRM and ERP solutions (like OpenERP - it's free AND Open Source) provide some good simple document searching and indexing tools.

    Really it comes down to how complex you want searching to be? Are there specific keys in the document you could index by? Do you require the full-text search capabilities of a Google search appliance?

    A really good solution I've come across for some clients in Edmonton is called MetalTrace [traceapps.com] by Trace Applications. Don't let the name fool you about the specificity: software like this can scan, index, and even read barcodes on all sorts of documents, then let people search for them via the web. Their "killer-app" has multiple user-defined document types with multiple search fields, which, combined with some back-filing (digital and scanning), really saved the day.

    Do your research though on "Document management" and see what product best fits your needs. It's a really well established field so reinventing the wheel is a little masochistic... not that there's anything wrong with that. ;)

    -Matt

    • by MyDixieWrecked ( 548719 ) on Wednesday June 10, 2009 @06:47PM (#28286623) Homepage Journal

      Most print companies like Xerox have their own proprietary Document management [wikipedia.org] tools you can buy

      Document management software is great, but when you have enormous numbers of documents (100s of thousands like in the summary), it becomes necessary to have a content management system in place. Something that's intelligent enough to break the documents up into pieces and allow searches, but something more robust than full-text search.

      We've been using this software called MarkLogic Server (http://marklogic.com). It's an XML database and has a content processing framework for document ingestion. So, basically, assuming that documents are structured similarly, they can be converted into XML so they can be queried with custom weights being applied to content in different portions of the document. The software has built-in Word support so it'll automatically convert .doc files with proper formatting as well as the ability to add custom handlers for other formats including plaintext.

      We're currently managing a couple million documents and generating dynamic documents on the fly for some processes. Since on-the-fly documents may take time to generate, we have a system in place that saves the result in the database which can also be queried at a later date. It's all really cool.

      Of course, there's a bit of a learning curve to writing your own software for it since it uses XQuery, but it's not much harder to learn than SQL, and so far, it seems to be far more powerful.

      Disclaimer: I'm not a shill nor am I being paid in any way by MarkLogic... I'm just seriously blown away by what their technology has enabled us to do.

  • Knowledge Tree (Score:3, Informative)

    by crackervoodoo ( 663384 ) on Wednesday June 10, 2009 @05:39PM (#28285773) Homepage
    http://www.knowledgetree.com/ [knowledgetree.com] If you're looking for a no-cost (read as no license fee) option then Knowledge Tree Community Edition is a decent Document Management tool. We've been using it for a couple of years.
  • I used an old version a while ago and it was pretty good then. Does versioning and other things.

    http://www.knowledgetree.com/ [knowledgetree.com]

  • by Anonymous Coward

    While this may be an odd suggestion, here's two things:
    1) Get yourself a damn good document or content management system. Get it set up on the baddest machines you can afford. Overshoot the capability you need, so that you have room to grow.
    2) Get a librarian to look at the kinds of documents you create, and develop a system to catalog documents while maintaining reasonable standards for file names. As the super simplest system, maybe document names that indicate (at a minimum) what project or what overhead

  • Sure, with any number of ECM solutions. At the simplest end many of them simply enforce naming conventions; at the more robust end, they support many different file types for viewing, indexing, etc. and can also provide rich metadata on a document-by-document basis. Some of them have been named in the comments, including but certainly not limited to SharePoint 2007, Cygnet, Documentum, Open Text, FileNet, etc. Any system worth looking at has a web-based interface, at least for searching, and many of them o
  • Switch to Apple... (Score:4, Informative)

    by Tibor the Hun ( 143056 ) on Wednesday June 10, 2009 @05:46PM (#28285911)

    I only partly jest, I know such a thing is damn near impossible to actually do, but in our Mac shop, such things are trivial. With one click of the mouse we enable spotlight searching on our Leopard AFP server and bam... all the clients have almost instantaneous search access to their docs.

  • I'm gonna say nothing beats a proper folder structure and naming convention. I'd also recommend using svn. Also spend some time to develop some macros to assist in the creation/saving/retrieval of said documents from the repository. Maybe create some standard templates too... just my 2cents!
  • WebDav (Score:4, Informative)

    by SplashMyBandit ( 1543257 ) on Wednesday June 10, 2009 @05:48PM (#28285949)
    There are a few options:
    • For relatively unstructured data without versioning you could serve them over HTTP with WebDAV (Apache) and use your existing HTTP security mechanisms. You wouldn't believe how relieved I've often been when I can get my (secured) resources from home-base while located at a client's site.
    • My outfit uses KnowledgeTree for versioned stuff (http://www.knowledgetree.com/)
    • Or you could embrace your dark-side and use Microsoft SharePoint (plus, with all the Microsoft bugs you'd have a job for life until your employer goes bust). If you are a friend to your company you won't do this, plus your outfit has engineers and the good ones can spot trash solutions.

    If your users are naming their files with strange characters in them (assuming it's not due to Samba) then they will just have to live with it; you won't have time to sort out all the weird names that (mostly MS-Word) users give to their files. The primary objective should be to give your users access to the files. Making the directory listing pretty ought to be a secondary concern.

  • Mindoka (http://www.mindoka.com) has a document management product that is designed to solve the problem that you have.

  • FileNet (Score:5, Interesting)

    by Ohio Calvinist ( 895750 ) on Wednesday June 10, 2009 @05:50PM (#28285985)
    I worked at a place that used FileNet [filenet.com], which is now an IBM product, to do this sort of thing. We had millions of scanned documents in the system. I wasn't personally very impressed with it: whenever anything "bad" happened, you had to call IBM because finding support online was impossible, and even then the support wasn't very good. It was also a very picky system, though it seemed to handle the load well. If you go with it, I strongly encourage running it on UNIX/Oracle, because it screamed "poorly ported" when we used it on Windows/MSSQL. It has an API for integration, but it is also poorly documented and would take some time to integrate into your existing business systems.

    This is more of a rant at this point, but it is a stop-gap solution that allows people to continue to use outdated business processes storing important data in image formats or in documents scattered about with minimal indexing/search capabilities, rather than analyzable "data" that can lead to "information." I always take the position that if the goal is something on paper, or the goal is to store something that "was" on paper, it is time to rethink the business process to see if we can automate it, or store/present the data electronically in the first place. The old school fights against it, but no one has ever been able to say it wasn't more efficient in the end and enabled IT to say "yes we can" when the next great idea came along versus "here is a stack of papers, figure out $trend."
  • by Vroom_Vroom ( 29347 ) on Wednesday June 10, 2009 @05:50PM (#28285987)

    Hire a document manager / clerk person who will create order. Your engineers won't.

  • Can't really suggest a good document management program but I can tell you one to avoid. We use Livelink at my place of work and its indexing and search capabilities are horrible (some would say non-existent). For example every document added to Livelink gets a document number assigned to it. One would expect to be able to retrieve that document by using the same document number but if you enter it into the search bar Livelink returns no results found. Huh? Not to mention some odd UI behaviours like when

  • What kind of documents are they? If they're mostly text and you want versioning, the only drawback to subversion is getting people to learn the tools, but that might be too much.

    If they're archival/static documents, an institutional repository could work. Something like DSpace isn't that hard to deploy and will provide basic archival and search features.

    The middle ground between those two solutions is probably what you want, though. Everyone I work with uses SharePoint for that, and I hate recommending prop

  • Laserfiche (Score:2, Informative)

    by wguy00 ( 985922 )
    Laserfiche (or LF) is just what this is for. It is DOD, DOJ certified and crap, and is used by all branches of the military and several other areas of the government as their document management system. With several different software offerings, just about any situation can be taken care of. Its features include the ability to search based on document name, template information, or OCR'd text (which the software also takes care of). With add-on features such as Quick Fields, it may be able to automatic
  • As may have been pointed out, organizing the files is really the best way. Develop a strict schema for naming conventions as well as a hierarchical directory structure for maintaining and organizing. Something like:

    /projectname/projectpart/data (contains the final draft of any document)
    /projectname/projectpart/working (contains files that people are modifying so that they can be merged/checked in to the data dir)
    /projectname/projectpart/misc (contains misc. notes or files that need to be filed with th
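
    A layout like this is easy to check mechanically. Below is a small Python sketch of a validator; the rule that names may only use lowercase letters, digits, and underscores is my own assumption added for illustration, as are the example paths.

```python
import re

# Assumed rule: /<projectname>/<projectpart>/<data|working|misc>/<file>,
# with project and part names limited to lowercase letters, digits, underscores.
ALLOWED_AREAS = {"data", "working", "misc"}
NAME = re.compile(r"[a-z0-9_]+")


def check_path(path):
    """Return a list of problems with `path`; an empty list means it conforms."""
    parts = path.strip("/").split("/")
    if len(parts) < 4:
        return ["expected /projectname/projectpart/<area>/<file>"]
    project, part, area = parts[:3]
    problems = []
    if not NAME.fullmatch(project):
        problems.append(f"bad project name: {project!r}")
    if not NAME.fullmatch(part):
        problems.append(f"bad project part: {part!r}")
    if area not in ALLOWED_AREAS:
        problems.append(f"unknown area: {area!r}")
    return problems


for p in ["/tiger/ion_injector/data/spec.pdf",
          "/tiger/ion_injector/drafts/spec.pdf"]:
    print(p, check_path(p) or "ok")
```

    Run nightly over the share, a check like this gives you a list of offenders to chase, rather than relying on everyone remembering the schema.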
  • Who else read this and thought... working in a satellite office for an aerospace company would involve a lot of cool travel perks?

    -- Terry

  • by ak_hepcat ( 468765 ) <slashdotNO@SPAMakhepcat.com> on Wednesday June 10, 2009 @05:58PM (#28286115) Homepage Journal

    Odd that the next story has a great idea for document management right in the summary...

    Hadoop!

  • ...seems like a natural solution for your connectivity issues, or perhaps whatever the open source variety of Sharepoint is. You really do need to tackle the naming convention question though. You can have all the file indexing you want, but sometimes a nice, logical, clean file name will get you what you're after much faster than any kind of searching.

    It's going to be horrible, painful, thankless work that will put you on the shit list of just about every department manager and administrative assistant ("Y

  • When I worked for the state Attorney General's office as I.T. Director a request came into I.T. that immediately gave me an upset stomach. The request was for all documents on the server that contained the word "lead" as in the chemical element Pb. The issue was that the word lead and the element share the same spelling.

    I kicked in and wrote an app that generated a web list on the fly and had clickable links so the documents could be examined and then marked as part of discovery.

    I also brought in three
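
    A rough Python sketch of that kind of on-the-fly list, for the curious. It is not the app described above; it assumes the document text has already been extracted, and the base URL and file names are placeholders. Note that a whole-word match still returns both the metal and the verb, which is exactly why a person has to click through and review each hit.

```python
import html
import re
from urllib.parse import quote


def find_matches(documents, word):
    """Return sorted names of documents containing `word` as a whole word.

    Matching ignores case. `documents` maps a relative path to extracted text.
    """
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return sorted(name for name, text in documents.items() if pattern.search(text))


def render_list(names, base_url):
    """Build an HTML list with one clickable link per matching document.

    `base_url` is assumed to be a trusted value such as an intranet address.
    """
    items = [
        f'<li><a href="{base_url}/{quote(name)}">{html.escape(name)}</a></li>'
        for name in names
    ]
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


# Hypothetical sample data.
docs = {
    "a/paint report.txt": "Lead levels in paint exceeded the limit",
    "b.txt": "A misleading headline",
    "c.txt": "She will lead the team",
}
hits = find_matches(docs, "lead")
print(render_list(hits, "http://intranet.example/docs"))
```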
  • Hire a real librarian, it's what they do.
    On the plus side, you also get to hire a librarian. nudge, nudge, wink, wink, say no more.
  • Alfresco of course! (Score:3, Interesting)

    by thule ( 9041 ) on Wednesday June 10, 2009 @06:14PM (#28286297) Homepage

    It can scale extremely well. It is the backend to Adobe's acrobat.com website! So you know it can handle millions of documents if you need it to. Sharepoint requires MS SQL Server for searching documents. With Alfresco, that feature is built in.

    Sharepoint is teaming software and not really designed for large document repositories. Alfresco has a teaming interface (Alfresco Share) and a more generic document repository interface.

    Alfresco can expose the repository via FTP, SMB, WebDAV, and a web client interface.

  • WIKI (Score:3, Interesting)

    by unum15 ( 695402 ) on Wednesday June 10, 2009 @06:15PM (#28286313) Homepage
    Maybe not the best solution for this particular job, but man am I glad we started using Dokuwiki for all our scattered documents.
  • by mrmeval ( 662166 ) <jcmeval@NoSPAM.yahoo.com> on Wednesday June 10, 2009 @06:16PM (#28286315) Journal

    http://en.wikipedia.org/wiki/Document_management_system [wikipedia.org]

    For that level of documentation you need to have a staff and get it properly indexed. You need a high-level librarian. This would be someone with a master's degree at minimum in library science and at least a bachelor's in information technology. They will not come cheap and they are a long-term investment. The software is available, but it is not trivial. Hiring a large number of people to recategorize and tag all the documents for the length of time that takes is also an expense, but worth it. Once it's all in place, maintaining it gets much easier.

    I've seen a system developed for Raytheon. They took all the old compartmentalized data Hughes had and put every scrap of paper through a scanner. It was exceptionally well done. This would display electronic files and would have the location of hard copy. Classified documents were in some cases indexed but were hard copy only afaik. There were some documents that were hard copy only, those were usually ones with an NDA or other restriction on making electronic copies. It had every thing mentioned wrt versioning and such. Documents spanned decades with hundreds of revisions and you could pull up and view any revision. Depending on how recent and what type of document you could view a change log. Older scanned ones did not have that unless they'd been important enough to reenter as modern documents which meant OCR or manually transcribed. Some schematics were reentered into the system in a modern format. The effort was worth it. Having that data is the only way some devices or parts could be made or repaired.


    • I'm surprised that there were quite a few programs not mentioned on the DMS Wikipedia page. People might consider them to be more repository software than DMS (or RMS), but some other ones to mention that would be useful for managing already existing documents:

      And if you're looking for librarians with an IT background, in the libraries they're called "Systems Librarians". You might also check out the oss4lib [oss4lib.org] and code4lib [code4lib.org] communities.

  • by DerekLyons ( 302214 ) <fairwater@@@gmail...com> on Wednesday June 10, 2009 @06:21PM (#28286357) Homepage

    It's called an index or a bibliography. There exists a profession known as 'librarian' specifically trained in the creation of such and in the management of large numbers of documents.

  • We went through this for both document management and web front end for access. We looked through Sharepoint, Alfresco, Oracle UCM, Reddot and a few others. We dropped most due to cost, functionality, and ease of use for non-developers to do page work. Sharepoint was dropped due to cost in an internet setting (CALs), no non-developer front end for page layout (they couldn't use HTML), and because it stores everything in the database. From prior experience this made backup/restore difficult as it keeps the IP of
  • Google?
    http://www.google.com.au/enterprise/mini/index.html [google.com.au]

    Seriously, if you can't be bothered collecting/maintaining the metadata that more structured solutions require, then just let Google index the lot. It'll work just as well (or not) as it does on the Internet. Although it's not free, it seems reasonably priced. It could be a quick answer to your problem.

  • Take a look at network-based WAN acceleration products that will significantly reduce the overhead of SMB/CIFS traffic. This will make it easier to index, cache frequently used documents locally and improve your WAN utilization company wide. It will even cache directory lookups and they will "feel" instant to the end user.

    A good example is Cisco WAAS; a cool video showing how it works is here: http://www.cisco.com/cdc_content_elements/flash/ans/index.html [cisco.com]

    See here for data sheets and specs: http://www [cisco.com]

  • We're using a Win3.1 app called LaserFiche on XP with > 250,000 documents and it's lightning fast, works with TIFF files and PDF and probably more. Includes file and folder permissions.

  • Check out Thunderstone. [thunderstone.com] It's what they do, and they do it very well.
  • Documentum, docushare, livelink, sharepoint. I've heard of documentum installs with 100m+ docs. It's quite good, but expensive.
  • by sexconker ( 1179573 ) on Wednesday June 10, 2009 @06:52PM (#28286705)

    It's becoming quite a mess, sometimes quite slow, and there is really no naming or numbering convention in place for the files and directories. We end up with mixed casing, all uppercase, all lowercase, dashes and ampersands in the file names, and there are literally hundreds of directories to sort through before you can find the document you are looking for.

    Slow. Upgrade your network and VPN. You know that VPN layer is just killing your performance.

    No naming or numbering convention. Get one.

    Mixed casing. Learn How to Properly Case Folders (and documents).

    Dashes and ampersands. Are they a problem? Aesthetically unpleasant? I personally restrict punctuation in a filesystem to dashes, periods, and parentheses (unless the punctuation is a replicable part of the name of the file/folder).

    Examples:
    01 - The First Track (vocal)
    02 - $lashhvertisements Attack!
    03 - Where Have All the A.C.'s Gone

    Develop your own method that works and be obsessed about it to the point where you would reburn a disc if one of the filenames was "01-Name" instead of "01 - Name".

    Hundreds of directories.
    Each file should have its own folder.
    "That's insane!" you say. Start out with this mentality. If there is no reason at all to separate two files (they are part of the same thing) then place them in one folder, and make sure the folder is named all-encompassingly. Repeat for all files. If you get into an AB, BC, but not ABC situation, the solution is to have A and B and C, with A and C linking to B with your choice of shortcut/link/symlink/etc.
    Do this until all files are in folders. Then repeat with folders.

    There is NO substitute for organization and getting people on the same page. Develop some conventions. Task people to fix as they go. Check up to make sure people accessing documents are fixing as they go, and doing so according to convention. Once people are used to the convention, and once things are relatively organized, they won't ever need to search again. They'll instantly know where 99% of things are, and will be able to dig around and find anything else within seconds.

    The main problem you face is getting organized after already being unorganized. It isn't easy, but at least you're not dealing with millions of paper documents.
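To make "fix as they go" checkable, a convention like the one above can be turned into a small audit script. This is a sketch under stated assumptions: it allows only ASCII letters, digits, spaces, dashes, periods, and parentheses, and it does not implement the exception for punctuation that is genuinely part of a name. The function names `disallowed_chars` and `audit_tree` are hypothetical.

```python
import re
from pathlib import Path

# Allowed: ASCII letters, digits, space, period, parentheses, dash.
# Assumption: non-ASCII letters are flagged too; widen the class if you need them.
ALLOWED = re.compile(r"[A-Za-z0-9 .()\-]")


def disallowed_chars(name):
    """Return the sorted list of characters in name that break the convention."""
    return sorted({ch for ch in name if not ALLOWED.fullmatch(ch)})


def audit_tree(root):
    """Map each offending file or folder (relative to root) to its bad characters."""
    report = {}
    for path in sorted(Path(root).rglob("*")):
        bad = disallowed_chars(path.name)
        if bad:
            report[path.relative_to(root).as_posix()] = bad
    return report
```

A name such as "01 - The First Track (vocal)" passes, while "R&D_notes.doc" is reported with `&` and `_`. Running the audit periodically gives a list of what still needs renaming without forcing a one-time cleanup.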

    • By "replicable part of the name of the file/folder" I mean in regards to illegal characters in the filesystem/os. Windows claims these are ><\/:|*^?" for example (dunno if it's Windows, NTFS, NTFS+FAT+Whatever else windows needs to support).

      I didn't intend to do an example with /. references when I started. I wanted something showing the dollar sign, and then stuff with periods and a quote mark and a question mark (dropped). First stuff that came to mind. Had I planned it, or previewed my post, th

  • Obviously (Score:3, Funny)

    by kitsunewarlock ( 971818 ) on Wednesday June 10, 2009 @07:28PM (#28287067) Journal
    Obviously throw them on the desktop. Once it fills, throw them into a New Folder. Once your desktop fills with Folders, throw those in My Documents. Repeat until your computer crashes.

