Best Way to Build a Searchable Document Index? 216

Posted by ScuttleMonkey
from the build-a-better-boss-trap dept.
Blinocac writes "I am organizing the IT documentation for the agency I work for, and we would like to make a searchable document index that would render results based on meta tags placed in the documents, which include everything from Word files, HTML, Excel, and Access to PDFs." What methods or tools have others seen that work? Anything to avoid?
This discussion has been archived. No new comments can be posted.
  • by Anonymous Coward on Monday October 01, 2007 @05:38PM (#20816465)
    You're the one gettin' paid, you figure it out.
  • Lucene (Score:5, Informative)

    by v_1_r_u_5 (462399) on Monday October 01, 2007 @05:39PM (#20816485)
    Check out Apache's free Lucene engine, found at []. Lucene is a powerful indexing engine that handles all kinds of docs, and you can easily mod it to handle whatever it doesn't. It also allows custom scoring and a very powerful query language.
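    For a flavor of what an indexing engine like Lucene does under the hood, here is a toy inverted index with tf-idf-style scoring, in Python. This is purely illustrative and is not Lucene's actual API; the class and method names are made up.

```python
# Toy full-text index: tokenize documents into an inverted index
# (term -> {doc_id: term frequency}), then score queries so that rare
# terms weigh more. Real engines add compression, phrase queries, etc.
from collections import defaultdict
import math
import re

class TinyIndex:
    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: tf}
        self.doc_count = 0

    def add(self, doc_id, text):
        self.doc_count += 1
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            self.postings[term][doc_id] = self.postings[term].get(doc_id, 0) + 1

    def search(self, query):
        scores = defaultdict(float)
        for term in re.findall(r"[a-z0-9]+", query.lower()):
            docs = self.postings.get(term, {})
            if not docs:
                continue
            idf = math.log(self.doc_count / len(docs))  # rare terms score higher
            for doc_id, tf in docs.items():
                scores[doc_id] += tf * idf
        return sorted(scores, key=scores.get, reverse=True)

idx = TinyIndex()
idx.add("a.doc", "backup policy for file servers")
idx.add("b.doc", "holiday policy")
idx.add("c.doc", "server room access")
print(idx.search("backup policy"))  # a.doc ranks first
```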
    • Re:Lucene (Score:5, Informative)

      by Anonymous Coward on Monday October 01, 2007 @05:43PM (#20816545)
      Yes. It's hard to beat Lucene if you don't mind working at the API level. If you want a ready-built web crawler, check out Nutch, which is based on Lucene.
    • Re:Lucene (Score:4, Informative)

      by BoberFett (127537) on Monday October 01, 2007 @05:46PM (#20816583)
      I haven't used Lucene, but as for commercial software I've used dtSearch and ISYS and they are both excellent full text search engines. Both have web interfaces as well as desktop programs, and SDKs are available for custom applications. They scale to a massive number of documents per index, in a large variety of formats and are very fast. They have additional features like returning documents of all types in HTML so no reader is required on the front end other than a browser so legacy formats are easier to access.
      • by jafac (1449) on Monday October 01, 2007 @08:33PM (#20818005) Homepage
        About 10 years ago, I used a product called Folio, which was the same product Novell used for their "Novell Support Encyclopedia" - they had a great set of robust tools, including support filters for a wide enough variety of formats (and tools to write your own), and your data would all compile down to what they called an "Infobase" - which was a single indexed file containing full text, markup, graphics, and index - readable in a free downloadable reader with a pretty decent (for 1996) search engine, that did boolean, stem, and proximity searches.

        It was very convenient to be able to give to either a reseller, or high-end customer, this single file containing reasonably up-to-date support information on our product that they could search on. It was probably the #1 thing our tech support department spent money on that reduced call volume. Then we got bought, and the new tech support director, an IBM guy, replaced everything with Lotus Notes (and other headcount-increasing, empire-building tools).

        The Folio company was bought up and I think the technology is now used by the company that does Lexis-Nexis.

        For now, my current employer is using Lucene.
        • by BoberFett (127537) on Monday October 01, 2007 @09:03PM (#20818235)
          Yep, Folio Views is another one, but I have no personal experience with it. I wrote two commercial software packages (CD and internet legal research systems) using ISYS and dtSearch, and I'm familiar with Folio because Lexis was a competitor of the company I worked for. I'm not sure what its complete capabilities are, but I have to imagine it's comparable.
    • Re:Lucene (Score:5, Informative)

      by knewter (62953) <> on Monday October 01, 2007 @05:54PM (#20816659) Homepage
      Lucene's good. If you haven't yet, have a look at Ferret, a port of Lucene for Ruby. It's reported to be faster than Lucene. I've used it in 20+ projects now as my built-in fulltext index of choice, and it's pretty great. You can easily define your own ranking algorithms if you'd like. You can find more information on Ferret here: []

      I've got a prototype of the system described in the OP that we did while quoting a fairly large project. It's really easy to have an 'after upload' action that'll push the document through strings (or some other third party app that can operate similarly, given the document type) and throw the strings into a field that gets indexed as well. That pretty much handles everything you may need.

      Obviously I'd also allow someone to specify keywords when uploading a document, but if this engine's going to just be thrown against an existing cache of documents, strings-only's the way to go.
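      The strings-based fallback is easy to sketch. The function below mimics what strings(1) extracts - runs of printable characters in an otherwise binary file - in pure Python; a real deployment would more likely shell out to strings itself or use a format-aware parser.

```python
# Pull printable runs out of an arbitrary binary document so there is
# always *something* to feed the full-text index, whatever the format.
import re

def extract_strings(data: bytes, min_len: int = 4) -> list[str]:
    # runs of printable ASCII at least min_len characters long, like strings(1)
    return [m.decode("ascii")
            for m in re.findall(rb"[\x20-\x7e]{%d,}" % min_len, data)]

blob = b"\x00\x01PK\x03\x04quarterly report\xff\xfedraft 2007\x00"
print(extract_strings(blob))  # ['quarterly report', 'draft 2007']
```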
    • by Dadoo (899435) on Monday October 01, 2007 @07:04PM (#20817293) Journal
      I checked out the documentation on Lucene, and it appears to be designed for searching the documents on a few web servers.

      In my situation, I've got a couple dozen servers (mostly Windows, but some Linux), and maybe 8TB of data, mostly in Word documents, Excel spreadsheets, etc. Can Lucene (or Nutch) scale up to something like that? I'd also like it to search Windows network drives. Is that possible?
      • Re:Lucene (Score:3, Informative)

        by dilute (74234) on Monday October 01, 2007 @09:00PM (#20818197)
        Lucene is strictly an indexing engine. It wants to index text. It can index metadata as well as full text. Your surrounding application gets the files to index from wherever (local hard drive, database BLOBS, remote Windows shares or what have you). We don't care if the files are Word, PDF, Powerpoint, HTML, or whatever. A parser (many free ones available) extracts the text. We also don't care what web server you are using - using the index to identify and retrieve files is a totally separate process. Lucene indexes the text stream one-by-one and stores the results in a very efficiently organized index. It has been ported to a bunch of languages, including dotNET. I haven't tried it on terabytes of data but it rips through gigabytes very fast. Assuming all 8 terabytes don't change between runs the scale should be no problem.
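        A sketch of that "surrounding application" layer, with stub parsers standing in for the real extractors (Apache POI for Office files, a PDF text extractor, etc.) that you would plug in per format:

```python
# Stub version of the text-extraction layer: pick a parser by file type,
# return plain text for the indexer. The parsers here are invented stubs;
# real ones would be registered in PARSERS the same way.
import re
from pathlib import Path

def parse_html(data: bytes) -> str:
    # crude tag-stripper, fine for a sketch; use a real HTML parser in practice
    return re.sub(r"<[^>]+>", " ", data.decode("utf-8", "replace"))

def parse_text(data: bytes) -> str:
    return data.decode("utf-8", "replace")

PARSERS = {".html": parse_html, ".htm": parse_html, ".txt": parse_text}

def extract_text(path: Path) -> str:
    parser = PARSERS.get(path.suffix.lower())
    if parser is None:
        raise ValueError(f"no parser registered for {path.suffix!r}")
    return parser(path.read_bytes())
```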

        If you needed to run, say 100 indexing engines in parallel and merge the indexes, you'd have to research that. Somebody's probably done it.
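        The merge itself is mostly a union of posting lists. A sketch, assuming each partition produced a term -> {doc_id: tf} mapping and doc ids are unique per partition:

```python
# Merge partial inverted indexes from parallel indexing runs into one.
from collections import defaultdict

def merge_indexes(partials):
    merged = defaultdict(dict)
    for partial in partials:
        for term, postings in partial.items():
            # doc ids are unique per partition, so update() cannot collide
            merged[term].update(postings)
    return merged

shard1 = {"backup": {"a.doc": 2}, "policy": {"a.doc": 1}}
shard2 = {"policy": {"b.doc": 3}}
print(merge_indexes([shard1, shard2])["policy"])  # {'a.doc': 1, 'b.doc': 3}
```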
        • Re:Lucene (Score:5, Informative)

          by passthecrackpipe (598773) * <> on Tuesday October 02, 2007 @02:39AM (#20820209)
          "If you needed to run, say 100 indexing engines in parallel and merge the indexes, you'd have to research that. Somebody's probably done it."

          Yes, they have. In my previous job we had to search 2 terabytes of plain text data (HTML) really fast. The company chose Autonomy, and many developers spent many months trying to make it work, consuming insane amounts of hardware resources for mediocre results. One lone (and brilliant) dev whipped up a Lucene proof of concept in a weekend, and it was faster (full index in a day), required fewer resources (a single HP DL585 with 16GB RAM and 4x dual-core AMD, as opposed to 10 of the same), had a smaller index (about a fifth of Autonomy's), returned results faster, produced a more accurate result set, and was significantly more flexible in making it do what we actually needed it to do.

          Lucene wins hands down.
      • Re:Lucene (Score:3, Interesting)

        by bwt (68845) on Tuesday October 02, 2007 @01:03AM (#20819733) Homepage
        Dealing with large data sets isn't really technologically challenging. You can grow to an arbitrarily large data set size simply by partitioning: it works for google. It may be expensive to stand up a bunch of servers, but I don't think it's really that hard.

        What is more complicated is to deal with large numbers of concurrent requests. Then you need clustering. There are big sites that do both partitioning and clustering simultaneously with Lucene. I seem to recall reading that Technorati uses Lucene on a cluster of 40 servers. With 8TB of data you are going to have a big one time computational problem to build the index. Lucene can recombine indexes, so you could distribute index creation to a lot of servers.
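        The query side of partitioning is a simple fan-out-and-merge. A sketch with stand-in shard search functions that return (score, doc_id) pairs:

```python
# Fan a query out to N shards, then merge their scored hits into one
# global top-k list. The shard functions here are invented stand-ins.
import heapq

def search_sharded(shards, query, k=10):
    hits = []
    for shard_search in shards:
        hits.extend(shard_search(query))
    return heapq.nlargest(k, hits)  # best scores first

shard_a = lambda q: [(0.9, "a.doc"), (0.2, "c.doc")]
shard_b = lambda q: [(0.5, "b.doc")]
print(search_sharded([shard_a, shard_b], "backup", k=2))
```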
    • Re:Lucene (Score:3, Informative)

      by Cato (8296) on Tuesday October 02, 2007 @01:37AM (#20819899)
      You might also like to investigate Plucene and KinoSearch, which are both Perl ports of the Lucene engine. It's also worth considering combining your search engine with Wikis where possible - then you can find documents by keywords and also navigate to a Wiki page providing context and related documents or Wiki pages. TWiki, which is the most popular open source enterprise Wiki engine, has plugins for both these engines, see []
    • Search software (Score:3, Informative)

      by j.leidner (642936) <> on Tuesday October 02, 2007 @07:34AM (#20821307) Homepage Journal
      Lucene - []

      Terrier - []

      Indri/Lemur - [] / []

      MG - []

  • Google (Score:3, Informative)

    by Anonymous Coward on Monday October 01, 2007 @05:40PM (#20816493)
    We have a Google appliance, but you can do it with regular Google, too. Just make sure you disable caching (with headers or by encrypting documents). Then place an IP or password restriction for non-Google crawlers (check IP, not user-agent). People will be able to search with the power of Google, but only people you allow in will be able to get the full documents.

    If you value your privacy, invest in a Google mini, though.
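    The poster's IP-plus-headers scheme can be sketched in a few lines of Python. The network ranges below are invented placeholders, not Google's actual crawler addresses:

```python
# Serve documents with caching disabled, and let requests through only
# from an allowlist of networks (the crawler's and your users'), checked
# by IP rather than User-Agent, which is trivially forged.
import ipaddress

ALLOWED_NETS = [ipaddress.ip_network("10.0.0.0/8"),     # internal users
                ipaddress.ip_network("192.0.2.0/24")]   # hypothetical crawler range

NO_CACHE_HEADERS = {"Cache-Control": "no-store, no-cache, private",
                    "Pragma": "no-cache"}

def allowed(client_ip: str) -> bool:
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in ALLOWED_NETS)

print(allowed("10.1.2.3"), allowed("203.0.113.5"))  # True False
```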
    • Re:Google (Score:5, Informative)

      by rta (559125) on Monday October 01, 2007 @05:49PM (#20816607)
      The previous place I worked, we had a Google Mini and it was better than anything we had come up with in-house.

      We even pointed it at the web-cvs server and Bugzilla, and it was great at searching those too.

      For seeing all the bugs still open against v2.2.1 or something like that, Bugzilla's own search was better, but for searching for "bugs about X" the Google Mini was great.

      It only cost something like $3k IIRC.

      Not exactly what you asked about, but you should definitely see if this wouldn't work for you instead.
      • Re:Google (Score:3, Interesting)

        by TooMuchToDo (882796) on Monday October 01, 2007 @06:20PM (#20816867)
        Seconded. I've done implementations (hosted and in-house) of both Google Minis as well as the full blown Enterprise appliances. They are amazing creatures. I would recommend the Mini to almost anyone, while the Enterprise costs a pretty penny.
        • Re:Google (Score:4, Informative)

          by shepmaster (319234) on Monday October 01, 2007 @08:57PM (#20818185) Homepage Journal
          The company I work for, Vivisimo [], makes an awesome search engine. Although I've never dealt with the Google box directly, I know that we have had customers get fed up with the Google box and replace it quite easily with our software. Click the first link to see a pretty Flash demo, or go to [] to try out a subset of the functionality for real. We specialize in "complex, heterogeneous search solutions", which exactly fits most intranet sites I've seen. Files are on SMB shares, local disks, Sharepoint, Lotus, Documentum, IMAP, Exchange, etc, etc, etc. We connect to all those sources and provide a unified interface. You can do really neat tricks with combining content across multiple repositories, such as metadata from a database added to files on SMB shares. We support Linux, Solaris, and Windows, all 32 and 64 bit. Although I may work here, it really is a great product, and I use it at home to crawl my email archives and various blogs, websites, forums, things that I use frequently but have sucky search.
    • Re:Google (Score:2, Insightful)

      by rgaginol (950787) on Monday October 01, 2007 @09:48PM (#20818601)
      I'd have to agree. I'm a Java developer, so if I was doing the solution I'm sure I could whip up something cool with Lucene or whatever. But in terms of long-term maintenance costs, why develop anything yourself if the problem is already solved? And on the point of cool: good IT systems aren't cool... they do a job and do it well. Maybe this is the first project where you find a "cool" solution is just not justifiable.

      I'm sure the Google appliance would let you put some quite extensive customizations on top of their API; at least, that's been my experience with other Google products/services (Google Web Toolkit or Google Maps). Still, I've also found that some of the "nice to have" APIs are kept out of reach: I was trying to put a listener on a custom tile image in a map application and suddenly came up against an "oh-boy-we'll-let-you-play-but-don't-dare-touch-that" barrier.

      I guess the only way forward is to scope your requirements well, and I'll bet half of the "must haves" aren't really that important. After that, research each of the possible solutions and the cost to implement/maintain them. If you've got programming expertise in house it may be tempting to use them as a no-brainer, but it's worth finding out their cost to implement a good solution (one with maintenance and documentation factored in... basically, whatever amount of time they say it will take to code, times 4).
  • Meta tags placed? (Score:4, Insightful)

    by harmonica (29841) on Monday October 01, 2007 @05:40PM (#20816497)
    Who places what types of meta tags in the documents? I don't understand the requirements.

    Generally, Lucene [] does a good job. It's easy to learn and performance was fine for me and my data (~ 2 GB of textual documents).
    • by Anonymous Coward on Monday October 01, 2007 @06:35PM (#20817029)
      Meta tags are worthless, generally, unless you have a librarian who ensures correctness.
      I've worked in electronic document management in 3 different businesses, and metadata entered by end users is worse than worthless - it is wrong. Searches that don't use full text for general documents are less than ideal.

      Just to prove that your question is missing critical data:
        - how many documents?
        - how large are the average and largest documents?
        - what format will be input? PDF, HTML, XLS, PPT, OO, C++, what?
        - what search tools do you use elsewhere?
        - any budget constraints?
        - did you look at general document management systems? Documentum, Docushare, Filenet, Sharepoint? If so, what didn't work with these systems?
        - Did you consider OSS solutions? htdig, e-swish, custom searching?
        - A buddy of mine wrote an article on "how to index anything" that was in the Linux Journal a few years ago. Google is your friend.

      AND if I didn't get this across yet - DON'T TRUST METADATA IN HIDDEN DOCUMENT FIELDS - bad metadata in MS Office files will completely destroy the usefulness of your searches.
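      If you do accept user-entered metadata anyway, part of the librarian's job can be automated by rejecting anything outside a controlled vocabulary at upload time, rather than trusting hidden document fields. A sketch with made-up tag names:

```python
# Validate user-supplied tags against a controlled vocabulary at upload
# time; anything unknown is rejected instead of silently polluting search.
CONTROLLED_TAGS = {"network", "backup", "security", "procedure", "policy"}

def validate_tags(tags):
    tags = {t.strip().lower() for t in tags}
    bad = tags - CONTROLLED_TAGS
    if bad:
        raise ValueError(f"unknown tags, fix before upload: {sorted(bad)}")
    return sorted(tags)

print(validate_tags(["Backup", "policy"]))  # ['backup', 'policy']
```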
      • by Keith_Beef (166050) on Tuesday October 02, 2007 @09:17AM (#20821857)

        Meta tags are worthless, generally, unless you have a librarian who ensures correctness.


        I've worked in electronic document management in 3 different businesses, and metadata entered by end users is worse than worthless - it is wrong. Searches that don't use full text for general documents are less than ideal.

        Unless you can pin responsibility for a document to a named person, you can't trust anything in the document. Not metadata, not content, not presentation.

        The meta tags in most of the documents I deal with are inserted by the applications, and only the content is human-drafted. Those meta tags contain information like creation date, modification date, application name, character encoding, etc. They are generally trustworthy.

        I'm also in the process of building a documentation system; it will be a set of documents in various formats, with an HTML interface, Tomcat server and Lucene to make it fully searchable.

        In a previous job, I did a similar thing with Apache and ht://dig on an old Dell I recycled. Document files could be uploaded by anybody with an FTP account on the server, and index files were automatically regenerated by a CRON task at 04h00 each day.
        I could have made a trigger to regenerate the index after each FTP upload session, but using CRON was easier and sufficiently frequent to be useful.
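        A middle ground between a daily cron job and a per-upload trigger is to run often but rebuild only when a document is newer than the index. A sketch, with the index path and the rebuild step itself (ht://dig, Lucene, ...) left as placeholders:

```python
# Rebuild-if-stale check: cheap enough to run from cron every few
# minutes, and it only pays the reindexing cost when something changed.
from pathlib import Path

def index_is_stale(doc_dir: Path, index_file: Path) -> bool:
    if not index_file.exists():
        return True  # never built
    index_mtime = index_file.stat().st_mtime
    return any(p.stat().st_mtime > index_mtime
               for p in doc_dir.rglob("*") if p.is_file())
```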

        This time around, the whole system of Tomcat webserver and Lucene search engine is bundled on a CD-ROM with the docs to run on any of the firm's laptops. Because I control the documents, I can build the index files and burn them to the CD-ROM before distribution.


    • by rainmayun (842754) on Monday October 01, 2007 @06:43PM (#20817123)

      I don't understand the requirements.

      I don't either, and that's because the submitter didn't give enough information. I'm working on a fairly large enterprise content management system for the feds (think 2.5 TB/month of new data), and I don't see any of the solution components we use mentioned in any thread yet. If I were being a responsible consultant, I'd want to know the answers to the following questions at minimum before making any recommendations:

      • What is the budget?
      • How many documents are we talking about? The answer for 10,000 is different than for 10,000,000.
      • Are you looking for off-the-shelf, or is software development + integration going to be involved?
      • Who is going to maintain the integrity of this data?

      Although I am as much a fan of open source as anybody, I don't think the offerings in this area are anywhere near the maturity of commercial offerings. But some of those offerings cost a pretty penny, so it might be worthwhile to hire a developer or two for a few weeks or months to get what you want.

  • by wsanders (114993) on Monday October 01, 2007 @05:40PM (#20816507) Homepage
    Because if you have to spend more than an hour on this kind of project nowadays, you're wasting your time.

    The inexpensive Google appliances don't have very fine-grained access control, though. I am involved in several semi-failed projects of this nature in my organization, both new and legacy, and my Google Desktop outperforms all of them.
    • by sootman (158191) on Monday October 01, 2007 @06:13PM (#20816807) Homepage Journal
      Since Google Desktop works by running its own little webserver, can you install Google Desktop on a server and access it by visiting http://server.ip.address:4664/ [server.ip.address] ? (I'm at work and my only Windows box has its firewall options set by group policy.)
    • by NoNeeeed (157503) <> on Monday October 01, 2007 @06:29PM (#20816967) Homepage
      Yep, a Google appliance (or equivalent, there are others on the market such as X1) is the way to go.

      I set up a Google Mini for indexing an internal wiki, our bug tracking system, and some other systems, and it is very straight-forward.

      I know the original question mentioned meta-data, but you have to ask yourself if the meta-data is going to be maintained well enough that the search index will be valid. Going the Google Appliance route is so much simpler. It takes a bit of tweaking to set up the search restrictions, but once up and running, it works flawlessly. Most importantly, it doesn't require everyone to make sure that all their document meta-data is perfect.

      Google appliance pricing is really quite cheap when you compare it to the time cost of setting up a meta-data driven system.

      Meta-data is one of those things that seems like a really good idea, but like all plans, doesn't tend to survive contact with the enemy, which in this case is the user.

      • by anthonys_junk (1110393) <> on Monday October 01, 2007 @11:25PM (#20819187)

        Meta-data is one of those things that seems like a really good idea, but like all plans, doesn't tend to survive contact with the enemy, which in this case is the user.

        Like software development, the quality of the outcome is implementation dependent.

        We run six thesauri, plus a number of different controlled lists for our users to input metadata. We don't publish any documents that don't have meta attached, and we perform random quality audits. Our users have been trained in fundamentals of classification and also in the payoff for getting it right.

        We use metadata to structure our navigation for some sites, we depend on it for search for our internal documents. Our metadata implementation works incredibly well for us; clearly and consistently outperforming plain text indexing.

    • by PIPBoy3000 (619296) on Monday October 01, 2007 @06:33PM (#20817013)
      We've been quite happy with our Google Search Appliance.

      The two exceptions are the way it handles secured documents (on our mostly-Windows network, that meant authenticating twice or doing complicated Kerberos stuff), and hardware (we've had two boxes fail with drive issues in the last year).

      Still, when it comes to search results and speed, it's been very good. I'm also a fan of Google Desktop, but that's a completely different story and more difficult to centrally manage.
    • by shepmaster (319234) on Monday October 01, 2007 @09:13PM (#20818297) Homepage Journal
      If security is important, you should take a look at Vivisimo Velocity []. We offer access control down to the content level. Have a single result document with a title, a snippet, and a piece of sensitive info like money amounts? You can make it so that only select users (from LDAP, or AD, or whatever system) can see the sensitive information, but everyone can see the title and snippet. We also respect the security restrictions from our content sources, such as Windows fileshares or Documentum. And I've heard from customers that we have easily replaced their Google Appliances, and we can be installed on any commodity Linux/Solaris/Windows box.
  • Swish-E (Score:3, Informative)

    by ccandreva (409807) <> on Monday October 01, 2007 @05:41PM (#20816517) Homepage
  • Beagle, Spotlight? (Score:3, Insightful)

    by Lord Satri (609291) <> on Monday October 01, 2007 @05:43PM (#20816541) Homepage Journal
    Is this something that would suit your needs: Beagle for Linux [], Spotlight for OSX []? I haven't tried Beagle (I don't have root access on my Debian installation at work), but Spotlight is probably my most cherished feature in OSX... it's so useful.
  • Since you're indexing non-text data, you'll need a search engine that has plenty of document filters. We use Oracle Text to do something similar to this, but it's not for the faint of heart. The nice thing about Oracle Text is it includes filters for pretty much any document you'd want to index (PDF, Word, Excel, etc). Of course, Oracle Text query syntax needs an awful lot of lipstick to be made to look like Google query syntax. WMMV.
  • by mind21_98 (18647) on Monday October 01, 2007 @05:50PM (#20816611) Homepage Journal
    If you're using Office 2007, you can probably hack something together really quickly to pull the meta tags from the files and put them in a database. Not sure about the other formats you need, though--and support from Google, for instance, would probably be beneficial for your company anyway. Hope that helps!
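    Since Office 2007 files are just zip archives, the metadata pull really is a quick hack: the core document properties live in docProps/core.xml, and the standard library can read them without Office installed. A sketch (field handling simplified):

```python
# Office 2007 (.docx/.xlsx/.pptx) files are zips; the Dublin Core
# metadata (title, creator, subject, ...) sits in docProps/core.xml.
import zipfile
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"  # Dublin Core namespace

def office_metadata(path_or_file):
    with zipfile.ZipFile(path_or_file) as z:
        root = ET.fromstring(z.read("docProps/core.xml"))
    # keep only the Dublin Core children: title, creator, subject, ...
    return {child.tag.removeprefix(DC): (child.text or "")
            for child in root if child.tag.startswith(DC)}
```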
  • by Gwala (309968) <(adam) (at) (> on Monday October 01, 2007 @05:50PM (#20816619) Homepage
    A googlebox. Indexes file shares and internal websites and makes them searchable. Can be a little pricey though.
  • what to avoid (Score:2, Insightful)

    by Anonymous Coward on Monday October 01, 2007 @05:52PM (#20816631)
    You should avoid any system that relies on individual employees putting in these meta-tags. It won't work; they either won't do it, or will do it wrong (spelling errors, inventing their own tags on the fly, and so on.) And then you'll catch hell when they can't find one of those documents they mislabled. Trust me.
  • Most easy solution (Score:4, Informative)

    by PermanentMarker (916408) on Monday October 01, 2007 @05:54PM (#20816653) Homepage Journal
    It will cost you some bucks, but just buy MS SharePoint Portal Server and leave the indexing to SharePoint.
    You're not even really required to use added tags... (as most people will put in poor tags).

    But if you like, you can add tags even with SharePoint.
    • by DigitalSorceress (156609) on Monday October 01, 2007 @07:23PM (#20817455)
      Actually, if you are an MS shop and have Microsoft Server 2003, SharePoint Services 3.0 is free (as opposed to SharePoint Portal Server, now renamed, I believe, to Microsoft Office SharePoint Server, which does indeed cost a packet).

      I do a lot of LAMP development, and I'm not the strongest fan of Microsoft for a lot of things, but if you have an MS desktop and MS Office environment, SharePoint Services really is quite decent for INTRANET applications. Especially for collaboration. You can set up work flows for check-out/check-in, and it integrates really nicely with some of the more recent MS Office releases. If you connect it to a real MS SQL server on the back end (as opposed to the express edition that it defaults to), you can have full text indexing even with the free SharePoint Services version. The only need for the full blown Portal/MOSS version is if you think you are going to have a large number of SharePoint sites, and want to simplify cross-connecting and management. (At least as far as I can recall)

      I'm not saying SharePoint is the way to go, but I'd at least read up on it and consider it IF you have a lot of MS Office stuff that you plan on indexing/sharing.

      I'd strongly advise avoiding it if you plan to do Internet-based stuff though... at least until you get a good enough understanding of the security issues involved that you feel that you really know what you're doing.

      Just my $0.02 worth.
      • by smutt (35184) on Tuesday October 02, 2007 @03:23AM (#20820371)
        One thing I've learned from my interaction with SharePoint: it doesn't scale. Period, end of story. It works great when you have 1000 users or fewer accessing it concurrently, but try going higher and you're in for a world of hurt.
      • by perky (106880) on Tuesday October 02, 2007 @05:02AM (#20820727)
        Yep - totally agree. If you are a MS shop, and have Server 2003 infrastructure (which you will), then WSS 3 is a free download. I have had great success with it for replacing a network file share for document sharing, and for replacing "Tracking" spreadsheets with Sharepoint lists.

        It has basic search built in, and there is an upgrade path. IFilters are built in for the MS formats, and there are third-party ones for non-MS formats.

        One thing - Office Sharepoint designer essentially doesn't work. Just don't bother with it.
    • by JacobO (41895) on Monday October 01, 2007 @08:15PM (#20817847)

      I've never seen a SharePoint site where the search worked well at all (particularly in the document libraries.) You might think by its observable behavior that it is simply offering up documents at random instead of searching their contents.
    • by guruevi (827432) <{eb.ebucgnikoms} {ta} {ive}> on Monday October 01, 2007 @11:21PM (#20819151) Homepage
      I have just finished 6 months of SharePoint integration in an MS shop. From a development and security standpoint: STAY AWAY FROM IT. 2003 seems to be an Alpha version, 2007 is still full of bugs (better than Beta but still) and it's also very, very slow (it's based on .NET). To work fairly well for 100 users it requires 1 SQL Server (MSDE will not work for non-development purposes), 2 frontends and 1 loadbalancer/firewall based on Microsoft Forefront just for security purposes (since the built-in SharePoint requires a lot of stuff to be opened to all users). It's also expensive... 10k/server + CALs for every user, and that was in a big MS shop.

      Next to that, there are a lot of caveats and as soon as you start modifying the layout (even though it's just the HTML) in SharePoint Designer, Microsoft Support will not help you (as if they could in the first place). Simple things like whitespaces in-between table structures can make your list workflows screw up (yes there is an actual opening with Microsoft Support for that very issue). A lot of things will not work either and require a nasty hack or workaround (like attachment upload on modified forms) and are known with Microsoft and have been known for the last 9 months.
    • by big ben bullet (771673) on Tuesday October 02, 2007 @03:51AM (#20820489) Homepage
      (Sorry, I don't like sharepoint at all!)
      Or build your own app on top of a Microsoft SQL Server 2005 with Full Text Search
      Technet Article []

      No need for tags... let the document itself be the tags...

    The free-as-in-beer Express ed. (with advanced blahblah...) however is limited to 2Gb, so you will at the least need a Standard ed.
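      For prototyping the same "let the document itself be the tags" idea without a SQL Server license at all, SQLite's FTS5 module (bundled with Python's sqlite3 in most builds) gives roughly the CONTAINS-style behavior; the query syntax differs from SQL Server's:

```python
# Full-text search over document bodies with SQLite FTS5: no tags,
# the indexed text is the metadata. MATCH queries are ranked by bm25.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE docs USING fts5(name, body)")
db.executemany("INSERT INTO docs VALUES (?, ?)", [
    ("backup.doc", "nightly backup rotation for the file servers"),
    ("holiday.doc", "holiday request procedure"),
])
rows = db.execute(
    "SELECT name FROM docs WHERE docs MATCH ? ORDER BY rank", ("backup",)
).fetchall()
print(rows)  # [('backup.doc',)]
```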

  • by omgamibig (977963) on Monday October 01, 2007 @05:57PM (#20816685)
    Let google do the indexing!
  • Check out Alfresco! (Score:3, Interesting)

    by thule (9041) on Monday October 01, 2007 @06:01PM (#20816713) Homepage
    I posted this before on Slashdot. I discovered a while ago a cool system called Alfresco []. There are free (as in liberty) and commercial versions. It acts like an SMB (like Samba), FTP, and WebDAV server, so you don't have to use the web interface to get files into the system. Users can map it as a network drive. The web interface allows users to set metatags, retrieve previous versions of a file, and most importantly, search the documents in the system.

    Alfresco also has plugins for Microsoft Office so you can manage the repository from Word, etc. They are also working on OpenOffice integration.

    Don't use SAMBA for .doc and .pdf's, use Alfresco.

    I am not affiliated with Alfresco, just a happy user.
    • by dsgfh (517540) on Monday October 01, 2007 @10:13PM (#20818785)
      The parent here speaks the truth.
      I notice a lot of the comments in the thread are coming from developers or sysadmins who want to solve everything with libraries or command line tools. But it really sounds to me like you need a reasonable document management system (and of course being a slashdot reader you want it for free).

      Again, I'm not affiliated with Alfresco, but did quite a bit of research into open source DMS's that would run in a java environment for a couple of recent projects. I found Alfresco to be well architected, easily extendible if I needed it to be and importantly simple to deploy & get running. It will integrate with your LDAP for access and while it's marketed as an Enterprise CMS, is quite capable of doing DMS.

      It uses Lucene under the hood, and while it has a web UI, isn't focused on indexing web sites. You can record meta-data against docs, and it's also capable of extracting some metadata from common MS Office formats. I've no doubt this could be extended if there were other doc properties you wanted access to (although I've never tried myself).

      Most importantly is that the project & community is quite healthy with very active forums. You can get paid support (the Enterprise License) if you so desire, but I expect you'd probably start with the GPL version just to get yourself up & running.

      I wouldn't recommend the SMB interface for the time being as there's currently an outstanding bug with it that causes it to die after a while (the rest of the app continues to run happily), however the FTP interface is great for an initial import of docs. Also take a look at the rules capability for classifying/sorting docs as they're imported.

      It does the basics like check-in/check-out & workflow, and can be backed by your DB of choice as it uses Hibernate for ORM. Searching can be done against keywords or meta-data (classifications, dates, authors etc) & in my experience is more powerful/useful than SharePoint's keyword-based searching. If you're really keen you can use the Java or Web Service APIs for integrating into other solutions.

      Again, I'm not affiliated, but clearly I'm a fan-boy :) I'd recommend installing the base Alfresco Community release (no need for Web Content Management, Records Management etc to start with), loading some docs into it via the FTP interface (or upload a zip via the web interface which it will explode out for you) & giving it a test run. I've got people asking me every couple of days when we're rolling it out internally (just got to finish the sharepoint comparison first).
  • Also see Xapian (Score:5, Informative)

    by dmeranda (120061) on Monday October 01, 2007 @06:12PM (#20816793) Homepage
    I'd suggest you should consider a full-text search engine. First start here: []

    If you're not afraid to do a little reading and potentially coding a custom front end, you may want to look at two of the big open source engines: Lucene and Xapian.

    Lucene is quite popular now, and is an Apache Java project. It's a good choice if you're a Java shop.

    Xapian seems to be based on a little more solid and modern information retrieval theory and is incredibly scalable and fast. It's written in C++, with SWIG-based front ends to many languages. It might not have as polished of a front end or as fancy of a website as Lucene, but I believe it's a better choice if you have really really huge data sets or want to venture outside the Java universe.

    There are also many other wholly-contained indexers, most of which are based on web indexing (they have spiders, query forms, etc. all bundled together), like ht://Dig, mnogosearch, and so forth. They are good, especially if you want more of a drop-in solution rather than a raw indexing engine, and if you're indexing web sites (and not complex entities like databases, etc).
    • by mikeboone (163222) on Monday October 01, 2007 @08:20PM (#20817903) Homepage Journal
      I've had good luck for several years using Xapian integrated with PHP. It did take some work to integrate but it's fast and flexible.
    • Re:Also see Xapian (Score:2, Informative)

      by risk one (1013529) on Monday October 01, 2007 @08:32PM (#20817999)

      I agree that Lucene is a great choice specifically for Java shops, though it does have ports for pretty much all major languages. The Java implementation is the 'mothership' but you can use Lucene with PHP, Python, .NET or C++ or whatever.

      Secondly, I'd like to point out Lemur []. It's an indexing engine similar to Lucene, but geared much more toward the language modeling approach of information retrieval. All IR approaches will use either a vector space based approach or a language model approach. Lucene does vector space very well, but it's difficult to get it to do language model based retrieval (although extensions are available), Lemur can do both. Lemur also has Indri, a search engine written on top of Lemur, which can parse html, PDF and xml. And like Lucene, Lemur has multiple language ports of the API.

      A final point I would like to make is that IR is a very actively researched field. If you're going to do your own coding (specifically the retrieval model), I suggest you buy a book and get reading. Most of the basic problems (and there are many) have been figured out and it'll save you a lot of trouble if you just read up on how to update an index or find spelling suggestions, instead of figuring it out for yourself. It's possible to index your documents with Lucene and run searches on them in half an afternoon, but it takes some basic knowledge to get it right, and make the app useful. (Look at wikipedia's search for an example of what you get when you don't follow through, and stop after it seems to work ok).
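    As a toy illustration of the vector-space approach mentioned above (a sketch of the general idea, nothing to do with Lucene's or Lemur's actual internals), scoring documents against a query with TF-IDF weights and cosine similarity fits in a few lines of Python:

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def build_vectors(docs):
    """One TF-IDF vector per document; idf = log(N / document frequency)."""
    tokenized = [tokenize(d) for d in docs]
    df = Counter(t for doc in tokenized for t in set(doc))
    n = len(docs)
    vectors = [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
               for doc in tokenized]
    return df, vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(query, docs):
    """Return document indices ranked by similarity to the query."""
    df, vectors = build_vectors(docs)
    n = len(docs)
    qvec = {t: tf * math.log(n / df[t])
            for t, tf in Counter(tokenize(query)).items() if t in df}
    return sorted(range(n), key=lambda i: cosine(qvec, vectors[i]),
                  reverse=True)
```

    Here `search("backup server", docs)` ranks a document about the backup server above one that merely mentions a server. Real engines layer stemming, stop words, and positional data on top of this; that is the kind of already-solved detail the parent suggests reading up on.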

  • by Nefarious Wheel (628136) * on Monday October 01, 2007 @06:14PM (#20816815) Journal
    It depends on the size of your document base, and how you're going to store it -- if you're using something industrial-strength like Documentum or Hummingbird then the Google Mini won't index it, you have to go up a notch and use the yellow box solutions. And if you're using Lotus Notes, you'll need a third party crawler such as C-Search. Google Desktop can be bent into some solutions, and it's free, but for many users you're better off having a separate server do the indexing. Google bills on the number of documents you need to keep in the index at once, and they throw in a bit of tinware to support that on a 2 year contract.

    Disclaimer: I flog Google search solutions at work, so I'm way biased.

    • by shepmaster (319234) on Monday October 01, 2007 @09:21PM (#20818367) Homepage Journal
      If you find yourself connecting to many different repositories, you should check out Vivisimo Velocity []. We have some awesome connectors to the most popular repositories:
      • Offers connectors to repositories such as file servers, MS SharePoint, Documentum, Lotus Notes, Exchange, Legato etc.
      • Crawls information in databases such as SQL Server, MySQL, PostgreSQL, DB2, Oracle and Sybase.
      • Supports many file formats including Microsoft Word, Excel, PowerPoint, WordPerfect, PDF, Postscript, email archives, XML, HTML, RTF and others.
      • Rich media including images, audio and video.
  • use a Wiki instead (Score:5, Insightful)

    by poopie (35416) on Monday October 01, 2007 @06:20PM (#20816871) Journal
    Directories full of random documents in random formats of random version with varying degrees of completeness and accuracy tend to get less useful as an information source as time goes on. Docs get abandoned and continue to provide outdated information and dead links. Doc formats change and require converters to import. Doc maintainers leave the company.

    If you work somewhere where people are not trained to attach Office docs to every email, where people don't use Word to compose 10 bullet points, where people don't use a spreadsheet as a substitute for all sorts of CRM and business applications... a Wiki is actually a good solution.

    You can use something like MediaWiki or Twiki or... heck you can use a whole variety of content management systems.

    The key to success is to *EMPOWER* people to actually update information, and have a few people who are empowered to actually edit, rehash, sort, move, prune wiki pages and content. As the content improves, it will draw in more users and more content creators. Pretty soon, employees will *COMPLAIN* when someone sends out information and doesn't update the wiki.

    Some corporate cultures are not wiki-friendly. Some management chains *fear* the wiki. Some companies have whole webmaster groups who believe it is their job to delay the process of getting useful content onto the web by controlling it. If you're in one of those companies... start up your own wiki and beg for forgiveness later.
    • by LadyLucky (546115) on Monday October 01, 2007 @08:12PM (#20817805) Homepage
      We've been using Confluence, from Atlassian [] for our wiki, and it's pretty fully featured for a wiki.
    • by jbarr (2233) on Monday October 01, 2007 @10:27PM (#20818851) Homepage
      A Wiki can be great, but the downside is that it requires the user(s) to actually enter the data into the Wiki. How would you handle disparate business documents with a Wiki? For example, how would you manage hundreds of Word, Excel, PowerPoint and .PDF documents? Documents created by multiple people over a long time in many locations? Wouldn't it be easier for this scenario to use an indexing tool? I've used Google Desktop to manage these types of documents, and it's amazingly effective. Wikis are absolutely wonderful going forward but they're cumbersome with existing, varying documents--cumbersome at least until the data contained in the documents is incorporated into the Wiki. Also, check out []. It's a single-file "personal Wiki". It's essentially a self-updating HTML file built around JavaScript. It's truly innovative. (Of course, it's for capturing small amounts of data, not creating large, multi-user repositories.)
  • by athloi (1075845) on Monday October 01, 2007 @06:26PM (#20816929) Homepage Journal
    Always write their own homebrew search engines [].
  • by drhamad (868567) on Monday October 01, 2007 @06:29PM (#20816961)
    How about IBM FileNet? Or are you looking for something free? We use FileNet everywhere I've been.

    The downside to the suggestions like Google Appliances is that you're then storing this information on Google servers.... something that most companies find HIGHLY objectionable (security).
  • by vondo (303621) on Monday October 01, 2007 @06:31PM (#20816985)
    DocDB [] can interface to the search engines others are suggesting, but organizing your documents with decent meta-data in the first place (and not on a Wiki that is allowed to rot) is also important. That's what DocDB does.
  • by hrieke (126185) on Monday October 01, 2007 @06:33PM (#20817015) Homepage
    So far, everything I've read really doesn't sound like it's geared towards an enterprise level; mostly put a bunch of files out there in a folder somewhere and let a crawler index them. That's all good and fine until someone gets the idea to search for the payroll documents...

    At work, and granted it's a fair size HMO, we use Universal Content Management by Oracle, formerly known as Stellent's Content Management System.
    UCM allows for named accounts with control over access, plus a full audit log, full change log (plus revisions), and a centralized location for searching for document of any type (Word, Excel, Powerpoint, AutoCAD, MPG movies, TIFF, etc.).

    Supports work flows as well, a nice plus if something needs to go through a formal process, gives nice audit trails, and supports 3 different full text indexes (FAST, Verity, and database) of the stored content.

    This might not be for everyone, but it is a decent tool for large size companies to manage documents.

  • by JimDaGeek (983925) on Monday October 01, 2007 @06:34PM (#20817017)
    Seriously, spend a tiny bit of money on a Google Appliance and get excellent search. I tried to use MS stuff, like the built-in index server and it just wasn't good enough.

    We got a Google "appliance" and the damn thing just works, and works well. I don't work for Google, nor do I get paid if they make a sale. Just saying what worked great for us.
  • by ecloud (3022) on Monday October 01, 2007 @06:46PM (#20817151) Homepage Journal
    I once integrated Swish++ as a document search system for a MediaWiki installation, to handle uploaded documents. I liked the results so then I started using it to build an index on a large codebase so I could quickly find all usages of a particular symbol (in source files, libraries and executables too). The catch is you have to define how to translate each type of file into plain text so it can be indexed. There are plenty of tools available for Word docs, PDFs, nm for libraries, etc. Compared to some others I think Swish++ has the advantage of speed. I haven't tried Lucene but my feeling is I'd rather not use Java for that unless the whole system is in Java.

    HyperEstraier has an excellent reputation but I haven't tried it yet. It's harder to get going with it.

    Too bad Beagle is written in .net; sounded like a well-integrated solution otherwise...
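    The translation step the parent describes can be sketched as a filter table in Python. The commands shown (antiword, pdftotext) are real tools but must be installed separately; the table itself is a hypothetical example, not Swish++'s actual configuration:

```python
import pathlib
import subprocess

# Map each file extension to a command that writes plain text to stdout.
# Extensions with no entry are assumed to already be plain text.
FILTERS = {
    ".doc": lambda p: ["antiword", p],        # Word -> text
    ".pdf": lambda p: ["pdftotext", p, "-"],  # PDF -> text on stdout
}

def extract_text(path):
    """Return the plain-text rendering of a document for indexing."""
    path = str(path)
    build_cmd = FILTERS.get(pathlib.Path(path).suffix.lower())
    if build_cmd is None:  # no converter registered: read the file as-is
        return pathlib.Path(path).read_text(errors="replace")
    result = subprocess.run(build_cmd(path), capture_output=True, text=True)
    return result.stdout
```

    The indexer then only ever sees plain text, whatever the source format was.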
  • by Evets (629327) * on Monday October 01, 2007 @06:59PM (#20817265) Homepage Journal
    Microsoft Index... oh nevermind. I can't get it out with a straight face.

    Lucene is the way to go. There are APIs for Perl for dealing with Lucene data sets and for many other languages as well. Nutch is a good place to start getting to know the power of Lucene - you can get a nutch crawler interface up and running quickly and you can browse through some of the source files to get an understanding of how to bring in various file formats - Office documents, PDFs, etc.

    The Google Search boxes are decent, but with any commercial solution you end up paying fees for the amount of documents in your index. They open source the code, presumably because of OSS components (maybe even Lucene) but the documentation they publish is laughable.
  • by MarkWatson (189759) on Monday October 01, 2007 @07:12PM (#20817371) Homepage
    There are two problems: getting plain text out of the documents, and then indexing the plain text.

    A good tool for getting plain text out of various versions of Word documents is the "antiword" command line utility.

    The Apache POI project (Java) can read and write several Microsoft Office formats.

    For indexing: I like Lucene (Java), Ferret (Ruby+C), and Montezuma (Common Lisp).

    I have mostly been using Ruby the last few years for text processing. Here is a short article I wrote using the Java Lucene library using JRuby: []

    Here is another short snippet for reading documents in Ruby: []


    You might just want to use the entire Nutch stack: []. It collects documents, spiders the web, has plugins for many document types, etc. Good stuff!
  • by djpretzel (891427) on Monday October 01, 2007 @07:23PM (#20817457) Homepage
    As a Documentum developer, especially in light of the recent 6.0 release, I'd be remiss not to recommend it for such a purpose. It's expensive, rather complex, and requires solid development talent to implement, but is almost infinitely configurable and customizable, and there are separate components (at cost, of course) that can add on all sorts of fun functionality like collaboration, digital asset management, etc. It has the ability to auto-tag documents based on configurable rules using Content Intelligence Services and supports extensible object hierarchies, workflows, lifecycles, taxonomies, web services, you name it. It's probably overkill for the user in question, and it's far from open source (although EMC is doing an admirable job at encouraging code exchange, and the new dev. environment is based on Eclipse), but it's pretty darn slick when you look at the ground it covers, functionally.
  • by gvc (167165) on Monday October 01, 2007 @07:52PM (#20817657)
    Wumpus Search. It is free, libre.
  • by bbdd (733681) on Monday October 01, 2007 @07:55PM (#20817693)
    well, this may not apply to you, as you do not mention the size/number of items to index.

    but, for small shops where there is no money to throw at this type of thing, try IBM OmniFind Yahoo! Edition. can't beat the price. []
  • by ivi (126837) on Monday October 01, 2007 @08:45PM (#20818093)
    Haven't you listened to the TED talk (cf: []) on Spaghetti Sauces, which refers to Moskowitz's idea that:

      There isn't a (one) best X, only best Xs

    eg, best spaghetti sauceS, etc.

    Some find one best, others find another best [for them]...

    Whatcha think?
  • by queenb**ch (446380) on Monday October 01, 2007 @08:52PM (#20818135) Homepage Journal
    The project that I worked on was also concerned with who was able to access the data. For that reason, we used a wiki-like format, converted everything into text, using a variety of conversion methods, and assigned access controls to it via an in-house web-based application.

    This allowed the full text to be searchable and provided a reference back to the original file. If the document was in a digital format, like a Word document, it was also stored in the database. If not, it referenced a physical file. The user could suggest modifications to the searchable entry if an error was found. The archive team would investigate the suggested correction and usually implement it.

    2 cents,

  • by LionKimbro (200000) on Monday October 01, 2007 @08:56PM (#20818175) Homepage
    I once spent 4 hours hacking together a symbol indexer for the 10,000+ CPP files in our source code repository. I wrote it in Python. It worked by brute force: "For every directory, for every .h or .cpp or .c file, crack it open, and, line by line, look for all instances of this regex..."

    It's a little slow-- 10 seconds to look up all instances of a symbol. And it takes ~3 hours to refresh the full index.

    But it saves an enormous amount of time, makes impossible tasks possible, and I have used it every day since. It's been about 8 months now, and it's been absolutely wonderful.

    It would have taken far longer and many more resources to begin to figure out how to hook in Lucene, or some other heavy-duty package.
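    For the curious, a brute-force indexer along those lines fits in a page of Python. This is a hypothetical reconstruction of the idea, not the parent's actual script:

```python
import os
import re
from collections import defaultdict

def build_symbol_index(root, exts=(".h", ".c", ".cpp")):
    """Walk a source tree and map each identifier to the (file, line)
    pairs where it appears.  Brute force: every file, every line."""
    index = defaultdict(list)
    ident = re.compile(r"[A-Za-z_]\w*")
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(exts):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, errors="replace") as f:
                    for lineno, line in enumerate(f, 1):
                        # record each symbol once per line
                        for sym in set(ident.findall(line)):
                            index[sym].append((path, lineno))
            except OSError:
                pass  # unreadable file; skip it
    return index
```

    A lookup is then just `index["mySymbol"]`, and refreshing the index means rebuilding it from scratch, which matches the multi-hour refresh the parent describes.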
  • To answer the original question, I'm currently using Microsoft's Indexing Server. I bought SearchSimon for $25 or so - basically a disappointment, but still a worthwhile purchase for me, since I found it useful to look at the included source code. I found an article which I can't find right now about rolling your own page based on the Index Server engine.

    But this raises my question - how do you enforce metadata tagging? I can't even find a decent Windows-based metadata browser/editor for after-the-fact bulk tagging. I'm aware that certain commercial projects hook the save function on MS Office and replace your save dialog box with one that has required metadata fields. Is there a way to get that functionality without paying thousands and thousands of dollars? Or, if one is going to pay that kind of money, what project should one go with?

    PS - Thanks to everyone mentioning Nutch/Xapian, I will definitely check them out tomorrow.
  • by jshriverWVU (810740) on Monday October 01, 2007 @09:28PM (#20818439)
    Personally I find you can search all text data with grep, and if you need to search binary data just pass it through strings and grep that :)

    1/2 joking 1/2 serious.

    grep -R "foobar" /

  • by mattr (78516) <.moc.ydobelet. .ta. .rttam.> on Monday October 01, 2007 @10:39PM (#20818927) Homepage Journal
    Someone mentioned htDig. I would just like to mention that I had much success with it. It's a C++ based crawler and search engine with customizable templates. I built a mod_perl wrapper to search 60 databases, a total of 1GB, and got response times of about 0.1 seconds per query, including fuzzy searching. Actually it has so many things to tweak it is crazy. However this was a while ago and you may want to check the others mentioned here.
  • by PeteyG (203921) on Tuesday October 02, 2007 @12:32AM (#20819575) Homepage Journal
    Consider a Google search box []

    Or, you know, you could add meta data to each and every single page you want to index... I'd personally rather stab my eyes out with a ballpoint pen.
  • by LauraW (662560) on Tuesday October 02, 2007 @01:59AM (#20820021)

    Other people have said most of this already, but the ones I've seen used the most are:

    • Swish-E: easy to set up, easy to script from Perl or whatever, but not very good results. I used this on a web site I ran about 7 or 8 years ago, and it worked pretty well, especially considering the state of the art at the time. I can't remember the licensing terms.
    • Lucene: Parses lots of document formats, easy to program in Java, works pretty well. Apache license
    • Google Mini: Easy to set up, good indexing, limited repository size. Closed source.
    • Google Search Appliance: Expensive, fast, big repository size. I don't know anything about administering it. Closed source.

    Disclaimer: I work for Google but not on search. I definitely think you should use only as big a hammer as you need for the job, or maybe a little bigger to allow for growth. I've even seen Lucene used on small, internal, Java projects at Google where our full-blown web search infrastructure would have been the equivalent of a thermonuclear flyswatter.

  • by Tim C (15259) on Tuesday October 02, 2007 @03:13AM (#20820329)
    You have some documents you want to index. How many? How many users? What advanced features do you need (if any)? What's your budget? What technologies and languages are you comfortable with? What OS does it need to run on?

    Where I work, we've used htdig, Verity K2 and Google search appliance, and have looked at (and heard good things about) Lucene.

    Which one I'd recommend would depend entirely on the answers to my questions.
  • by chris_sawtell (10326) on Tuesday October 02, 2007 @06:06AM (#20820999) Journal []

    To get more info, including a peep into the book, do a Google search on "Managing Gigabytes"

    otoh for something cheap and cheerful there is htdig. []

    It's remarkably good for indexing an intranet.
  • by iq1 (787035) on Tuesday October 02, 2007 @06:34AM (#20821121)
    i know this will give me flames, but:
    you might try Oracle Text (also part of Oracle XE).

    Supports 140 document formats, has a lot of options and works via SQL.
    Can build indexes for documents stored in DB or in the file system.
    You can even join the search terms from the document with the database records where metadata might be stored by your application.
    I found that very helpful in similar projects. And it's free.
  • by rclandrum (870572) on Tuesday October 02, 2007 @09:25AM (#20821945) Homepage
    As someone who has made a 30-year career out of designing and building document management systems, I would urge you to look first at how you expect your users to find the documents they need. The expected results of a search should guide your choice of indexing methods - and the popular "meta tagging" method isn't always the best. There are shortcomings with all methods.

    Full-text indexing allows users to search the entire contents of documents, but the results are imprecise and voluminous and not terribly useful in most cases (think web search engines here). Yes, you can find all documents that contain the word "patent", but you get a lot of old references to patent leather shoes in addition to what you were probably after. So, with full-text search you get it all, but force the user to subsearch for what they really want.
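    The trade-off is easy to see with a toy inverted index (a Python sketch of the general technique, not any particular product):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map every word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.lower().split():
            index[word.strip(".,")].add(doc_id)
    return index

docs = [
    "our patent application was filed in march",
    "she wore patent leather shoes to the interview",
]
index = build_inverted_index(docs)

# A full-text query for "patent" returns both documents; the user
# must subsearch to separate the two senses of the word.
hits = sorted(index["patent"])
```

    Both documents come back for "patent", exactly the imprecision described above.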

    Using meta-tags gives the appearance of pre-classifying documents and having the users do it themselves means you don't have to have a dedicated person to assign the tags. The disadvantage is that everybody makes up their own tags or if you have a standard set, you have to rely on people being diligent about applying them. And tag popularity can easily change over time. For example, if you want to find docs that refer to "removable media", this might have garnered a "floppy" tag 15 years ago and "CD" or "DVD" today. You are therefore almost guaranteed of missing some documents using this method.

    Database indexing means that you list all your docs in a database, perhaps by title, author, date, or other fields that your users would find useful for searching. The advantage is that every document is indexed the same way, searching is really fast, and the results are usually relevant if your schema is meaningful. The disadvantages are that indexing the docs takes work on input and users need to know how to search to get the best results.
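    A catalog of this kind can be prototyped with nothing but Python's standard library; the schema below is a made-up example to show the shape of the approach, not a recommendation:

```python
import sqlite3

def create_catalog(db_path=":memory:"):
    """Create a minimal document catalog: one row of metadata per file."""
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS documents (
                       path   TEXT PRIMARY KEY,
                       title  TEXT,
                       author TEXT,
                       added  TEXT)""")
    return con

def add_document(con, path, title, author, added):
    con.execute("INSERT OR REPLACE INTO documents VALUES (?, ?, ?, ?)",
                (path, title, author, added))
    con.commit()

def find_by_author(con, author):
    """Every document is indexed the same way, so queries stay predictable."""
    rows = con.execute(
        "SELECT path, title FROM documents WHERE author = ? ORDER BY added",
        (author,))
    return rows.fetchall()
```

    The work, as noted above, is in populating the rows on input; the payoff is that a query by author, date, or title is fast and returns exactly what the schema promises.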

    Finally, you could organize the docs by simple name and folder. This works fine for the desktop and users usually can identify the category that points them to the folder they want. The disadvantage is that this only works well for limited document sets. Once you start getting hundreds of categories and thousands and thousands of documents, things become too hard to find.

    So - understand your users search requirements and the size of your expected database. Only then can you make an informed decision about how to create and index the repository.

"Facts are stupid things." -- President Ronald Reagan (a blooper from his speeach at the '88 GOP convention)