News for nerds, stuff that matters

Best Way to Build a Searchable Document Index?

Posted by ScuttleMonkey on Mon Oct 01, 2007 04:36 PM
from the build-a-better-boss-trap dept.
Blinocac writes "I am organizing the IT documentation for the agency I work for, and we would like to make a searchable document index that would render results based on meta tags placed in the documents, which include everything from Word files and HTML to Excel, Access, and PDFs." What methods or tools have others seen that work? Anything to avoid?

This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • Gee I don't know.. (Score:4, Funny)

    by Anonymous Coward on Monday October 01, @04:38PM (#20816465)
    You're the one gettin' paid, you figure it out.
  • Lucene (Score:5, Informative)

    by v_1_r_u_5 (462399) on Monday October 01, @04:39PM (#20816485)
    Check out Apache's free Lucene engine, found at lucene.apache.org/ [apache.org]. Lucene is a powerful indexing engine that handles all kinds of docs, and you can easily mod it to handle whatever it doesn't. It also allows custom scoring and a very powerful query language.
    • Re:Lucene (Score:5, Informative)

      by Anonymous Coward on Monday October 01, @04:43PM (#20816545)
      Yes. It's hard to beat Lucene if you don't mind working at the API level. If you want a ready-built web crawler, check out Nutch, which is based on Lucene.
      [ Parent ]
      • Re:Lucene by mynickwastaken (Score:1) Tuesday October 02, @02:19AM
      • Re:Lucene by M. Baranczak (Score:2) Tuesday October 02, @09:53AM
      • 1 reply beneath your current threshold.
    • Re:Lucene (Score:4, Informative)

      by BoberFett (127537) on Monday October 01, @04:46PM (#20816583)
      I haven't used Lucene, but as for commercial software, I've used dtSearch and ISYS, and they are both excellent full text search engines. Both have web interfaces as well as desktop programs, and SDKs are available for custom applications. They scale to a massive number of documents per index in a large variety of formats, and they are very fast. They also have additional features, such as returning documents of all types as HTML, so no reader other than a browser is required on the front end and legacy formats are easier to access.
      [ Parent ]
    • Re:Lucene (Score:5, Informative)

      Lucene's good. If you haven't yet, have a look at Ferret, a port of Lucene for Ruby. It's listed as faster than Lucene. I've used it in 20+ projects now as my built-in fulltext index of choice, and it's pretty great. You can easily define your own ranking algorithms if you'd like. You can find more information on Ferret here: http://ferret.davebalmain.com/trac/ [davebalmain.com]

      I've got a prototype of the system described in the OP that we did while quoting a fairly large project. It's really easy to have an 'after upload' action that'll push the document through strings (or some other third party app that can operate similarly, given the document type) and throw the strings into a field that gets indexed as well. That pretty much handles everything you may need.

      Obviously I'd also allow someone to specify keywords when uploading a document, but if this engine's going to just be thrown against an existing cache of documents, strings-only's the way to go.
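
A minimal sketch of the 'after upload' idea in the comment above, written in Python rather than the Ruby/Ferret stack the commenter used. It imitates the spirit of the Unix `strings` utility by pulling runs of printable ASCII out of an arbitrary file. The function names, the record layout, and the four-character minimum are assumptions made for illustration only; a real deployment would hand the resulting record to whatever indexer is in use.

```python
import re


def extract_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Return runs of at least `min_len` printable ASCII bytes found in `data`.

    This only approximates what the `strings` utility does; it is an
    illustration, not a reimplementation.
    """
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.decode("ascii") for match in pattern.findall(data)]


def build_index_record(filename: str, data: bytes, keywords=()) -> dict:
    """Hypothetical 'after upload' hook: build the fields to pass to an indexer."""
    return {
        "filename": filename,
        # Keywords supplied by the uploader, if any.
        "keywords": list(keywords),
        # Everything readable in the file, flattened into one searchable field.
        "content": " ".join(extract_strings(data)),
    }
```

Keeping user-supplied keywords and extracted text in separate fields lets the indexer weight them differently, or ignore the keywords entirely when indexing an existing pile of documents.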
      [ Parent ]
      • Re:Lucene by caseydk (Score:3) Monday October 01, @05:38PM
      • Re:Lucene by daemous (Score:1) Tuesday October 02, @09:30AM
    • Re:Lucene by Dadoo (Score:2) Monday October 01, @06:04PM
      • Re:Lucene by dilute (Score:3) Monday October 01, @08:00PM
        • Re:Lucene (Score:5, Informative)

          "If you needed to run, say 100 indexing engines in parallel and merge the indexes, you'd have to research that. Somebody's probably done it."

          Yes, they have. In my previous job we had to search 2 terabytes of plain text data (HTML) really fast. The company chose Autonomy, and many developers spent many months trying to make it work, consuming insane amounts of hardware resources for mediocre results, and still requiring . One lone (and brilliant) dev whipped up a Lucene proof of concept in a weekend. It was faster (full index in a day), required fewer resources (a single HP DL 585 with 16GB RAM and 4x dual-core AMD, as opposed to 10 of the same), had a smaller index (about a fifth of Autonomy's), returned results faster and more accurately, and was significantly more flexible in doing what we actually needed it to do.

          Lucene wins hands down
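
For readers wondering what "merge the indexes" means in the exchange above, here is a rough Python sketch. It is not how Lucene or Autonomy does it. It assumes each parallel indexer produces a plain dictionary mapping a term to the set of document IDs containing it, and that document IDs are unique across all shards. Both assumptions, and the function name, are illustrative.

```python
from collections import defaultdict


def merge_indexes(shards):
    """Merge per-shard inverted indexes into one.

    Each shard is assumed to be a dict of term -> iterable of document IDs.
    Document IDs are assumed to be globally unique across shards.
    """
    merged = defaultdict(set)
    for shard in shards:
        for term, doc_ids in shard.items():
            # Union the posting lists for the same term from every shard.
            merged[term] |= set(doc_ids)
    return dict(merged)
```

Because the merge is a per-term set union, shards can be built independently and combined in any order.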
          [ Parent ]
        • Re:Lucene by ubrgeek (Score:2) Tuesday October 02, @02:48PM
      • Re:Lucene by bwt (Score:3) Tuesday October 02, @12:03AM
        • Re:Lucene by pthisis (Score:2) Wednesday October 10, @04:47PM
    • Re:Lucene by Cato (Score:3) Tuesday October 02, @12:37AM
    • Search software by j.leidner (Score:3) Tuesday October 02, @06:34AM
    • Re:Only one real choice by pasamio (Score:2) Tuesday October 02, @03:03AM
    • 3 replies beneath your current threshold.
  • Google (Score:3, Informative)

    by Anonymous Coward on Monday October 01, @04:40PM (#20816493)
    We have a Google appliance, but you can do it with regular Google, too. Just make sure you disable caching (with headers or by encrypting documents). Then place an IP or password restriction for non-Google crawlers (check IP, not user-agent). People will be able to search with the power of Google, but only people you allow in will be able to get the full documents.

    If you value your privacy, invest in a Google mini, though.
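
The "check IP, not user-agent" advice above can be sketched in Python as a reverse DNS lookup followed by a forward lookup that must return the same address. This is only a sketch: the trusted host suffixes are an assumption based on Google's published guidance for verifying its crawler, the function name is hypothetical, and the default lookups handle IPv4 only. The lookup functions are injectable so the logic can be exercised without network access.

```python
import socket

# Assumption: host suffixes the crawler's reverse DNS name is expected to end with.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_crawler(ip, reverse_lookup=None, forward_lookup=None):
    """Return True only if `ip` reverse-resolves to a trusted host name
    and that host name resolves forward to the same `ip`."""
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse_lookup(ip).lower().rstrip(".")
        if not host.endswith(TRUSTED_SUFFIXES):
            return False
        # The forward lookup guards against a spoofed reverse DNS record.
        return ip in forward_lookup(host)
    except OSError:
        # DNS failures (socket.herror, socket.gaierror) count as "not verified".
        return False
```

A request whose User-Agent claims to be a crawler but whose address fails this check would be treated like any other anonymous visitor.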
    • Re:Google (Score:5, Informative)

      by rta (559125) on Monday October 01, @04:49PM (#20816607)
      At the previous place I worked, we had a Google Mini and it was better than anything we had come up with in-house.

      We even pointed it at the web-cvs server and bugzilla and it was great at searching those too.

      To see all the bugs still open against v 2.2.1 or something like that, bugzilla's own search was better. But for searching for "bugs about X" the Google Mini was great.

      It only cost something like $3k IIRC.

      Not exactly what you asked about, but you should definitely see if this wouldn't work for you instead.
      [ Parent ]
      • Re:Google by TooMuchToDo (Score:3) Monday October 01, @05:20PM
        • Re:Google (Score:4, Informative)

          The company I work for, Vivisimo [vivisimo.com], makes an awesome search engine. Although I've never dealt with the Google box directly, I know that we have had customers get fed up with the Google box and replace it quite easily with our software. Click the first link to see a pretty flash demo, or go to Clusty.com [slashdot.org] to try out a subset of the functionality for real. We specialize in "complex, heterogenous search solutions", which exactly fits most intranet sites I've seen. Files are on SMB shares, local disks, Sharepoint, Lotus, Documentum, IMAP, Exchange, etc, etc, etc. We connect to all those sources and provide a unified interface. You can do really neat tricks with combining content across multiple repositories, such as metadata from a database added to files on SMB shares. We support Linux, Solaris, and Windows, all 32 and 64 bit. Although I may work here, it really is a great product, and I use it at home to crawl my email archives and various blogs, websites, forums, things that I use frequently but have sucky search.
          [ Parent ]
          • Re:Google by TooMuchToDo (Score:2) Monday October 01, @08:10PM
            • Re:Google by shepmaster (Score:2) Monday October 01, @08:25PM
          • Re:Google by shepmaster (Score:2) Monday October 01, @08:27PM
        • Re:Google by aoteoroa (Score:1) Monday October 01, @10:22PM
      • Re:Google by A Non Mouse Cowhand (Score:1) Tuesday October 02, @11:02AM
      • 1 reply beneath your current threshold.
    • Re:Google by rgaginol (Score:2) Monday October 01, @08:48PM
    • 1 reply beneath your current threshold.
  • Meta tags placed? (Score:4, Insightful)

    by harmonica (29841) on Monday October 01, @04:40PM (#20816497)
    Who places what types of meta tags in the documents? I don't understand the requirements.

    Generally, Lucene [apache.org] does a good job. It's easy to learn and performance was fine for me and my data (~ 2 GB of textual documents).
    • Meta tags are worthless, generally (Score:4, Insightful)

      by Anonymous Coward on Monday October 01, @05:35PM (#20817029)
      Meta tags are worthless, generally, unless you have a librarian who ensures correctness.
      DON'T TRUST USERS TO ENTER META DATA!!!
      I've worked in electronic document management in 3 different businesses and metadata entered by end users is worse than worthless - it is wrong. Searches that don't use full text for general documents are less than ideal.

      Just to prove that your question is missing critical data:
        - how many documents?
        - how large are the average and largest documents?
        - what format will be input? PDF, HTML, XLS, PPT, OO, C++, what?
        - what search tools do you use elsewhere?
        - any budget constraints?
        - did you look at general document management systems? Documentum, Docushare, Filenet, Sharepoint? If so, what didn't work with these systems?
        - Did you consider OSS solutions? htdig, e-swish, custom searching?
        - A buddy of mine wrote an article on "how to index anything" that was in the Linux Journal a few years ago. Google is your friend.

      AND if i didn't get this across yet - DON'T TRUST META DATA IN HIDDEN DOCUMENT FIELDS - bad Metadata in MS-Office files will completely destroy the usefulness of your searches.
      [ Parent ]
    • Re:Meta tags placed? by rainmayun (Score:2) Monday October 01, @05:43PM
  • Google Desktop or Appliance (Score:4, Insightful)

    by wsanders (114993) on Monday October 01, @04:40PM (#20816507)
    Because if you have to spend more than an hour on this kind of project nowadays, you're wasting your time.

    The inexpensive Google appliances don't have very fine-grained access control, though. But I am involved in several semi-failed projects of this nature in my organization, both new and legacy, and my Google Desktop outperforms all of them.
  • Swish-E (Score:3, Informative)

    by ccandreva (409807) on Monday October 01, @04:41PM (#20816517)
    (http://www.westnet.com/~chris/)
  • Open Source by HartDev (Score:1) Monday October 01, @04:43PM
  • Beagle, Spotlight? (Score:3, Insightful)

    Is this something that would suit your needs: Beagle for Linux [wikipedia.org], Spotlight for OSX [wikipedia.org]? I haven't tried Beagle (I don't have root access on my Debian installation at work), but Spotlight is probably my most cherished feature in OSX... it's so useful.
  • Google Applicance by spcherub (Score:1) Monday October 01, @04:44PM
  • All depends on the document filters ... by molarmass192 (Score:2) Monday October 01, @04:45PM
  • mnogosearch by nereid666 (Score:1) Monday October 01, @04:45PM
  • BOFH by wilymage (Score:2) Monday October 01, @04:47PM
  • by mind21_98 (18647) on Monday October 01, @04:50PM (#20816611)
    (http://www.thoughtbug.com/ | Last Journal: Thursday September 27, @05:52PM)
    If you're using Office 2007, you can probably hack something together really quickly to pull the meta tags from the files and put them in a database. Not sure about the other formats you need, though--and support from Google, for instance, would probably be beneficial for your company anyway. Hope that helps!
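
One way to "hack something together" along the lines of the comment above, sketched in Python. Office 2007 files (.docx, .xlsx, .pptx) are ZIP packages, and their core properties normally live in `docProps/core.xml`. The sketch reads the title, creator, and keywords from that part and stores them in SQLite. The table layout and function names are assumptions for illustration, and a file without that part will raise `KeyError`.

```python
import sqlite3
import xml.etree.ElementTree as ET
import zipfile

# XML namespaces used by the core-properties part of an Office 2007 package.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def read_core_properties(path_or_file):
    """Read title, author, and keywords from an Office 2007 file."""
    with zipfile.ZipFile(path_or_file) as package:
        root = ET.fromstring(package.read("docProps/core.xml"))

    def text(tag):
        node = root.find(tag, NS)
        return node.text if node is not None and node.text else ""

    return {
        "title": text("dc:title"),
        "author": text("dc:creator"),
        "keywords": text("cp:keywords"),
    }


def store(conn, name, props):
    """Insert or update one document's metadata (hypothetical table layout)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS docs "
        "(name TEXT PRIMARY KEY, title TEXT, author TEXT, keywords TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO docs VALUES (?, ?, ?, ?)",
        (name, props["title"], props["author"], props["keywords"]),
    )
    conn.commit()
```

Older binary formats (.doc, .xls) and PDFs store their metadata differently and would need separate extractors.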
  • Google corporate search by Gwala (Score:2) Monday October 01, @04:50PM
  • what to avoid by Anonymous Coward (Score:2) Monday October 01, @04:52PM
  • Most easy solution (Score:4, Informative)

    It will cost you some bucks: just buy MS SharePoint Portal Server, and leave the indexing to SharePoint.
    You're not even really required to use added tags... (as most people will put in poor tags).

    But if you like you can add tags even with sharepoint.
    • Re:Most easy solution (Score:4, Informative)

      by DigitalSorceress (156609) on Monday October 01, @06:23PM (#20817455)
      Actually, if you are an MS shop and have Microsoft Server 2003, SharePoint Services 3.0 is free, as opposed to SharePoint Portal Server (now renamed, I believe, to Microsoft Office SharePoint Server), which does indeed cost a packet.

      I do a lot of LAMP development, and I'm not the strongest fan of Microsoft for a lot of things, but if you have a MS desktop and MS Office environment, SharePoint Services really is quite decent for INTRANET applications, especially for collaboration. You can set up work flows for check-out/check-in, and it integrates really nicely with some of the more recent MS Office releases. If you connect it to a real MS SQL server on the back end (as opposed to the express edition that it defaults to), you can have full text indexing even with the free SharePoint Services version. The only need for the full blown Portal/MOSS version is if you think you are going to have a large number of SharePoint sites and want to simplify cross-connecting and management. (At least as far as I can recall)

      I'm not saying SharePoint is the way to go, but I'd at least read up on it and consider it IF you have a lot of MS Office stuff that you plan on indexing/sharing.

      I'd strongly advise avoiding it if you plan to do Internet-based stuff though... at least until you get a good enough understanding of the security issues involved that you feel that you really know what you're doing.

      Just my $0.02 worth.
      [ Parent ]
    • Re:Most easy solution by Anonymous Coward (Score:1) Monday October 01, @06:46PM
    • Re:Most easy solution by JacobO (Score:2) Monday October 01, @07:15PM
    • Re:Most easy solution by guruevi (Score:3) Monday October 01, @10:21PM
    • Re:Most easy solution by big ben bullet (Score:2) Tuesday October 02, @02:51AM
    • Re:Most easy solution by Inda (Score:2) Tuesday October 02, @09:44AM
    • 1 reply beneath your current threshold.
  • Upload it to the web by omgamibig (Score:2) Monday October 01, @04:57PM
  • Check out Alfresco! (Score:3, Interesting)

    by thule (9041) on Monday October 01, @05:01PM (#20816713)
    (http://www.zimage.com)
    I posted this before on slashdot. I discovered a while ago a cool system called Alfresco [alfresco.com]. There are free (as in liberty) and commercial versions. It acts like an SMB (like SAMBA), ftp, and WebDAV server so you don't have to use the web interface to get files into the system. Users can map it as a network drive. The web interface allows users to set metatags, retrieve previous versions of the file, and most importantly, search the documents in the system.

    Alfresco also has plugins for Microsoft Office so you can manage the repository from Word, etc. They are also working on OpenOffice integration.

    Don't use SAMBA for .doc and .pdf files, use Alfresco.

    I am not affiliated with Alfresco, just a happy user.
  • X1 by bhovinga (Score:1) Monday October 01, @05:10PM
  • Also see Xapian (Score:5, Informative)

    by dmeranda (120061) on Monday October 01, @05:12PM (#20816793)
    (http://deron.meranda.us/)
    I'd suggest you should consider a full-text search engine. First start here:
    http://en.wikipedia.org/wiki/Full_text_search [wikipedia.org]

    If you're not afraid to do a little reading and potentially coding a custom front end, you may want to look at two of the big open source engines: Lucene and Xapian.

    Lucene is quite popular now, and is an Apache Java project. It's a good choice if you're a Java shop.

    Xapian seems to be based on a little more solid and modern information retrieval theory and is incredibly scalable and fast. It's written in C++, with SWIG-based front ends to many languages. It might not have as polished of a front end or as fancy of a website as Lucene, but I believe it's a better choice if you have really really huge data sets or want to venture outside the Java universe.

    There are also many other wholly contained indexers, mostly based on web indexing, that bundle spiders, query forms, etc. all together, like ht://Dig, mnogosearch, and so forth. They are good, especially if you want more of a drop-in solution rather than a raw indexing engine, and if you're indexing web sites (and not complex entities like databases, etc).
  • Depends on size of document base (Score:3, Insightful)

    by Nefarious Wheel (628136) * on Monday October 01, @05:14PM (#20816815)
    It depends on the size of your document base, and how you're going to store it -- if you're using something industry-strength like Documentum or Hummingbird then the Google Mini won't index it, you have to go up a notch and use the yellow box solutions. And if you're using Lotus Notes, you'll need a third party crawler such as C-Search. Google Desktop can be bent into some solutions, and it's free, but for many users you're better off having a separate server do the indexing. Google bills on the number of documents you need to keep in the index at once, and they throw in a bit of tinware to support that on a 2 year contract.

    Disclaimer: I flog Google search solutions at work, so I'm way biased.

  • use a Wiki instead (Score:5, Insightful)

    by poopie (35416) on Monday October 01, @05:20PM (#20816871)
    (Last Journal: Friday March 14 2003, @08:05PM)
    Directories full of random documents in random formats of random version with varying degrees of completeness and accuracy tend to get less useful as an information source as time goes on. Docs get abandoned and continue to provide outdated information and dead links. Doc formats change and require converters to import. Doc maintainers leave the company.

    If you work somewhere where people are not trained to attach Office docs to every email, where people don't use Word to compose 10 bullet points, where people don't use a spreadsheet as a substitute for all sorts of CRM and business applications... a Wiki is actually a good solution.

    You can use something like MediaWiki or Twiki or... heck you can use a whole variety of content management systems.

    The key to success is to *EMPOWER* people to actually update information, and have a few people who are empowered to actually edit, rehash, sort, move, prune wiki pages and content. As the content improves, it will draw in more users and more content creators. Pretty soon, employees will *COMPLAIN* when someone sends out information and doesn't update the wiki.

    Some corporate cultures are not wiki-friendly. Some management chains *fear* the wiki. Some companies have whole webmaster groups who believe it is their job to delay the process of getting useful content onto the web by controlling it. If you're in one of those companies... start up your own wiki and beg for forgiveness later.
  • I'd Be Interested... by morari (Score:1) Monday October 01, @05:24PM
  • True hackers by athloi (Score:2) Monday October 01, @05:26PM
  • FileNet by drhamad (Score:2) Monday October 01, @05:29PM
    • Re:FileNet by James Youngman (Score:2) Monday October 01, @05:45PM
    • Re:FileNet by rainmayun (Score:1) Monday October 01, @05:52PM
    • Re:FileNet by drhamad (Score:1) Monday October 01, @05:53PM
  • I'll plug my software by vondo (Score:2) Monday October 01, @05:31PM
  • $$$ - Universal Content Management by hrieke (Score:2) Monday October 01, @05:33PM
  • Google by JimDaGeek (Score:2) Monday October 01, @05:34PM
  • Swish++, HyperEstraier by ecloud (Score:2) Monday October 01, @05:46PM
  • Requirements spec by Anonymous Coward (Score:1) Monday October 01, @05:50PM
  • cough cough by Evets (Score:2) Monday October 01, @05:59PM
  • by MarkWatson (189759) on Monday October 01, @06:12PM (#20817371)
    (http://www.markwatson.com/)
    There are 2 problems: getting plain text out of documents, then indexing the plain text

    A good tool for getting plain text out of various versions of Word documents is the "antiword" command line utility.

    The Apache POI project (Java) can read and write several Microsoft Office formats.

    For indexing: I like Lucene (Java), Ferret (Ruby+C), and Montezuma (Common Lisp).

    I have mostly been using Ruby the last few years for text processing. Here is a short article I wrote using the Java Lucene library using JRuby:

    http://markwatson.com/blog/2007/06/using-lucene-with-jruby.html [markwatson.com]

    Here is another short snippet for reading OpenOffice.org documents in Ruby:

    http://markwatson.com/blog/2007/05/why-odf-is-better-than-microsofts.html [markwatson.com]

    ---

    You might just want to use the entire Nutch stack, which collects documents, spiders the web, has plugins for many document types, etc. Good stuff!

    http://lucene.apache.org/nutch/ [apache.org]
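
To make the second of the two problems above concrete (indexing the plain text once it has been extracted), here is a toy inverted index in Python. It is a teaching sketch, not a stand-in for Lucene, Ferret, or Montezuma: there is no ranking, stemming, or persistence, and the class and method names are invented for illustration.

```python
import re
from collections import defaultdict


class InvertedIndex:
    """Toy full-text index: maps each term to the set of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    @staticmethod
    def tokenize(text):
        # Lowercase and split on anything that is not a letter or digit.
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, doc_id, text):
        """Index the plain text of one document under `doc_id`."""
        for term in self.tokenize(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return the IDs of documents containing every term in `query`."""
        terms = self.tokenize(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result
```

The text passed to `add` would come from whatever extractor fits the format, such as antiword for Word files.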
  • If money's not an object... by djpretzel (Score:2) Monday October 01, @06:23PM
  • WSS 3.0 by madagajs (Score:1) Monday October 01, @06:33PM
  • Install Wumpus Search by gvc (Score:2) Monday October 01, @06:52PM
  • Full text search? by CoffeeIsMyGod (Score:1) Monday October 01, @06:54PM
  • IBM OmniFind by bbdd (Score:2) Monday October 01, @06:55PM
  • Lucene by AtomicDevice (Score:1) Monday October 01, @07:10PM
  • You may be asking the WRONG question... by ivi (Score:2) Monday October 01, @07:45PM
  • Access Control and Searching by queenb**ch (Score:2) Monday October 01, @07:52PM
  • Hack Something Together? by LionKimbro (Score:2) Monday October 01, @07:56PM
  • An answer and a question - metadata tagging? by JonToycrafter (Score:2) Monday October 01, @08:10PM
  • Grep and strings by jshriverWVU (Score:2) Monday October 01, @08:28PM
  • Commercial tool also available by lotus87 (Score:1) Monday October 01, @08:40PM
  • htDig by mattr (Score:2) Monday October 01, @09:39PM
  • Forget meta, it's all about the content by aussiedood (Score:1) Monday October 01, @10:19PM
  • Google search by PeteyG (Score:2) Monday October 01, @11:32PM
  • Another suggestion and things to look our for by PassMark (Score:1) Tuesday October 02, @12:32AM
  • There are lots of choices... by LauraW (Score:2) Tuesday October 02, @12:59AM
  • Alfresco by golemwashere (Score:1) Tuesday October 02, @01:39AM
  • openkast by DangerousDriver (Score:1) Tuesday October 02, @02:03AM
  • So, what are the requirements? by Tim C (Score:2) Tuesday October 02, @02:13AM
  • Need some advice by omega4711 (Score:1) Tuesday October 02, @02:34AM
  • dunno, but probably not the way I did it... by tomandlu (Score:1) Tuesday October 02, @03:39AM
  • Aduna Aufofocus/Metadata Server by xanton159 (Score:1) Tuesday October 02, @04:06AM
  • Managing Gigabytes by chris_sawtell (Score:2) Tuesday October 02, @05:06AM
  • How about Oracle text by iq1 (Score:2) Tuesday October 02, @05:34AM
  • If you have any spare time left over...... by heffrey (Score:1) Tuesday October 02, @05:49AM
  • Lucene Subprojects by esme (Score:2) Tuesday October 02, @07:17AM
  • kinosearch, swish-e, zebra, ht:/dig, etc. by ericleasemorgan (Score:1) Tuesday October 02, @07:46AM
  • by rclandrum (870572) on Tuesday October 02, @08:25AM (#20821945)
    (http://www.mindwrap.com/)
    As someone who has made a 30-year career out of designing and building document management systems, I would urge you to look first at how you expect your users to find the documents they need. The expected results of a search should guide your choice of indexing methods - and the popular "meta tagging" method isn't always the best. There are shortcomings with all methods.

    Full-text indexing allows users to search the entire contents of documents, but the results are imprecise and voluminous and not terribly useful in most cases (think web search engines here). Yes, you can find all documents that contain the word "patent", but you get a lot of old references to patent leather shoes in addition to what you were probably after. So, with full-text search you get it all, but force the user to subsearch for what they really want.

    Using meta-tags gives the appearance of pre-classifying documents and having the users do it themselves means you don't have to have a dedicated person to assign the tags. The disadvantage is that everybody makes up their own tags or if you have a standard set, you have to rely on people being diligent about applying them. And tag popularity can easily change over time. For example, if you want to find docs that refer to "removable media", this might have garnered a "floppy" tag 15 years ago and "CD" or "DVD" today. You are therefore almost guaranteed to miss some documents using this method.

    Database indexing means that you list all your docs in a database, perhaps by title, author, date, or other fields that your users would find useful for searching. The advantage is that every document is indexed the same way, searching is really fast, and the results are usually relevant if your schema is meaningful. The disadvantages are that indexing the docs takes work on input and users need to know how to search to get the best results.

    Finally, you could organize the docs by simple name and folder. This works fine for the desktop and users usually can identify the category that points them to the folder they want. The disadvantage is that this only works well for limited document sets. Once you start getting hundreds of categories and thousands and thousands of documents, things become too hard to find.

    So - understand your users search requirements and the size of your expected database. Only then can you make an informed decision about how to create and index the repository.
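
The database-indexing option described in the comment above might look something like the following Python and SQLite sketch. The schema (title, author, date, location), the index, and the function names are assumptions chosen for illustration. Dates are stored as ISO 8601 text so that plain string comparison orders them correctly.

```python
import sqlite3


def open_catalog(path=":memory:"):
    """Create (or open) a hypothetical document catalog."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS documents (
               id INTEGER PRIMARY KEY,
               title TEXT NOT NULL,
               author TEXT NOT NULL,
               doc_date TEXT NOT NULL,   -- ISO 8601, e.g. 2007-10-01
               location TEXT NOT NULL    -- where the file actually lives
           )"""
    )
    # Index the fields users are expected to search on.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_author_date ON documents (author, doc_date)"
    )
    return conn


def add_document(conn, title, author, doc_date, location):
    conn.execute(
        "INSERT INTO documents (title, author, doc_date, location) VALUES (?, ?, ?, ?)",
        (title, author, doc_date, location),
    )
    conn.commit()


def find_by_author(conn, author, since="0000-00-00"):
    """Return (title, date, location) rows for an author, oldest first."""
    rows = conn.execute(
        "SELECT title, doc_date, location FROM documents "
        "WHERE author = ? AND doc_date >= ? ORDER BY doc_date",
        (author, since),
    )
    return rows.fetchall()
```

The trade-off the commenter mentions shows up directly: every row has to be filled in when the document is added, and users need to know which fields exist to query them well.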
  • And if you want some reading. by StarfishOne (Score:2) Tuesday October 02, @08:49AM
  • Recommend the FAST product by Evil W1zard (Score:2) Tuesday October 02, @11:36AM
  • SCAN by Johann_NL (Score:1) Tuesday October 02, @12:42PM
  • rhe best searchable softwatr? by hellmuthchileno (Score:1) Wednesday October 03, @07:39PM
  • Thanks by Blinocac (Score:1) Monday October 08, @03:06PM
  • When is a directory with files no longer good? by jollyreaper (Score:2) Tuesday October 09, @09:46AM
  • Navigate the metadata, use Dieselpoint by ccleve (Score:1) Thursday October 11, @06:01PM
  • Re:Avoid kat and beagle by Tablizer (Score:1) Monday October 01, @04:54PM
  • Re:Livelink (Score:3, Funny)

    You are in marketing aren't you?

    (I'm sold anyway)
    [ Parent ]
    • 1 reply beneath your current threshold.
  • Re:Easy (Score:3, Informative)

    If you host all of your documentation on a website, take a look at ht://dig [http://www.htdig.org].

    I've deployed it across a handful of servers, and it does a good job of crawling, but doesn't do well with javascript. If you have javascript for your web's frontend, you can write a shell script that runs find . -print, prepends the base URL to each path, and writes the results to a file, then point htdig at that file. It will dig into each file it finds, and create a searchable database of everything that it finds.

    You add /cgi-bin/search.cgi to your page, and you can auto-magically search your documentation.
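
The URL-list workaround described above (walk the document tree, prepend the site URL to each path, write the list to a file) can be sketched in Python instead of shell. The function name and the example base URL in the usage line are hypothetical, and how htdig is pointed at the resulting file depends on its configuration, which is not shown here.

```python
import os
from urllib.parse import quote


def write_url_list(doc_root, base_url, out_path):
    """Write one URL per file found under `doc_root` to `out_path`.

    Roughly equivalent to running `find . -print` in the document root
    and prefixing every path with the site's base URL.
    """
    urls = []
    for dirpath, _dirnames, filenames in os.walk(doc_root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), doc_root)
            # Use forward slashes and percent-encode characters such as spaces.
            urls.append(base_url.rstrip("/") + "/" + quote(rel.replace(os.sep, "/")))
    urls.sort()
    with open(out_path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(urls) + ("\n" if urls else ""))
    return urls
```

A call such as `write_url_list("/var/www/docs", "http://intranet.example/docs", "/tmp/urls.txt")` would produce the file to feed to the crawler.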

    - Avron
    [ Parent ]
  • 18 replies beneath your current threshold.