Best Way to Build a Searchable Document Index?

Blinocac writes "I am organizing the IT documentation for the agency I work for, and we would like to build a searchable document index that renders results based on meta tags placed in the documents, which span everything from Word files, HTML, Excel, and Access to PDFs." What methods or tools have others seen work? Anything to avoid?
  • Check out Alfresco! (Score:3, Interesting)

    by thule ( 9041 ) on Monday October 01, 2007 @06:01PM (#20816713) Homepage
    I posted this before on Slashdot. A while ago I discovered a cool system called Alfresco [alfresco.com]. There are free (as in liberty) and commercial versions. It acts as an SMB (like Samba), FTP, and WebDAV server, so you don't have to use the web interface to get files into the system. Users can map it as a network drive. The web interface lets users set metadata tags, retrieve previous versions of a file, and, most importantly, search the documents in the system.

    Alfresco also has plugins for Microsoft Office so you can manage the repository from Word, etc. They are also working on OpenOffice integration.

    Don't use a plain Samba share for your .doc and .pdf files; use Alfresco.

    I am not affiliated with Alfresco, just a happy user.
  • by Anonymous Coward on Monday October 01, 2007 @06:11PM (#20816783)
    and Copernic Desktop Search outperforms GDS by a long way...
  • Re:Google (Score:3, Interesting)

    by TooMuchToDo ( 882796 ) on Monday October 01, 2007 @06:20PM (#20816867)
    Seconded. I've done implementations (hosted and in-house) of both the Google Mini and the full-blown Enterprise appliances. They are amazing creatures. I would recommend the Mini to almost anyone, while the Enterprise costs a pretty penny.
  • by G1369311007 ( 719689 ) <StevenJSSanders@@@yahoo...com> on Monday October 01, 2007 @06:47PM (#20817155) Journal
    Along the same lines as Alfresco is Plone. www.plone.org I'm currently the sole admin of a Plone site serving ~50 users on an intranet. We use it for document management, among other things. Just another option. The CIA's website is built with Plone, so it can't be that bad, right?! www.cia.gov
  • by LadyLucky ( 546115 ) on Monday October 01, 2007 @08:12PM (#20817805) Homepage
    We've been using Confluence, from Atlassian [atlassian.com.au] for our wiki, and it's pretty fully featured for a wiki.
  • by anthonys_junk ( 1110393 ) <(moc.liamg) (ta) (knujsynohtna)> on Monday October 01, 2007 @11:25PM (#20819187)

    Meta-data is one of those things that seems like a really good idea, but like all plans, doesn't tend to survive contact with the enemy, which in this case is the user.

    Like software development, the quality of the outcome is implementation dependent.

    We run six thesauri, plus a number of different controlled lists, for our users to input metadata against. We don't publish any documents that don't have metadata attached, and we perform random quality audits. Our users have been trained in the fundamentals of classification and in the payoff for getting it right.

    We use metadata to structure navigation for some sites, and we depend on it for search across our internal documents. Our metadata implementation works incredibly well for us, clearly and consistently outperforming plain-text indexing.
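    The publish-time quality gate described above (no document goes out without controlled metadata) can be sketched in a few lines of Python. This is a toy illustration only; the vocabulary terms and document names are made up for the example:

    ```python
    # Minimal sketch of controlled-vocabulary metadata search.
    # Terms and document names are illustrative, not from any real system.

    CONTROLLED_TERMS = {"networking", "backup", "security", "licensing"}

    index = {}  # term -> set of document ids

    def publish(doc_id, terms):
        """Reject documents whose tags are not in the controlled list,
        mirroring a publish-time quality gate."""
        bad = set(terms) - CONTROLLED_TERMS
        if bad:
            raise ValueError(f"uncontrolled terms: {sorted(bad)}")
        for t in terms:
            index.setdefault(t, set()).add(doc_id)

    def search(term):
        """Exact-tag lookup; returns matching document ids, sorted."""
        return sorted(index.get(term, set()))

    publish("vpn-howto.doc", ["networking", "security"])
    publish("tape-rotation.xls", ["backup"])
    print(search("security"))  # -> ['vpn-howto.doc']
    ```

    The point of the gate is that search quality is enforced at write time rather than patched up at query time, which is exactly why audited metadata can outperform plain-text indexing.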

  • by ditto999999999999999 ( 546129 ) on Monday October 01, 2007 @11:33PM (#20819249)
    It's true. I finished a double major at university, worked in a relevant field the whole time, and have excellent references, and now I can't find work... Hire me and I will do this for you.
  • Re:Lucene (Score:3, Interesting)

    by bwt ( 68845 ) on Tuesday October 02, 2007 @01:03AM (#20819733)
    Dealing with large data sets isn't really technologically challenging. You can grow to an arbitrarily large data set simply by partitioning: it works for Google. It may be expensive to stand up a bunch of servers, but I don't think it's really that hard.

    What is more complicated is dealing with large numbers of concurrent requests. Then you need clustering. There are big sites that do both partitioning and clustering simultaneously with Lucene; I seem to recall reading that Technorati runs Lucene on a cluster of 40 servers. With 8TB of data you are going to have a big one-time computational job to build the index. Lucene can recombine indexes, so you could distribute index creation across many servers.
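    The partition-then-recombine step described above can be sketched with a toy inverted index in Python. This stands in for what Lucene's index-merging does at real scale; the shard contents and file names are invented for the example:

    ```python
    # Toy illustration of partitioned index building: each worker indexes
    # a shard of the corpus, then the partial inverted indexes are merged
    # -- the same recombination Lucene performs when combining indexes.

    def build_partial(docs):
        """Index one shard: map each word to the documents containing it."""
        part = {}
        for doc_id, text in docs.items():
            for word in set(text.lower().split()):
                part.setdefault(word, set()).add(doc_id)
        return part

    def merge(parts):
        """Recombine partial indexes into one index by unioning postings."""
        merged = {}
        for part in parts:
            for word, ids in part.items():
                merged.setdefault(word, set()).update(ids)
        return merged

    # Two "servers" each index their own shard; the results are merged.
    shard1 = build_partial({"a.txt": "disk quota policy"})
    shard2 = build_partial({"b.txt": "backup policy draft"})
    index = merge([shard1, shard2])
    print(sorted(index["policy"]))  # -> ['a.txt', 'b.txt']
    ```

    Because the merge is a simple union of postings lists, the expensive indexing work parallelizes cleanly across machines, which is the whole argument for distributing index creation.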
  • by iq1 ( 787035 ) on Tuesday October 02, 2007 @06:34AM (#20821121)
    I know this will draw flames, but: you might try Oracle Text (also part of Oracle XE).

    It supports about 140 document formats, has a lot of options, and works via SQL.
    It can build indexes for documents stored in the database or in the file system.
    You can even join the search terms from the document with the database records where your application stores its metadata.
    I found that very helpful in similar projects. And it's free.
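    Oracle isn't handy to demo here, but the join-text-hits-with-metadata pattern described above can be sketched with Python's built-in sqlite3, using a plain LIKE match as a stand-in for Oracle Text's full-text operator. Table and column names are invented for the example:

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE docs(id INTEGER PRIMARY KEY, body TEXT);
        CREATE TABLE meta(doc_id INTEGER, author TEXT, dept TEXT);
    """)
    conn.execute("INSERT INTO docs VALUES (1, 'annual backup procedure')")
    conn.execute("INSERT INTO docs VALUES (2, 'holiday roster')")
    conn.execute("INSERT INTO meta VALUES (1, 'jsmith', 'IT')")
    conn.execute("INSERT INTO meta VALUES (2, 'kjones', 'HR')")

    # Join the text match with application-stored metadata in one query,
    # the same shape as a full-text search joined against a metadata table.
    rows = conn.execute("""
        SELECT d.id, m.author
        FROM docs d JOIN meta m ON m.doc_id = d.id
        WHERE d.body LIKE '%backup%' AND m.dept = 'IT'
    """).fetchall()
    print(rows)  # -> [(1, 'jsmith')]
    ```

    The appeal of doing this in SQL is that the full-text predicate composes with ordinary relational filters, so metadata constraints and content search run in a single query.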
