Best Way to Build a Searchable Document Index?

Blinocac writes "I am organizing the IT documentation for the agency I work for, and we would like to build a searchable document index that renders results based on meta tags placed in the documents, which span everything from Word files, HTML, Excel, and Access to PDFs." What methods or tools have others seen work? Anything to avoid?
  • Check out Alfresco! (Score:3, Interesting)

    by thule ( 9041 ) on Monday October 01, 2007 @06:01PM (#20816713) Homepage
    I posted this before on Slashdot. A while ago I discovered a cool system called Alfresco [alfresco.com]. There are free (as in liberty) and commercial versions. It acts as an SMB (like Samba), FTP, and WebDAV server, so you don't have to use the web interface to get files into the system. Users can map it as a network drive. The web interface lets users set metadata tags, retrieve previous versions of a file, and, most importantly, search the documents in the system.

    Alfresco also has plugins for Microsoft Office so you can manage the repository from Word, etc. They are also working on OpenOffice integration.

    Don't use a plain Samba share for your .doc and .pdf files; use Alfresco.

    I am not affiliated with Alfresco, just a happy user.
  • by Anonymous Coward on Monday October 01, 2007 @06:11PM (#20816783)
    and Copernic Desktop Search outperforms GDS by a long way...
  • Re:Google (Score:3, Interesting)

    by TooMuchToDo ( 882796 ) on Monday October 01, 2007 @06:20PM (#20816867)
    Seconded. I've done implementations (hosted and in-house) of both the Google Mini and the full-blown Enterprise appliances. They are amazing creatures. I would recommend the Mini to almost anyone, while the Enterprise costs a pretty penny.
  • by G1369311007 ( 719689 ) <StevenJSSanders@@@yahoo...com> on Monday October 01, 2007 @06:47PM (#20817155) Journal
    Along the same lines as Alfresco is Plone. www.plone.org I'm currently the sole admin of a Plone site serving ~50 users on an intranet. We use it for document management, among other things. Just another option. The CIA's website is built with Plone, so it can't be that bad, right?! www.cia.gov
  • by LadyLucky ( 546115 ) on Monday October 01, 2007 @08:12PM (#20817805) Homepage
    We've been using Confluence, from Atlassian [atlassian.com.au] for our wiki, and it's pretty fully featured for a wiki.
  • by anthonys_junk ( 1110393 ) <(moc.liamg) (ta) (knujsynohtna)> on Monday October 01, 2007 @11:25PM (#20819187)

    Meta-data is one of those things that seems like a really good idea, but like all plans, doesn't tend to survive contact with the enemy, which in this case is the user.

    Like software development, the quality of the outcome is implementation dependent.

    We run six thesauri, plus a number of different controlled lists, for our users to input metadata against. We don't publish any documents that don't have metadata attached, and we perform random quality audits. Our users have been trained in the fundamentals of classification and in the payoff for getting it right.

    We use metadata to structure navigation for some sites, and we depend on it for search across our internal documents. Our metadata implementation works incredibly well for us, clearly and consistently outperforming plain-text indexing.
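    The publish-time quality gate described above (no document goes out without controlled metadata) can be sketched in a few lines of Python. This is a toy illustration only; the vocabulary terms and document names are made up for the example:

    ```python
    # Minimal sketch of controlled-vocabulary metadata search.
    # Terms and document names are illustrative, not from any real system.

    CONTROLLED_TERMS = {"networking", "backup", "security", "licensing"}

    index = {}  # term -> set of document ids

    def publish(doc_id, terms):
        """Reject documents whose tags are not in the controlled list,
        mirroring a publish-time quality gate."""
        bad = set(terms) - CONTROLLED_TERMS
        if bad:
            raise ValueError(f"uncontrolled terms: {sorted(bad)}")
        for t in terms:
            index.setdefault(t, set()).add(doc_id)

    def search(term):
        """Exact-tag lookup; returns matching document ids, sorted."""
        return sorted(index.get(term, set()))

    publish("vpn-howto.doc", ["networking", "security"])
    publish("tape-rotation.xls", ["backup"])
    print(search("security"))  # -> ['vpn-howto.doc']
    ```

    The point of the gate is that search quality is enforced at write time rather than patched up at query time, which is exactly why audited metadata can outperform plain-text indexing.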

  • by ditto999999999999999 ( 546129 ) on Monday October 01, 2007 @11:33PM (#20819249)
    It's true. I finished a double major at university, worked in a relevant field the whole time, and have excellent references, and now I can't find work... Hire me and I will do this for you.
  • Re:Lucene (Score:3, Interesting)

    by bwt ( 68845 ) on Tuesday October 02, 2007 @01:03AM (#20819733)
    Dealing with large data sets isn't really technologically challenging. You can grow to an arbitrarily large data set simply by partitioning: it works for Google. It may be expensive to stand up a bunch of servers, but I don't think it's really that hard.

    What is more complicated is dealing with large numbers of concurrent requests. Then you need clustering. There are big sites that do both partitioning and clustering simultaneously with Lucene; I seem to recall reading that Technorati runs Lucene on a cluster of 40 servers. With 8TB of data you are going to have a big one-time computational job to build the index. Lucene can recombine indexes, so you could distribute index creation across many servers.
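    The partition-then-recombine step described above can be sketched with a toy inverted index in Python. This stands in for what Lucene's index-merging does at real scale; the shard contents and file names are invented for the example:

    ```python
    # Toy illustration of partitioned index building: each worker indexes
    # a shard of the corpus, then the partial inverted indexes are merged
    # -- the same recombination Lucene performs when combining indexes.

    def build_partial(docs):
        """Index one shard: map each word to the documents containing it."""
        part = {}
        for doc_id, text in docs.items():
            for word in set(text.lower().split()):
                part.setdefault(word, set()).add(doc_id)
        return part

    def merge(parts):
        """Recombine partial indexes into one index by unioning postings."""
        merged = {}
        for part in parts:
            for word, ids in part.items():
                merged.setdefault(word, set()).update(ids)
        return merged

    # Two "servers" each index their own shard; the results are merged.
    shard1 = build_partial({"a.txt": "disk quota policy"})
    shard2 = build_partial({"b.txt": "backup policy draft"})
    index = merge([shard1, shard2])
    print(sorted(index["policy"]))  # -> ['a.txt', 'b.txt']
    ```

    Because the merge is a simple union of postings lists, the expensive indexing work parallelizes cleanly across machines, which is the whole argument for distributing index creation.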
  • by iq1 ( 787035 ) on Tuesday October 02, 2007 @06:34AM (#20821121)
    I know this will draw flames, but: you might try Oracle Text (also part of Oracle XE).

    It supports about 140 document formats, has a lot of options, and works via SQL.
    It can build indexes for documents stored in the database or in the file system.
    You can even join the search terms from the document with the database records where your application stores its metadata.
    I found that very helpful in similar projects. And it's free.
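    Oracle isn't handy to demo here, but the join-text-hits-with-metadata pattern described above can be sketched with Python's built-in sqlite3, using a plain LIKE match as a stand-in for Oracle Text's full-text operator. Table and column names are invented for the example:

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE docs(id INTEGER PRIMARY KEY, body TEXT);
        CREATE TABLE meta(doc_id INTEGER, author TEXT, dept TEXT);
    """)
    conn.execute("INSERT INTO docs VALUES (1, 'annual backup procedure')")
    conn.execute("INSERT INTO docs VALUES (2, 'holiday roster')")
    conn.execute("INSERT INTO meta VALUES (1, 'jsmith', 'IT')")
    conn.execute("INSERT INTO meta VALUES (2, 'kjones', 'HR')")

    # Join the text match with application-stored metadata in one query,
    # the same shape as a full-text search joined against a metadata table.
    rows = conn.execute("""
        SELECT d.id, m.author
        FROM docs d JOIN meta m ON m.doc_id = d.id
        WHERE d.body LIKE '%backup%' AND m.dept = 'IT'
    """).fetchall()
    print(rows)  # -> [(1, 'jsmith')]
    ```

    The appeal of doing this in SQL is that the full-text predicate composes with ordinary relational filters, so metadata constraints and content search run in a single query.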
