Best Way to Build a Searchable Document Index?
Blinocac writes "I am organizing the IT documentation for the agency I work for, and we would like to make a searchable document index that would render results based on meta tags placed in the documents, which include everything from Word files, HTML, Excel, and Access to PDFs." What methods or tools have others seen that work? Anything to avoid?
Check out Alfresco! (Score:3, Interesting)
Alfresco also has plugins for Microsoft Office so you can manage the repository from Word, etc. They are also working on OpenOffice integration.
Don't use SAMBA for
I am not affiliated with Alfresco, just a happy user.
Re:Google Desktop or Appliance (Score:1, Interesting)
Re:Google (Score:3, Interesting)
Re:Check out Alfresco! (Score:1, Interesting)
Re:use a Wiki instead (Score:3, Interesting)
Re:Google Desktop or Appliance (Score:2, Interesting)
As in software development, the quality of the outcome depends on the implementation.
We run six thesauri, plus a number of controlled lists that our users draw on when entering metadata. We don't publish any document that doesn't have metadata attached, and we perform random quality audits. Our users have been trained in the fundamentals of classification and in the payoff for getting it right.
We use metadata to structure the navigation for some sites, and we depend on it for search across our internal documents. Our metadata implementation works incredibly well for us, clearly and consistently outperforming plain-text indexing.
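To illustrate publishing only documents whose metadata comes from a controlled list, here is a minimal Python sketch. The vocabulary, field names, and sample documents are all hypothetical and are not the poster's actual setup; a real thesaurus would be much larger and would also handle synonyms and broader/narrower terms.

```python
# Hypothetical controlled vocabulary: field name -> allowed terms.
CONTROLLED_TERMS = {
    "department": {"finance", "hr", "it"},
    "doc_type": {"policy", "procedure", "report"},
}

def validate_metadata(metadata):
    """Return a list of problems; an empty list means the document may be published."""
    problems = []
    for field, allowed in CONTROLLED_TERMS.items():
        value = metadata.get(field)
        if value is None:
            problems.append(f"missing field: {field}")
        elif value not in allowed:
            problems.append(f"{field}={value!r} is not in the controlled list")
    return problems

def search_by_metadata(documents, **criteria):
    """Return names of documents whose metadata matches every criterion exactly."""
    return [
        name for name, metadata in documents.items()
        if all(metadata.get(key) == value for key, value in criteria.items())
    ]

# Made-up sample documents; the last one fails validation.
documents = {
    "backup-policy.doc": {"department": "it", "doc_type": "policy"},
    "expenses.xls": {"department": "finance", "doc_type": "report"},
    "untitled.pdf": {"department": "misc"},
}

# Only documents with clean metadata make it into the searchable set.
publishable = {n: m for n, m in documents.items() if not validate_metadata(m)}
print(search_by_metadata(publishable, department="it"))
```

The point of the gate is that search only ever sees values from the controlled list, so an exact-match lookup is enough and there is no guessing about spelling variants.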
Re:Gee I don't know.. (Score:3, Interesting)
Re:Lucene (Score:3, Interesting)
Dealing with large numbers of concurrent requests is more complicated; for that you need clustering. There are big sites that do both partitioning and clustering simultaneously with Lucene. I seem to recall reading that Technorati uses Lucene on a cluster of 40 servers. With 8TB of data you are going to have a big one-time computational problem building the index. Lucene can recombine indexes, so you could distribute index creation across a lot of servers.
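The distribute-then-recombine idea can be sketched in plain Python. This is not Lucene code: the toy inverted index, the function names, and the sample documents are stand-ins chosen only to show the shape of the approach (in Lucene itself, recombining is what `IndexWriter.addIndexes` is for).

```python
from collections import defaultdict

def build_partial_index(docs):
    """Build an inverted index (term -> set of doc ids) for one batch of documents."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def merge_indexes(partials):
    """Combine partial indexes by unioning the posting sets for each term."""
    merged = defaultdict(set)
    for partial in partials:
        for term, doc_ids in partial.items():
            merged[term] |= doc_ids
    return merged

def search(index, query):
    """Return ids of documents containing every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)

# Each batch stands in for the share of the corpus handed to one server.
batch_a = {"a1": "backup policy for file servers", "a2": "printer setup guide"}
batch_b = {"b1": "file server restore procedure", "b2": "backup schedule"}

merged = merge_indexes([build_partial_index(batch_a), build_partial_index(batch_b)])
print(sorted(search(merged, "backup")))
```

Because each partial index is built without looking at any other batch, the expensive step parallelizes trivially; only the merge needs to see everything.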
How about Oracle text (Score:2, Interesting)
You might try Oracle Text (also part of Oracle XE).
It supports 140 document formats, has a lot of options, and works via SQL.
It can build indexes for documents stored in the DB or in the file system.
You can even join the search terms from the document with the database records where your application might store metadata.
I found that very helpful in similar projects. And it's free.
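The join described above can be illustrated with a small Python sketch. Note the substitution: this uses SQLite from the standard library and a hand-built term table instead of Oracle Text, where the text predicate would be expressed with the `CONTAINS` operator against a text index. The table names, columns, and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE doc_meta (doc_id INTEGER PRIMARY KEY, title TEXT, owner TEXT);
    CREATE TABLE doc_terms (doc_id INTEGER, term TEXT);
""")

# Hypothetical sample data: metadata plus the raw text of each document.
docs = [
    (1, "Backup policy", "it", "nightly backup of file servers"),
    (2, "Expense report", "finance", "travel expenses for march"),
    (3, "Restore guide", "it", "restore files from backup tapes"),
]
for doc_id, title, owner, text in docs:
    conn.execute("INSERT INTO doc_meta VALUES (?, ?, ?)", (doc_id, title, owner))
    # Stand-in for a real text index: one row per distinct term per document.
    conn.executemany(
        "INSERT INTO doc_terms VALUES (?, ?)",
        [(doc_id, term) for term in set(text.lower().split())],
    )

def search(term, owner):
    """Titles of documents that contain `term` and belong to `owner`."""
    rows = conn.execute(
        """
        SELECT m.title
        FROM doc_terms AS t
        JOIN doc_meta AS m ON m.doc_id = t.doc_id
        WHERE t.term = ? AND m.owner = ?
        ORDER BY m.title
        """,
        (term.lower(), owner),
    )
    return [title for (title,) in rows]

print(search("backup", "it"))
```

Keeping the text match and the metadata filter in one query is the attraction here: the database does the join, and the application never has to reconcile two separate result lists.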