Slashdot
Best Way to Build a Searchable Document Index?
Posted by
ScuttleMonkey
on Mon Oct 01, 2007 05:36 PM
from the build-a-better-boss-trap dept.
Blinocac writes "I am organizing the IT documentation for the agency I work for, and we would like to make a searchable document index that would render results based on meta tags placed in the documents, which include everything from Word files, HTML, Excel, Access, and PDF's." What methods or tools have others seen that work? Anything to avoid?
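For the HTML part of the question, pulling meta tags out of a page needs nothing beyond the Python standard library. This is only a minimal sketch: the `MetaTagParser` class, the `extract_meta` helper, and the sample page are made up for illustration, and Word, Excel, Access, and PDF files would each need their own extractor.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collects <meta name="..." content="..."> pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # Only <meta> tags that carry both a name and a content attribute count.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name")
        content = attr_map.get("content")
        if name and content:
            self.meta[name.lower()] = content


def extract_meta(html_text):
    """Hypothetical helper: return a dict of meta tag names to their content."""
    parser = MetaTagParser()
    parser.feed(html_text)
    return parser.meta


# Made-up sample page standing in for one of the agency's HTML documents.
sample = """<html><head>
<meta name="keywords" content="vpn, remote access">
<meta name="author" content="IT Helpdesk">
</head><body>How to connect to the VPN.</body></html>"""

meta = extract_meta(sample)
```

The resulting dict can then be stored as extra searchable fields alongside whatever full text gets indexed.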
This discussion has been archived.
No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
Gee I don't know.. (Score:4, Funny)
Lucene (Score:5, Informative)
Re:Lucene (Score:5, Informative)
Re:Lucene (Score:4, Informative)
Re:Lucene (Score:5, Informative)
I've got a prototype of the system described in the OP that we did while quoting a fairly large project. It's really easy to have an 'after upload' action that'll push the document through strings (or some other third-party app that can operate similarly, given the document type) and throw the strings into a field that gets indexed as well. That pretty much handles everything you may need.
Obviously I'd also allow someone to specify keywords when uploading a document, but if this engine's going to just be thrown against an existing cache of documents, strings-only's the way to go.
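A rough sketch of that 'after upload' step, assuming nothing about the poster's actual prototype: `extract_strings` imitates what Unix `strings` does (runs of four or more printable ASCII bytes) and `build_index` throws the resulting words into a word-to-documents map. Both function names and the sample bytes are invented for illustration. Real Word and PDF files often store text compressed or in a non-ASCII encoding, so a format-aware extractor will usually recover more.

```python
import re
from collections import defaultdict

# Runs of 4+ printable ASCII bytes, the same default threshold as Unix `strings`.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def extract_strings(data):
    """Return printable ASCII runs found in raw bytes, roughly like `strings`."""
    return [m.group().decode("ascii") for m in PRINTABLE_RUN.finditer(data)]


def build_index(docs):
    """Map each lowercased word to the set of document names containing it."""
    index = defaultdict(set)
    for name, data in docs.items():
        for chunk in extract_strings(data):
            for word in re.findall(r"[a-z0-9]+", chunk.lower()):
                index[word].add(name)
    return index


# Hypothetical uploads: raw bytes standing in for binary office files.
docs = {
    "backup.doc": b"\x00\x01Backup policy\x00\xff\x02for servers\x00",
    "vpn.pdf": b"%PDF\x00\x00VPN setup guide\x00\x03",
}
index = build_index(docs)
```

Looking up `index["vpn"]` then gives the set of uploads whose extracted strings mention that word.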
Re:Lucene (Score:5, Informative)
Yes, they have. In my previous job we had to search 2 terabytes of plain text data (HTML) really fast. The company chose Autonomy, and many developers spent many months trying to make it work, consuming insane amounts of hardware resources for mediocre results, and still requiring . One lone (and brilliant) dev whipped up a Lucene proof of concept in a weekend, and it was faster (full index in a day), required fewer resources (a single HP DL 585, 16GB RAM, 4x dual-core AMD, as opposed to 10 of the same), had a smaller index (about a 5th of Autonomy's), returned results faster, gave a more accurate result set, and was significantly more flexible in making it do what we actually needed it to do.
Lucene wins hands down
Re:Lucene (Score:5, Funny)
Google (Score:3, Informative)
If you value your privacy, invest in a Google Mini, though.
Re:Google (Score:5, Informative)
We even pointed it at the web-cvs server and bugzilla and it was great at searching those too.
To see all the bugs still open against v 2.2.1 or something like that, Bugzilla's own search was better, but for searching for "bugs about X" the Google Mini was great.
It only cost something like $3k, IIRC.
Not exactly what you asked about, but you should definitely see if this wouldn't work for you instead.
Re: (Score:3, Interesting)
Re:Google (Score:4, Informative)
Meta tags placed? (Score:4, Insightful)
Generally, Lucene [apache.org] does a good job. It's easy to learn and performance was fine for me and my data (~ 2 GB of textual documents).
Meta tags are worthless, generally (Score:4, Insightful)
DON'T TRUST USERS TO ENTER META DATA!!!
I've worked in electronic document management in 3 different businesses, and metadata entered by end users is worse than worthless - it is wrong. Searches that don't use full text for general documents are less than ideal.
Just to prove that your question is missing critical data:
- how many documents?
- how large are the average and largest documents?
- what format will be input? PDF, HTML, XLS, PPT, OO, C++, what?
- what search tools do you use elsewhere?
- any budget constraints?
- did you look at general document management systems? Documentum, Docushare, Filenet, Sharepoint? If so, what didn't work with these systems?
- Did you consider OSS solutions? htdig, Swish-e, custom searching?
- A buddy of mine wrote an article on "how to index anything" that was in the Linux Journal a few years ago. Google is your friend.
AND if I didn't get this across yet - DON'T TRUST META DATA IN HIDDEN DOCUMENT FIELDS - bad metadata in MS-Office files will completely destroy the usefulness of your searches.
Google Desktop or Appliance (Score:4, Insightful)
The inexpensive Google appliances don't have very fine-grained access control, though. But I am involved in several semi-failed projects of this nature in my organization, both new and legacy, and my Google Desktop outperforms all of them.
Re: (Score:3, Informative)
I set up a Google Mini for indexing an internal wiki, our bug tracking system, and some other systems, and it is very straight-forward.
I know the original question mentioned meta-data, but you have to ask yourself if the meta-data is going to be maintained well enough that the search index will be valid. Going the Google Appliance route is so much simpler. It takes a bit of tweaking to set up the search res
Swish-E (Score:3, Informative)
Beagle, Spotlight? (Score:3, Insightful)
XML document formats (Score:3)
Most easy solution (Score:4, Informative)
You're not even really required to use added tags... (as most people will put in poor tags).
But if you like you can add tags even with sharepoint.
Re:Most easy solution (Score:4, Informative)
I do a lot of LAMP development, and I'm not the strongest fan of Microsoft for a lot of things, but if you have a MS desktop and MS Office environment, SharePoint Services really is quite decent for INTRANET applications. Especially for collaboration. You can set up work flows for check-out/check-in, and it integrates really nicely with some of the more recent MS Office releases. If you connect it to a real MS SQL server on the back end (as opposed to the express edition that it defaults to), you can have full text indexing even with the free SharePoint Services version. The only need for the full-blown Portal/MOSS version is if you think you are going to have a large number of SharePoint sites, and want to simplify cross-connecting and management. (At least as far as I can recall)
I'm not saying SharePoint is the way to go, but I'd at least read up on it and consider it IF you have a lot of MS Office stuff that you plan on indexing/sharing.
I'd strongly advise avoiding it if you plan to do Internet-based stuff though... at least until you get a good enough understanding of the security issues involved that you feel that you really know what you're doing.
Just my $0.02 worth.
Check out Alfresco! (Score:3, Interesting)
Alfresco also has plugins for Microsoft Office so you can manage the repository from Word, etc. They are also working on OpenOffice integration.
Don't use SAMBA for
I am not affiliated with Alfresco, just a happy user.
Also see Xapian (Score:5, Informative)
http://en.wikipedia.org/wiki/Full_text_search [wikipedia.org]
If you're not afraid to do a little reading and potentially coding a custom front end, you may want to look at two of the big open source engines: Lucene and Xapian.
Lucene is quite popular now, and is an Apache Java project. It's a good choice if you're a Java shop.
Xapian seems to be based on a little more solid and modern information retrieval theory and is incredibly scalable and fast. It's written in C++, with SWIG-based front ends to many languages. It might not have as polished of a front end or as fancy of a website as Lucene, but I believe it's a better choice if you have really really huge data sets or want to venture outside the Java universe.
There are also many other wholly contained indexers, most of which are based on web indexing (they have spiders, query forms, etc.) all bundled together, like ht://Dig, mnogosearch, and so forth. They are good, especially if you want more of a drop-in solution rather than a raw indexing engine, and if you're indexing web sites (and not complex entities like databases, etc).
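To give a feel for what a "raw indexing engine" does under the hood, here is a toy inverted index with a simplified TF-IDF style score. It is a sketch for illustration only: it is not the Lucene or Xapian API, and neither engine necessarily ranks this way. The `TinyIndex` class, the `tokenize` helper, and the sample documents are all made up.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Toy inverted index: term -> {doc_id: term count}."""

    def __init__(self):
        self.postings = defaultdict(dict)
        self.doc_count = 0

    def add(self, doc_id, text):
        self.doc_count += 1
        for term, count in Counter(tokenize(text)).items():
            self.postings[term][doc_id] = count

    def search(self, query):
        """Return doc ids ordered by a simplified TF-IDF score, best first."""
        scores = defaultdict(float)
        for term in tokenize(query):
            docs = self.postings.get(term, {})
            if not docs:
                continue
            # Rarer terms get a higher weight (inverse document frequency).
            idf = math.log(1 + self.doc_count / len(docs))
            for doc_id, count in docs.items():
                scores[doc_id] += count * idf
        return sorted(scores, key=scores.get, reverse=True)


# Made-up documents; the text would come from whatever extractor you use.
index = TinyIndex()
index.add("vpn.html", "VPN setup guide for remote staff")
index.add("backup.doc", "Backup policy for file servers")
index.add("printers.xls", "Printer inventory and setup notes")
results = index.search("vpn setup")
```

The real engines add the parts that matter at scale (on-disk index formats, stemming, phrase queries, incremental updates), which is why you would use one of them rather than roll your own.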
Depends on size of document base (Score:3, Insightful)
Disclaimer: I flog Google search solutions at work, so I'm way biased.
use a Wiki instead (Score:5, Insightful)
If you work somewhere where people are not trained to attach Office docs to every email, where people don't use Word to compose 10 bullet points, where people don't use a spreadsheet as a substitute for all sorts of CRM and business applications... a Wiki is actually a good solution.
You can use something like MediaWiki or TWiki or... heck you can use a whole variety of content management systems.
The key to success is to *EMPOWER* people to actually update information, and have a few people who are empowered to actually edit, rehash, sort, move, prune wiki pages and content. As the content improves, it will draw in more users and more content creators. Pretty soon, employees will *COMPLAIN* when someone sends out information and doesn't update the wiki.
Some corporate cultures are not wiki-friendly. Some management chains *fear* the wiki. Some companies have whole webmaster groups who believe it is their job to delay the process of getting useful content onto the web by controlling it. If you're in one of those companies... start up your own wiki and beg for forgiveness later.
Re: (Score:3, Funny)
(I'm sold anyway)