Best Way to Build a Searchable Document Index? 216
Blinocac writes "I am organizing the IT documentation for the agency I work for, and we would like to build a searchable document index that renders results based on meta tags placed in the documents, which range from Word files and HTML to Excel, Access, and PDFs." What methods or tools have others seen that work? Anything to avoid?
Gee I don't know.. (Score:4, Funny)
Re: (Score:3, Interesting)
Re: (Score:3, Funny)
Lucene (Score:5, Informative)
Re:Lucene (Score:5, Informative)
Re:Lucene (Score:4, Informative)
Re: (Score:2)
Re: (Score:2)
Re:Lucene (Score:5, Informative)
I've got a prototype of the system described in the OP that we did while quoting a fairly large project. It's really easy to have an 'after upload' action that pushes the document through strings (or some other third-party app that can do the same for the given document type) and throws the output into a field that gets indexed as well. That pretty much handles everything you may need.
Obviously I'd also allow someone to specify keywords when uploading a document, but if this engine's going to just be thrown against an existing cache of documents, strings-only's the way to go.
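For the curious, that 'after upload' action is nothing fancy. A minimal sketch in shell (the argument convention and output path are my assumptions, not a real product's):

```shell
#!/bin/sh
# Sketch of an "after upload" hook: extract whatever printable text
# the uploaded document contains and drop it next to the file, so a
# later indexing pass can pick it up. $1 and the output filename are
# illustrative assumptions.
DOC="$1"
strings "$DOC" | tr -s '[:space:]' ' ' > "$DOC.extracted.txt"
```

You'd wire this into whatever upload handler you already have; the point is just that strings doesn't care what the binary format is.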
Re: (Score:3, Informative)
DocSearcher - http://docsearcher.henschelsoft.de/ [henschelsoft.de] - already does it. A friend with the US Coast Guard wrote it 4+ years ago, I deployed it within the Department of Justice for a few projects, and it's pretty widely used among some of the local tech circles. It even plugs into Tomcat if you want a web-based UI.
Re: (Score:2)
In my situation, I've got a couple dozen servers (mostly Windows, but some Linux), and maybe 8TB of data, mostly in Word documents, Excel spreadsheets, etc. Can Lucene (or Nutch) scale up to something like that? I'd also like it to search Windows network drives. Is that possible?
Re: (Score:3, Informative)
Re:Lucene (Score:5, Informative)
Yes, they have. In my previous job we had to search 2 terabytes of plain-text data (HTML) really fast. The company chose Autonomy, and many developers spent many months trying to make it work, consuming insane amounts of hardware resources for mediocre results. One lone (and brilliant) dev whipped up a Lucene proof of concept in a weekend, and it was faster (full index in a day), required fewer resources (a single HP DL585 with 16 GB RAM and 4 dual-core AMDs, as opposed to 10 of the same), had a smaller index (about a fifth of Autonomy's), returned results faster, produced a more accurate result set, and was significantly more flexible in making it do what we actually needed it to do.
Lucene wins hands down
Re:Lucene (Score:5, Funny)
Re: (Score:3, Interesting)
What is more complicated is to deal with large numbers of concurrent requests. Then you need clustering. There are big sites that do both partitioning and clustering simultaneously with Lucene. I seem to recall reading that Technorati uses Lucene on a c
Re: (Score:3, Informative)
Search software (Score:3, Informative)
Terrier - LINK [gla.ac.uk]
Indri/Lemur - LINK [lemurproject.org] / LINK [lemurproject.org]
MG - LINK [mu.oz.au]
Re: (Score:2)
Google (Score:3, Informative)
If you value your privacy, invest in a Google mini, though.
Re:Google (Score:5, Informative)
We even pointed it at the web-cvs server and bugzilla and it was great at searching those too.
To see all the bugs still open against v2.2.1 or something like that, Bugzilla's own search was better, but for searching for "bugs about X" the Google Mini was great.
It only cost something like $3k, IIRC.
Not exactly what you asked about, but you should definitely see if this wouldn't work for you instead.
Re: (Score:3, Interesting)
Re:Google (Score:4, Informative)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2, Insightful)
Meta tags placed? (Score:4, Insightful)
Generally, Lucene [apache.org] does a good job. It's easy to learn and performance was fine for me and my data (~ 2 GB of textual documents).
Meta tags are worthless, generally (Score:4, Insightful)
DON'T TRUST USERS TO ENTER META DATA!!!
I've worked in electronic document management at 3 different businesses, and metadata entered by end users is worse than worthless: it is wrong. Searches that don't use full text for general documents are less than ideal.
Just to prove that your question is missing critical data:
- how many documents?
- how large are the average and largest documents?
- what format will be input? PDF, HTML, XLS, PPT, OO, C++, what?
- what search tools do you use elsewhere?
- any budget constraints?
- did you look at general document management systems? Documentum, Docushare, Filenet, Sharepoint? If so, what didn't work with these systems?
- Did you consider OSS solutions? htdig, e-swish, custom searching?
- A buddy of mine wrote an article on "how to index anything" that was in the Linux Journal a few years ago. Google is your friend.
AND if I didn't get this across yet - DON'T TRUST METADATA IN HIDDEN DOCUMENT FIELDS - bad metadata in MS Office files will completely destroy the usefulness of your searches.
Re: (Score:3, Informative)
Unless you can pin responsibility for a document to a named person, you can't trust anything in the document. Not metadata, not content, not presentation.
Re: (Score:2, Insightful)
I don't understand the requirements.
I don't either, and that's because the submitter didn't give enough information. I'm working on a fairly large enterprise content management system for the feds (think 2.5 TB/month of new data), and I don't see any of the solution components we use mentioned in any thread yet. If I were being a responsible consultant, I'd want to know the answers to the following questions at minimum before making any recommendations:
Re: (Score:2)
Google Desktop or Applicance (Score:4, Insightful)
The inexpensive Google appliances don't have very fine-grained access control, though. I've been involved in several semi-failed projects of this nature in my organization, both new and legacy, and my Google Desktop outperforms all of them.
Re: (Score:2)
Re: (Score:3, Informative)
I set up a Google Mini for indexing an internal wiki, our bug tracking system, and some other systems, and it is very straight-forward.
I know the original question mentioned meta-data, but you have to ask yourself if the meta-data is going to be maintained well enough that the search index will be valid. Going the Google Appliance route is so much simpler. It takes a bit of tweaking to set up the search res
Re: (Score:2, Interesting)
Meta-data is one of those things that seems like a really good idea, but like all plans, doesn't tend to survive contact with the enemy, which in this case is the user.
Like software development, the quality of the outcome is implementation dependent.
We run six thesauri, plus a number of different controlled lists for our users to input metadata. We don't publish any documents that don't have meta attached, and we perform random quality audits. Our users have been trained in fundamentals of classification and also in the payoff for getting it right.
We use metadata to structure our navigation for some sites, we depend on it for search for our internal documents. Our m
Personal GSA experience (Score:2)
The two exceptions are the way it handles secured documents (on our mostly-Windows network, that meant authenticating twice or doing complicated Kerberos stuff), and hardware (we've had two boxes fail with drive issues in the last year).
Still, when it comes to search results and speed, it's been very good. I'm also a fan of Google Desktop, but that's a completely different story and more difficult to centrally manage.
Re: (Score:2)
Swish-E (Score:3, Informative)
Re: (Score:2)
I don't think it handles Unicode, either.
Re: (Score:2)
And yes, it does not support UTF-8/Unicode/anything-non-ASCII-8 [swish-e.org].
But the developer list is quite active and responses are usually accurate (though they also can be terse and sometimes overly-authoritative).
Beagle, Spotlight? (Score:3, Insightful)
Re: (Score:2)
All depends on the document filters ... (Score:2)
BOFH (Score:2, Funny)
XML document formats (Score:3)
Re: (Score:2)
Re: (Score:2)
Google corporate search (Score:2)
what to avoid (Score:2, Insightful)
Re: (Score:2)
Evil, Evil, Evil!
At least the part that's not brain-damaged.
Re: (Score:2)
Most easy solution (Score:4, Informative)
You're not even really required to use added tags (as most people will put in poor tags anyway).
But if you like, you can still add tags with SharePoint.
Re:Most easy solution (Score:4, Informative)
I do a lot of LAMP development, and I'm not the strongest fan of Microsoft for a lot of things, but if you have an MS desktop and MS Office environment, SharePoint Services really is quite decent for INTRANET applications, especially for collaboration. You can set up workflows for check-out/check-in, and it integrates really nicely with some of the more recent MS Office releases. If you connect it to a real MS SQL Server on the back end (as opposed to the Express edition it defaults to), you can have full-text indexing even with the free SharePoint Services version. You only need the full-blown Portal/MOSS version if you think you are going to have a large number of SharePoint sites and want to simplify cross-connecting and management. (At least as far as I can recall.)
I'm not saying SharePoint is the way to go, but I'd at least read up on it and consider it IF you have a lot of MS Office stuff that you plan on indexing/sharing.
I'd strongly advise avoiding it if you plan to do Internet-facing stuff, though... at least until you have a good enough understanding of the security issues involved that you feel you really know what you're doing.
Just my $0.02 worth.
Re: (Score:2)
Re: (Score:2)
It has basic search built in, and there is an upgrade path. IFilters are built in for the MS formats, and there are third-party ones for non-MS formats.
One thing - Office Sharepoint designer essentially doesn't work. Just don't bother wi
Re: (Score:2)
I've never seen a SharePoint site where the search worked well at all (particularly in the document libraries.) You might think by its observable behavior that it is simply offering up documents at random instead of searching their contents.
Re: (Score:3, Informative)
Re: (Score:2)
Or build your own app on top of a Microsoft SQL Server 2005 with Full Text Search
Technet Article [microsoft.com]
No need for tags... let the document itself be the tags...
The free-as-in-beer Express edition (with the advanced blah-blah) is limited to 2 GB, however, so you will need at least the Standard edition.
Upload it to the web (Score:2, Funny)
Check out Alfresco! (Score:3, Interesting)
Alfresco also has plugins for Microsoft Office so you can manage the repository from Word, etc. They are also working on OpenOffice integration.
Don't use SAMBA for
I am not affiliated with Alfresco, just a happy user.
Re: (Score:2, Informative)
I notice a lot of the comments in the thread are coming from developers or sysadmins who want to solve everything with libraries or command line tools. But it really sounds to me like you need a reasonable document management system (and of course being a slashdot reader you want it for free).
Again, I'm not affiliated with Alfresco, but did quite a bit of research into open source DMS's that would run in a java environment for a couple of recent projects. I found Alfresco
Re: (Score:2)
Also see Xapian (Score:5, Informative)
http://en.wikipedia.org/wiki/Full_text_search [wikipedia.org]
If you're not afraid to do a little reading and potentially coding a custom front end, you may want to look at two of the big open source engines: Lucene and Xapian.
Lucene is quite popular now, and is an Apache Java project. It's a good choice if you're a Java shop.
Xapian seems to be based on somewhat more solid and modern information-retrieval theory, and it is incredibly scalable and fast. It's written in C++, with SWIG-based front ends for many languages. It might not have as polished a front end or as fancy a website as Lucene, but I believe it's a better choice if you have really huge data sets or want to venture outside the Java universe.
There are also many other self-contained indexers, most of which are geared toward web indexing (they bundle spiders, query forms, etc. all together), like ht://Dig, mnoGoSearch, and so forth. They are good, especially if you want more of a drop-in solution rather than a raw indexing engine, and if you're indexing websites (and not complex entities like databases, etc.).
Re: (Score:2)
Re: (Score:2, Informative)
I agree that Lucene is a great choice, especially for Java shops, but it does have ports for pretty much all major languages. The Java implementation is the 'mothership', but you can use Lucene with PHP, Python, .NET, C++, or whatever.
Secondly, I'd like to point out Lemur [lemurproject.org]. It's an indexing engine similar to Lucene, but geared much more toward the language-modeling approach to information retrieval. IR approaches use either a vector-space model or a language model. Lucene does vec
Depends on size of document base (Score:3, Insightful)
Disclaimer: I flog Google search solutions at work, so I'm way biased.
Re: (Score:2)
use a Wiki instead (Score:5, Insightful)
If you work somewhere where people are not trained to attach Office docs to every email, where people don't use Word to compose 10 bullet points, where people don't use a spreadsheet as a substitute for all sorts of CRM and business applications... a Wiki is actually a good solution.
You can use something like MediaWiki or Twiki or... heck you can use a whole variety of content management systems.
The key to success is to *EMPOWER* people to actually update information, and have a few people who are empowered to actually edit, rehash, sort, move, prune wiki pages and content. As the content improves, it will draw in more users and more content creators. Pretty soon, employees will *COMPLAIN* when someone sends out information and doesn't update the wiki.
Some corporate cultures are not wiki-friendly. Some management chains *fear* the wiki. Some companies have whole webmaster groups who believe it is their job to delay the process of getting useful content onto the web by controlling it. If you're in one of those companies... start up your own wiki and beg for forgiveness later.
Re: (Score:3, Interesting)
Re: (Score:2)
True hackers (Score:2)
FileNet (Score:2)
The downside to the suggestions like Google Appliances is that you're then storing this information on Google servers.... something that most companies find HIGHLY objectionable (security).
Re: (Score:2)
I'll plug my software (Score:2)
$$$ - Universal Content Management (Score:2)
At work (and granted, it's a fair-sized HMO) we use Universal Content Management by Oracle, formerly known as Stellent's Content Management System.
UCM allows for named accounts with control over access, plus a full audit log, full chan
Google (Score:2)
We got a Google "appliance" and the damn thing just works, and works well. I don't work for Google, nor do I get paid if they make a sale. Just saying what worked great for us.
Swish++, HyperEstraier (Score:2)
cough cough (Score:2)
Lucene is the way to go. There are APIs for dealing with Lucene data sets in Perl and many other languages as well. Nutch is a good place to start getting to know the power of Lucene: you can get a Nutch crawler interface up and running quickly, and you can browse through some of the source files to get an understanding of how to bring in various file formats - Office documents, PDFs, etc.
The Google Search boxes are decent, bu
I do this in several programming languages (Score:3, Informative)
A good tool for getting plain text out of various versions of Word documents is the "antiword" command line utility.
The Apache POI project (Java) can read and write several Microsoft Office formats.
For indexing: I like Lucene (Java), Ferret (Ruby+C), and Montezuma (Common Lisp).
I have mostly been using Ruby the last few years for text processing. Here is a short article I wrote using the Java Lucene library using JRuby:
http://markwatson.com/blog/2007/06/using-lucene-with-jruby.html [markwatson.com]
Here is another short snippet for reading OpenOffice.org documents in Ruby:
http://markwatson.com/blog/2007/05/why-odf-is-better-than-microsofts.html [markwatson.com]
---
You might just want to use the entire Nutch stack:
http://lucene.apache.org/nutch/ [apache.org]
It's a stack that collects documents, spiders the web, has plugins for many document types, etc. Good stuff!
If money's not an object... (Score:2, Informative)
Install Wumpus Search (Score:2)
Re: (Score:3, Informative)
Re: (Score:2)
IBM OmniFind (Score:2)
But for small shops where there is no money to throw at this type of thing, try IBM OmniFind Yahoo! Edition. Can't beat the price.
http://omnifind.ibm.yahoo.net/index.php [yahoo.net]
You may be asking the WRONG question... (Score:2)
on Spaghetti Sauces, which refers to Moskowitz's idea that:
there isn't a (one) best, only bestS;
e.g., best spaghetti sauceS, etc.
Some find one best, others find another best [for them]...
Whatcha think?
Access Control and Searching (Score:2)
This allowed the full text to be searchable and provided a reference back to the original file. If it was in a digital format, like a Word document, it was also stored in the database. If it wasn't, it referenced a physical file. The user c
Hack Something Together? (Score:2)
It's a little slow: 10 seconds to look up all instances of a symbol. And it takes ~3 hours to refresh the full index.
But it saves an enormous amount of time, makes impossible tasks possible, and I have used it every da
An answer and a question - metadata tagging? (Score:2)
But this raises my question - how do you enforce metadata tagging? I can't even find a decent Windows-based metadata browser/editor for after-the-fact b
Grep and strings (Score:2)
1/2 joking 1/2 serious.
grep -R "foobar" /
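To drop the joking fraction a bit: plain grep won't see text inside binary Office files, so you'd want to run strings first. A sketch, where the directory and search term are illustrative:

```shell
#!/bin/sh
# List every file under a directory whose extracted printable
# strings contain the search term (case-insensitive). The path
# /srv/docs and the term "foobar" are examples only.
find /srv/docs -type f -exec sh -c \
  'strings "$1" | grep -qi "foobar" && printf "%s\n" "$1"' _ {} \;
```

Slow on 8 TB, obviously, but for a one-off hunt it beats opening documents by hand.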
htDig (Score:2)
Google search (Score:2)
Or, you know, you could add meta data to each and every single page you want to index... I'd personally rather stab my eyes out with a ballpoint pen.
There are lots of choices... (Score:2)
Other people have said most of this already, but the ones I've seen used the most are:
So, what are the requirements? (Score:2)
Where I work, we've used htdig, Verity K2 and Google search appliance, and have looked at (and heard good things about) Lucene.
Which one I'd recommend would depend entirely on the answers to my questions.
Managing Gigabytes (Score:2)
To get more info, including a peek into the book, do a Google search on "Managing Gigabytes".
OTOH, for something cheap and cheerful there is htdig.
http://htdig.org/ [htdig.org]
It's remarkably good for indexing an intranet.
How about Oracle Text (Score:2, Interesting)
You might try Oracle Text (also part of Oracle XE).
It supports 140 document formats, has a lot of options, and works via SQL.
It can build indexes for documents stored in the DB or in the file system.
You can even join the search terms from the document with the database records where metadata might be stored by your application.
I found that very helpful in similar projects. And it's free.
Look at how you will access the docs first (Score:3, Informative)
Full-text indexing allows users to search the entire contents of documents, but the results are imprecise and voluminous and not terribly useful in most cases (think web search engines here). Yes, you can find all documents that contain the word "patent", but you get a lot of old references to patent leather shoes in addition to what you were probably after. So, with full-text search you get it all, but force the user to subsearch for what they really want.
Using meta tags gives the appearance of pre-classifying documents, and having the users do it themselves means you don't have to have a dedicated person assign the tags. The disadvantage is that everybody makes up their own tags, or, if you have a standard set, you have to rely on people being diligent about applying them. And tag popularity can easily change over time. For example, if you want to find docs that refer to "removable media", this might have garnered a "floppy" tag 15 years ago and "CD" or "DVD" today. You are therefore almost guaranteed to miss some documents using this method.
Database indexing means that you list all your docs in a database, perhaps by title, author, date, or other fields that your users would find useful for searching. The advantage is that every document is indexed the same way, searching is really fast, and the results are usually relevant if your schema is meaningful. The disadvantages are that indexing the docs takes work on input and users need to know how to search to get the best results.
Finally, you could organize the docs by simple name and folder. This works fine for the desktop and users usually can identify the category that points them to the folder they want. The disadvantage is that this only works well for limited document sets. Once you start getting hundreds of categories and thousands and thousands of documents, things become too hard to find.
So - understand your users' search requirements and the size of your expected database. Only then can you make an informed decision about how to create and index the repository.
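To make the database-indexing option concrete without standing up a real RDBMS, here's a toy flat-file version in shell. Every field name, value, and path below is made up for illustration; in production you'd use an actual database, but the shape of the query is the same:

```shell
#!/bin/sh
# Toy document index: one tab-separated row per document, holding
# only the fields users actually search on (title, author, date,
# path). All rows here are illustrative sample data.
IDX=/tmp/docindex.tsv
printf 'Patent filing checklist\tjsmith\t2007-06-01\t/docs/patent.doc\n' >  "$IDX"
printf 'Holiday schedule\thr\t2007-01-15\t/docs/holidays.xls\n'          >> "$IDX"

# Equivalent of: SELECT path FROM docs WHERE title LIKE '%patent%'
awk -F'\t' 'tolower($1) ~ /patent/ { print $4 }' "$IDX"   # prints /docs/patent.doc
```

The point of the exercise: because every document is indexed with the same fields, a field-scoped query like this is fast and precise, at the cost of someone filling the fields in correctly at input time.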
Re: (Score:3, Funny)
(I'm sold anyway)
Re: (Score:3, Informative)
I've deployed it across a handful of servers, and it does a good job of crawling, but it doesn't do well with JavaScript. If your web frontend uses JavaScript, you can write a shell script that runs find . -print, prepends the URL base to each path in a file, and points htdig at that file. It will dig into each file it finds and create a searchable database of everything it finds.
You add
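The find-and-prepend trick the parent describes might look like this (the docroot, hostname, and output path are all assumptions for illustration):

```shell
#!/bin/sh
# Build a start-URL list for htdig from files already on disk,
# by rewriting filesystem paths into the URLs they're served at.
# DOCROOT, PREFIX, and the output file are illustrative.
DOCROOT=/var/www/docs
PREFIX=http://intranet.example.com/docs
find "$DOCROOT" -type f -name '*.html' -print \
  | sed "s|^$DOCROOT|$PREFIX|" > /tmp/htdig_urls.txt
```

Feed the resulting file to htdig as its start-URL list and it will fetch and index each page directly, sidestepping the JavaScript navigation it can't crawl.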
Re:Easy off the shelf (Score:2, Informative)
Re: (Score:2)