Indexing and Searching Text and PDF Repositories?
foobar104 asks: "My company is trying to set up a permanent archive of PDF documents, lots and lots of PDF documents. The goal is to be able to collect a ton of documents and assemble them into a directory structure, then burn the whole tree to a DVD-ROM (or similar medium) for permanent storage. Of course, 4 GB of PDFs would be pretty useless if you couldn't find what you were looking for, so I want to include one or more indices and a search program on the disk. I wanted to put Glimpse to the task, but it looks like somewhere along the line the Glimpse folks turned commercial, so they're no longer my first choice. (Unless there's still a source for Glimpse 4.0 source code?) Search tools like ht://dig and such are fine, but they're designed to index hyperlinked web content, and this collection isn't organized that way. Are there any open-source tools, with a command-line interface, for indexing and searching plain-text and PDF documents that aren't tied to HTML/HTTP?"
Lots of PDF search engines (Score:1)
I'm not a huge fan of PDF in general, and it's tricky to index because the files tend to be big, so you can get word co-occurrences that don't actually have much to do with one another.
That said, I've found a lot of PDF search engines; my current favorite is PDFWebSearch, which is smart about PDF pages and should handle local file system indexing as well as web site crawling. I have a whole page about PDF search engines, including links to all the ones I know about, at the SearchTools PDF searching report [searchtools.com].
If you want to get in touch with the Glimpse people, I have a page [searchtools.com] about that too.
Avi
Re:Reply to my own question: non-HTTP indexing (Score:1)
If ht://Dig does most of what you want, why not make it understand file:// URLs if it doesn't now? I wouldn't expect it to be _that_ difficult an addition, especially if you are concerned about the availability of source code.
Otherwise, why not just pay up the US$2k and buy a Glimpse site license? US$2k is probably worth about _two_ days of your time to the company (depending on the size of the company). That would cover everything as long as you kept it to one site.
Glimpse also appears to be available with source too, check out the bottom of the download [webglimpse.net] page.
If the tool is to be used to create a commercial product that you are going to sell, then you won't be covered by that. Personally, if I were going to use a GPL'ed (or other open source) tool as the core of a product I was going to sell, I would feel ethically bound to pay some money for it, regardless of its availability for "free".
Jason Pollock
Convert PDF files to text, index that (Score:1)
I'm currently doing this, and it works pretty well. I use pstotext [compaq.com], run by a perl script of my creation that makes sure every .pdf file has a corresponding .pdf2.html file (HTML, txt, doesn't matter -- both are easily indexed).
I then use glimpse [webglimpse.org] to index those .html (or .txt, again, it doesn't matter) files. Once I have the search results, a perl script merely replaces .pdf2.html with .pdf, and it all works fine. Yes, there's an extra file on the file system, one for each PDF, but nobody notices.
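The convert-then-index loop described above can be sketched as a small shell script. This is a sketch of the approach, not the poster's actual perl script: it assumes pstotext and glimpseindex are on the PATH, and the .pdf2.txt sidecar naming is this example's choice (the poster uses .pdf2.html):

```shell
#!/bin/sh
# Sketch: keep a text sidecar for every PDF, then index the tree
# with glimpseindex. Tool availability is checked so the script
# degrades gracefully where pstotext/glimpse aren't installed.
ARCHIVE="${1:-./archive}"
INDEXDIR="${2:-./glimpse-index}"

pdf_to_txt_name() {
    # manual.pdf -> manual.pdf2.txt (sidecar text file)
    printf '%s\n' "${1%.pdf}.pdf2.txt"
}

if [ -d "$ARCHIVE" ]; then
    find "$ARCHIVE" -name '*.pdf' | while read -r pdf; do
        txt="$(pdf_to_txt_name "$pdf")"
        # Only (re)convert when the sidecar is missing or stale.
        if [ ! -f "$txt" ] || [ "$pdf" -nt "$txt" ]; then
            command -v pstotext >/dev/null 2>&1 && pstotext "$pdf" > "$txt"
        fi
    done
    # -H keeps the index files in their own directory, so they can
    # be burned onto the DVD alongside the documents themselves.
    if command -v glimpseindex >/dev/null 2>&1; then
        mkdir -p "$INDEXDIR"
        glimpseindex -H "$INDEXDIR" "$ARCHIVE"
    fi
fi
```

At search time, `glimpse -H <indexdir> pattern` returns hits in the sidecar files, and the same substitution in reverse (strip the `2.txt`) maps each hit back to its PDF.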
Just so there's no confusion, glimpse is a tool for indexing text files on a file system. webglimpse is a tool for indexing a web site -- and it uses glimpse in the `back end'. Ultimately, since a web site often looks like a file system, the two problems -- indexing a file system and indexing a web site -- are very similar.
glimpse is not free for commercial use, so you may want to use another tool. swish++ [mac.com] comes to mind, but I've not done much with it.
Reply to my own question: non-HTTP indexing (Score:1)
With all respect to everybody who has replied so far, I'd just like to take a second to emphasize something that I probably could have made clearer in my question.
The PDF collection I want to index and search is not available as a web site that can be spidered with HTTP. As far as I can tell from looking at the docs, ht://dig (for example) is strictly a web-site-spidering tool, and that's not what we need.
I'm looking for a tool to help me index a set of files -- not web pages -- then search the index after the content has been committed to DVD-ROM (or some similar medium). ht://dig is not appropriate for doing that, as best I can tell.
Thanks, but maybe we could try to focus on tools that index files via filesystem interfaces, rather than HTTP calls.
Open Directory (Score:2)
MS Index Server (Score:2)
The MS Index Server supports indexing PDF files with an extension from Adobe. These files can be anywhere -- not necessarily on a web site or linked to by web pages. However, it is quite easy to search the index (which supports free-text searching) via the web. It's a fairly easy solution to set up, too....
Google (Score:2)
PLWeb Turbo, PLWeb-CD, Personal Librarian (Score:2)
The main problem I see is that it is a) no longer currently developed and b) platform-limited. However, it is a great product.
I'd switch if I could find another search engine that would do simple things like phrase/near searching, fielded searches, and so on. The main problem is that most search engines are built for web-style searches, where you don't need much control over the query. Perhaps some of the other readers will be able to help.
I've thought several times about just writing a new search engine. Then I think about what is really involved..... :)
I have just the thing (Score:2)
It looked like they were publishing everything in their company in PDF, and avoiding other office document formats entirely. In any case, Fulcrum apparently supports just about everything you asked for.
It comes with a desktop app which everyone was previously using to perform the searches. All documents were categorized by metadata such as date, creator, maintainer, and a whole tree of categories.
My job in this case was to integrate the search capabilities into their intranet site, a task which was surprisingly simple once I was able to track down the documentation and libraries (of the correct version, mind) I needed. Seemed to be a pretty powerful product, if managed well. I have absolutely no idea how much it cost or exactly how it worked, I just interfaced with it. However, the documents were available at the file level by network shares, so it sounds like it'll do what you ask.
ht://Dig is what you are looking for (Score:5)
And besides being flexible, ht://Dig is fast.
4.9. How do I index PDF files?
http://www.htdig.org/FAQ.html#q4.9 [htdig.org]
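For reference, that FAQ entry amounts to registering an external parser for the PDF MIME type in htdig.conf, so PDFs get converted to indexable text on the fly. A rough sketch of the relevant lines -- the converter-script path is hypothetical, and the exact recipe varies by ht://Dig version, so check the FAQ:

```
# htdig.conf (sketch): hand application/pdf documents to an
# external converter that emits text/html for the indexer.
external_parsers: application/pdf->text/html /usr/local/bin/doc2html.pl

# PDFs are often large; raise the per-document size cap so
# they aren't truncated before parsing.
max_doc_size: 5000000
```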