Unix Operating Systems Software

Indexing and Searching Text and PDF Repositories?

foobar104 asks: "My company is trying to set up a permanent archive of PDF documents, lots and lots of PDF documents. The goal is to be able to collect a ton of documents and assemble them into a directory structure, then burn the whole tree to a DVD-ROM (or similar medium) for permanent storage. Of course, 4 GB of PDFs would be pretty useless if you couldn't find what you were looking for, so I want to include one or more indices and a search program on the disk. I wanted to put Glimpse to the task, but it looks like somewhere along the line the Glimpse folks turned commercial, so they're no longer my first choice. (Unless there's still a source for the Glimpse 4.0 source code?) Search tools like ht://dig and such are fine, but they're designed to index hyperlinked web content, and this collection isn't organized that way. Are there any open-source, command-line tools around for indexing and searching plain-text and PDF documents that aren't tied to HTML/HTTP?"
  • I'm not a huge fan of PDF in general, and it's tricky to index because the files tend to be big, so you can get word co-occurrences where they don't have much to do with one another.

    That said, I've found a lot of PDF search engines; my current favorite is PDFWebSearch, which is smart about PDF pages and should handle local file system indexing as well as web site crawling. I have a whole page all about PDF search engines, including links to all the ones I know about, at the SearchTools PDF searching report [searchtools.com].

    If you want to get in touch with the Glimpse people, I have a page [searchtools.com] about that too.

    Avi

  • If ht://Dig does most of what you want, why not make it understand file:// URLs if it doesn't already? I wouldn't expect it to be _that_ difficult an addition, especially if you are concerned about the availability of source code.

    Otherwise, why not just pay up the US$2k and buy a Glimpse site license? US$2k is probably worth about _two_ days of your time to the company (depending on the size of the company). That would cover everything as long as you kept it to one site.

    Glimpse also appears to be available with source; check out the bottom of the download [webglimpse.net] page.

    If the tool is to be used to create a commercial product that you are going to sell, then you won't be covered by that. Personally, if I were going to use a GPL'ed (or other open source) tool as the core of a product I was going to sell, I would feel ethically bound to pay some money for it, regardless of its availability for "free".

    Jason Pollock
  • The easiest way I've found to do this is to convert your PDF files into a `text file approximation' of the original PDF file and index that.

    I'm currently doing this, and it works pretty well. I use pstotext [compaq.com] run by a perl script of my creation that makes sure that every .pdf file has a corresponding .pdf2.html file (html, txt, doesn't matter -- both are easily indexed.)

    I then use glimpse [webglimpse.org] to index those .html (or .txt, again, doesn't matter) files. Once I have the search results, a perl script merely replaces .pdf2.html with .pdf, and it all works fine. Yes, there's an extra file on the file system, one for each PDF, but nobody notices. (A rough sketch of this pipeline is at the end of this comment.)

    Just so there's no confusion, glimpse is a tool for indexing text files on a file system. webglimpse is a tool for indexing a web site -- and it uses glimpse in the `back end'. Ultimately, since a web site often looks like a file system, the two problems -- indexing a file system and indexing a web site -- are very similar.

    glimpse is not free for commercial use, so you may want to use another tool. swish++ [mac.com] comes to mind, but I've not done much with it.
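
    Roughly, the conversion step looks like the sketch below -- a sketch only, in Python rather than perl. It assumes pstotext is on the PATH and prints extracted text on stdout, and the .pdf2.txt sidecar name just illustrates the .pdf2.* convention described above.

      #!/usr/bin/env python
      # Sketch of the convert-then-index approach: give every .pdf a
      # plain-text sidecar, then index the sidecars and map hits back.
      import os
      import subprocess
      import sys

      def make_sidecars(root):
          for dirpath, _dirs, files in os.walk(root):
              for name in files:
                  if not name.lower().endswith(".pdf"):
                      continue
                  pdf = os.path.join(dirpath, name)
                  sidecar = pdf + "2.txt"          # foo.pdf -> foo.pdf2.txt
                  # Regenerate only if the sidecar is missing or stale.
                  if (not os.path.exists(sidecar)
                          or os.path.getmtime(sidecar) < os.path.getmtime(pdf)):
                      with open(sidecar, "wb") as out:
                          subprocess.run(["pstotext", pdf], stdout=out, check=False)

      def to_pdf(hit):
          # Turn a search hit on a sidecar back into the original PDF path.
          return hit[:-len("2.txt")] if hit.endswith(".pdf2.txt") else hit

      if __name__ == "__main__":
          make_sidecars(sys.argv[1])
          # The sidecar tree can now be fed to glimpseindex (or any plain-text
          # indexer); rewrite its hits with to_pdf() before showing them.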

  • With all respect to everybody who has replied so far, I'd just like to take a second to emphasize something that I probably could have made clearer in my question.

    The PDF collection I want to index and search is not available as a web site that can be spidered with HTTP. As far as I can tell from looking at the docs, ht://dig (for example) is strictly a web-site-spidering tool, and that's not what we need.

    I'm looking for a tool to help me index a set of files -- not web pages -- then search the index after the content has been committed to DVD-ROM (or some similar medium). ht://dig is not appropriate for doing that, as best I can tell.

    Thanks, but maybe we could try to focus on tools that index files via filesystem interfaces, rather than HTTP calls. Something along the lines of the sketch below is what I have in mind.
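
    To make that concrete, the sketch below is roughly the shape of tool being asked for: build a word index of the text files in a directory tree, store it on the disc next to the documents, and query it later from the command line with no web server anywhere. It is only an illustration; the names and the pickle format are made up for the example.

      #!/usr/bin/env python
      # Minimal illustration of a standalone filesystem indexer/searcher:
      # no HTTP anywhere, just files in and a word -> files index out.
      import os
      import pickle
      import re
      import sys

      WORD = re.compile(r"[A-Za-z0-9]+")

      def build_index(root):
          # Map each lowercased word to the set of files containing it.
          index = {}
          for dirpath, _dirs, files in os.walk(root):
              for name in files:
                  if not name.endswith(".txt"):   # index text (or extracted text) only
                      continue
                  path = os.path.join(dirpath, name)
                  with open(path, errors="replace") as fh:
                      for word in WORD.findall(fh.read().lower()):
                          index.setdefault(word, set()).add(path)
          return index

      if __name__ == "__main__":
          cmd, root = sys.argv[1], sys.argv[2]
          index_file = os.path.join(root, "index.pkl")
          if cmd == "index":                      # indexer.py index /archive
              with open(index_file, "wb") as fh:
                  pickle.dump(build_index(root), fh)
          elif cmd == "search":                   # indexer.py search /archive term ...
              with open(index_file, "rb") as fh:
                  index = pickle.load(fh)
              hits = None
              for term in sys.argv[3:]:
                  found = index.get(term.lower(), set())
                  hits = found if hits is None else hits & found
              for path in sorted(hits or []):
                  print(path)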

  • The retrieval section is here [dmoz.org].
  • I know this isn't the free search solution you are looking for (although IIS and index server are free), but it may be useful to others.

    The MS Index Server supports indexing PDF files with an extension from Adobe. These files can be anywhere, not necessarily on a web site or linked to by web pages. However, it is quite easy to search the index (which supports free-text searching) via the web. It's a fairly easy solution to set up, too.


  • I see that Google is now indexing PDFs. Maybe you can check out what they are using.


  • I use PLWeb Turbo (http://www.pls.com/ [pls.com]), which is free. There is also a version for burning CDs.

    The main problem I see is that it is a) no longer actively developed and b) platform-limited. However, it is a great product.

    I'd switch if I could find another search engine that would do simple things like phrase/near searching, fielded searches, and so on. The main problem is that most of the search engines are good for doing web-like searches where you don't need a lot of control over the search. Perhaps some of the other readers will be able to help.

    I've thought several times about just writing a new search engine. Then I think about what is really involved..... :)

  • When working as a consultant, I came across a company which was using a product called 'Fulcrum KnowledgeServer', made by Hummingbird [hummingbird.com]. I can't go into too many details because of an NDA, but let's just say they had what appeared to be hundreds of thousands of documents in PDF all over their file servers.

    It looked like they were publishing everything in their company in PDF, and avoiding other office document formats entirely. In any case, Fulcrum apparently supports just about everything you asked for.

    It comes with a desktop app which everyone was previously using to perform the searches. All documents were categorized by metadata such as date, creator, maintainer, and a whole tree of categories.

    My job in this case was to integrate the search capabilities into their intranet site, a task which was surprisingly simple once I was able to track down the documentation and libraries (of the correct version, mind) I needed. Seemed to be a pretty powerful product, if managed well. I have absolutely no idea how much it cost or exactly how it worked; I just interfaced with it. However, the documents were available at the file level by network shares, so it sounds like it'll do what you ask.
  • We index dozens of gigs of TXT, HTML, PDF, XLS, DOC and PS. Not 100% of the documents get indexed, but that's a parser problem with some of the files (a few PDF, XLS, DOC and PS files seem to make their parsers choke).

    And besides being flexible, ht://Dig is fast. (A rough idea of the PDF setup is sketched below.)

    4.9. How do I index PDF files?
    http://www.htdig.org/FAQ.html#q4.9 [htdig.org]
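
    For reference, that FAQ entry boils down to handing non-HTML documents to an external converter before indexing. Below is a rough htdig.conf sketch; the attribute values and the conv_doc.pl contrib script (a wrapper around pdftotext/acroread) should be checked against the FAQ and your ht://Dig version.

      # Let large PDFs through, then hand PDF and PostScript to an
      # external converter so their text ends up in the index.
      max_doc_size:     5000000
      external_parsers: application/pdf->text/html /usr/local/bin/conv_doc.pl \
                        application/postscript->text/html /usr/local/bin/conv_doc.pl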

"What man has done, man can aspire to do." -- Jerry Pournelle, about space flight

Working...