Open Source Analog to Microsoft's Index Server?

Posted by Cliff on Sunday July 07, 2002 @09:01AM from the anything-but-microsoft dept.

An Anonymous Coward asks: "I have been tasked by my noble employer to find a better way of accessing the 4,000-odd management documents and procedures we have. Currently MS Index Server is being used to provide a fairly good searching system. Index Server (for those that don't know) trawls through files and indexes their content. ASP is then used to search the resulting database. My question is: is there a way to do this with nice open source software? Does anyone know of any competitors to Index Server that can index microsoft office documents? Thanks!" Might not HT://dig be a good foundation on which to build such a system?
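
The crawl-and-index loop described in the question can be sketched in a few lines of Python. This is only an illustration of the general idea (an inverted index over plain-text files), not how Index Server or any engine mentioned in this discussion is implemented. The names `tokenize`, `build_index` and `search` are made up for the example, and it assumes the documents have already been converted to plain text.

```python
import os
import re
from collections import defaultdict

# Assumption for this sketch: a "word" is a run of ASCII letters or digits.
TOKEN = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return TOKEN.findall(text.lower())


def build_index(root):
    """Walk `root` and map each word to the set of files that contain it."""
    index = defaultdict(set)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    words = tokenize(handle.read())
            except OSError:
                continue  # unreadable file: skip it rather than abort the crawl
            for word in words:
                index[word].add(path)
    return index


def search(index, query):
    """Return the sorted list of files that contain every word in `query`."""
    words = tokenize(query)
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)
```

A real engine would also rank the hits and store the index on disk; the sketch keeps everything in memory and returns an unranked AND match.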

This discussion has been archived. No new comments can be posted.

Not Open Source, but... (Score:2, Interesting)

by Fifster ( 590419 ) writes:

It's not open source, but Sherlock for MacOS (part of the OS) has always featured hard drive or folder indexing features that can scan contents of documents fairly quickly and efficiently. I've not seen its performance on a /huge/ archive, though.
--Fifster
Google? (Score:5, Interesting)

by isorox ( 205688 ) writes: on Sunday July 07, 2002 @09:13AM (#3836475) Homepage Journal

Doesn't Google license their engine (which reads Word, PowerPoint, etc.)?

- Re:Google? (Score:4, Interesting)
  
  by Naikrovek ( 667 ) writes: <jjohnson@pWELTYsg.com minus author> on Sunday July 07, 2002 @09:30AM (#3836503)
  
  Yes: http://www.google.com/appliance/ [google.com]
  
  However, this is not open source, it is not free and doesn't at all meet the goals of the person asking the question.
  
  - Re:Google? (Score:1)
    
    by spencerogden ( 49254 ) writes:
    
    If I had to make a choice I would choose to deal with google rather than microsoft...
    - Re:Google? (Score:1)
      
      by malakai ( 136531 ) writes:
      
      The Google Search Appliance starts at US $28k. Index Server comes as part of Win2k, and not in bulk would be about 1/28th the cost.
- Re:Google? (Score:3, Funny)
  
  by 4/3PI*R^3 ( 102276 ) writes:
  
  Place all of your sensitive corporate documents on your web server then place a lot of links to the pages so that Google will index them. After this you can simply go to Google and type {search phrase} site:{company domain} filetype:{file type}.
  
  -- for the irony impaired this is humor --
  - Re:Google? (Score:2)
    
    by AirLace ( 86148 ) writes:
    
    -- for the English impaired this is irony --
two I tried (Score:3, Interesting)

by epine ( 68316 ) writes: on Sunday July 07, 2002 @09:28AM (#3836499)

I tried mnogosearch and swish-e. Different plusses and minuses. Later on I discovered that mnogosearch has a PHP front end and can be installed from a Debian package.

My advice is to set up two entirely different search databases. Otherwise it's very difficult to compare hits, ranking performance, or discovered differences in the lexeme policy.

- Re:two I tried (Score:1)
  
  by shanx24 ( 232938 ) writes:
  
  For the benefit of posterity, can you list any specific plusses and minuses you found while testing the two? Thanks.
Ironic (Score:5, Insightful)

by Anonymous Coward writes: on Sunday July 07, 2002 @10:30AM (#3836659)

I realize this is a little extreme, but what the heck:

Imagine if the poster's company didn't have all their documents in a proprietary format. They would have plenty of other indexing programs available to them.

And think, if that gigantic percentage of businesses didn't have their information trapped in a proprietary information format, there'd be even MORE solutions in the marketplace to choose from.

When you don't come up with a cheaper and quicker solution, be sure to let your boss know it has just a little something to do with a proprietary format on a proprietary platform sold by a monopolist.

Happy Sunday!

- Re:Ironic (Score:2)
  
  by david duncan scott ( 206421 ) writes:
  
  Hmm...I read the post again, and he doesn't actually say in what format the documents are stored. Maybe they're all ASCII text, and he just doesn't know any of those "plenty of other indexing programs available to them". Imagine if you could mention a couple, for his benefit and for the rest of us, and then go off about his employer's presumed document storage.
  I've always wondered whether M'soft themselves use Index Server. God knows the KnowledgeBase search is awful, and if they do eat their own dog food it's a terrible advertisement.
  - Re:Ironic (Score:2, Informative)
    
    by ericski ( 20503 ) writes:
    
    The original post does specifically state "microsoft office documents" so it is far from likely that they're ASCII text.
    - Re:Ironic (Score:2)
      
      by david duncan scott ( 206421 ) writes:
      
      You know what? You're right -- in that last sentence, he does say "microsoft office documents". I stand corrected, and withdraw my indignation and scorn (except towards the KnowledgeBase index -- that still sucks.)
      - Re:Ironic (Score:2)
        
        by jhoffoss ( 73895 ) writes:
        
        Yeah, but the guy still didn't say what you could use if the files _were_ all ASCII...
HT://Dig (Score:2)

by tzanger ( 1575 ) writes:

Have you ever tried using an HT://dig search? I despise that search tool on the basis that the results it throws back are not ranked all that well and (this is easily fixed) ugly.

It's been a while since I've checked it out, maybe it has improved.
- Re:HT://Dig (Score:1)
  
  by Parsec ( 1702 ) writes:
  
  and (this is easily fixed) ugly
  
  Agreed, I have wrapped my htdig in a PHP readfile( ); call and used some javascript in the htdig header and long templates to do alternating table row background colors (don't forget the <noscript><tr></noscript> as a fallback if you do this).
ASPSeek worked well for me.. (Score:5, Informative)

by maeglin ( 23145 ) writes: on Sunday July 07, 2002 @12:18PM (#3837015)

I had about 2GB of documentation dumped onto me for a project. The documentation had no visible structure, nor any obvious place to start tackling it, so I decided to just index it all. The documentation was on my Windows 2000 machine and I put ASPSeek [aspseek.org] (a GPL'd search engine that no one seems to know about) on one of my Linux workstations. I used pdftotext and word2txt as filters and let it chew through the documentation. The results were good enough that, when I left the project and shut down the ASPSeek interface, it took about 15 minutes before someone (who already had it all indexed on his Windows 2000 workstation) was at my desk trying to get me to turn it back on.
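
The filter step described above (run an external converter, then index whatever plain text comes out) can be sketched roughly as follows. This is not ASPSeek's own filter configuration, just the general pattern in Python. It assumes `pdftotext` and `catdoc` are installed and on the PATH; `catdoc` stands in for word2txt here because its command line is the one mentioned elsewhere in this discussion. `CONVERTERS`, `converter_command` and `extract_text` are hypothetical names.

```python
import os
import subprocess

# Converter commands keyed by file extension. Each command must write plain
# text to standard output.
# Assumption: `pdftotext FILE -` and `catdoc FILE` are available on this machine.
CONVERTERS = {
    ".pdf": lambda path: ["pdftotext", path, "-"],
    ".doc": lambda path: ["catdoc", path],
}


def converter_command(path):
    """Return the argv list that converts `path` to text, or None if no filter applies."""
    extension = os.path.splitext(path)[1].lower()
    make_command = CONVERTERS.get(extension)
    return make_command(path) if make_command else None


def extract_text(path):
    """Return the text of `path`, running an external filter when one is registered."""
    command = converter_command(path)
    if command is None:
        # No filter registered: treat the file as plain text.
        with open(path, encoding="utf-8", errors="ignore") as handle:
            return handle.read()
    result = subprocess.run(command, capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="ignore")
```

Keeping the extension-to-command table separate from the code that runs it means a new format only needs one more entry, which is roughly how the filter hooks in these search engines are used.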

Namazu and Bool (Score:2, Interesting)

by rhkramer ( 61840 ) writes:

Check out this page on twiki.org: http://twiki.org/cgi-bin/view/Codev/SearchEngineVsGrepSearch -- it discusses some search engines that have been / are being considered to replace the grep based search on TWiki.

To me, Namazu and Bool sound promising, but some others are discussed there as well.

TWiki is a Perl and cgi based wiki, and Namazu seems to be able to integrate into a .cgi based environment quite well, and can index Word documents.

Hope this helps!
Zope and DocumentLibrary (Score:2, Interesting)

by mwr ( 12650 ) writes:

Haven't tried the latter, but it may fit the bill. DocumentLibrary home [zope.org]
Apache Lucene (Score:4, Informative)

by danpat ( 119101 ) writes: on Sunday July 07, 2002 @06:58PM (#3838391) Homepage

I highly recommend taking a look at the Apache Lucene Project, at http://jakarta.apache.org/lucene/ [apache.org]

It's a full text search engine API, so some coding for your specific requirements would be required. However, it's fast, extremely flexible, and has a pluggable interface for documents. It comes with native support for plain text, and for proprietary document types, we've written simple wrappers around tools like "pdf2text" and "catdoc" to index PDFs and Word docs.

- and Apache POI (Score:2, Informative)
  
  by tpv ( 155309 ) writes:
  
  POI (http://jakarta.apache.org/poi [apache.org]) is an MS Office file reader.
  Much talk has been made of integrating Lucene + POI to provide indexing of MS Office docs, but I don't know what stage that is at.
Glimpse (Score:1)

by Aknaton ( 528294 ) writes:

I know Scot Hacker used to use Glimpse to do searches on www.betips.net. From the brief research I have done on Glimpse, it would work well if you mainly have text files.
glimpse (Score:2, Interesting)

by stor ( 146442 ) writes:

Hello there.

I have done all of this before in a commercial environment using Glimpse and Perl.

I'd recommend you check out glimpse and webglimpse. They ought to do what you are after, for free.

Cheers
Stor
Some Perl Engines (Score:3, Interesting)

by agentZ ( 210674 ) writes: on Monday July 08, 2002 @12:22AM (#3839693)

I don't know if you'd consider using Perl, but I've had some good luck with the Fluid Dynamics Search Engine [xav.com]. By default it can search text and PDF documents, and after some work I was able to get it to search the text of Microsoft Word documents too.

Try Xapian (Was commercial) (Score:2, Interesting)

by samjam ( 256347 ) writes:

Try xapian, www.xapian.org, about to undergo its first release.

It is based on a temporary open-source release of one of SmartLogik's products.

I swear by it and find it highly flexible.

I guess, though, unless you are a hacker - say capable of using to actually index your documents, you might want to wait for the next release.

I use it in preference to htdig, swish++ and others I have looked at and sadly left; xapian is very fast and easily passes the 2G limit systems such as swish++ suffer from, and supports dynamic aggregation of multiple indexes into one search!

Sam
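
The idea of aggregating several indexes into one search can be illustrated with a small Python sketch. This is not Xapian's API, only the general technique of merging ranked hit lists. It assumes each index returns `(score, document_id)` pairs already sorted by descending score, and that scores from different indexes are comparable; `merged_search` is a made-up name.

```python
import heapq
from itertools import islice


def merged_search(result_lists, limit=10):
    """Merge per-index hit lists, each sorted by descending score, into one ranked list."""
    # heapq.merge expects ascending inputs, so order by negated score.
    merged = heapq.merge(*result_lists, key=lambda hit: -hit[0])
    return list(islice(merged, limit))


# Hypothetical hits from two separate indexes.
procedures = [(0.9, "procedures/fire.doc"), (0.4, "procedures/travel.doc")]
management = [(0.7, "management/budget.doc"), (0.1, "management/minutes.doc")]
top_hits = merged_search([procedures, management], limit=3)
```

Because the merge is lazy, only as many hits as `limit` asks for are pulled from each index.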
