Apache Software

Topical AND Keyword Based Search Engines?

jhughes asks: "I work for a public university and have just recently 'acquired' the project of switching a community Web page (for the city and surrounding counties/cities) to a new machine running Apache. I'm relatively new to Apache but thus far this has not been a problem. The real problem comes from the request for a search engine for the Web site's information. This site will contain a large amount of public information, such as rental property, council meeting minutes, and so on. They are requesting that the search engine be able to scan the entire site for all the data, or specific subtopics (such as 'City Government', 'City Council', or 'City Council Meeting Minutes', with searches limited to each of those areas if need be). To me this sounds similar to what Yahoo has done in some cases (you can search Yahoo, or Yahoo US States, or Yahoo US States California, or specific cities, etc). I honestly have no clue what search engine to use for this. Are there commercial search engines that one could purchase and adapt, or is there a free/open source one out there that would work just as well? There will eventually be a huge amount of information (as all government sites seem to accumulate) online, so it'd need to be able to handle large indexes. Any help, advice, or suggestions would be greatly appreciated."
This discussion has been archived. No new comments can be posted.

  • by Canth ( 2796 )
    Didn't Yahoo recently switch to Google [google.com]?
  • There is a version of AltaVista, available as a commercial product, which would nicely meet the needs you describe. It supports filtering by category as well as keyword searches (along with many other features).

    Take a look at http://solutions.altavista.com [altavista.com] under the heading AltaVista Search Engine 3.0.

    A little disclaimer: I work as a developer for that product, so of course I'm biased; however, it really is very powerful.



  • I'll second the suggestion for Thunderstone. It's not cheap, but it is *quick*. I've written different search engines with it. I'm writing the code for one now. I've yet to see a SQL backend work with text as quickly as Texis does.


    You can download their Webinator demo and use it as a search mechanism on your own website. See their website for more information.

  • If you lay out your URLs properly, ht://Dig [htdig.org] may be enough; it allows you to restrict searches by URL pattern. See the docs for the restrict [htdig.org] keyword.
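
To make the idea of restricting a search by URL pattern concrete, here is a minimal Python sketch. It is not ht://Dig's implementation or configuration syntax; the `PAGES` data, the `city.example` URLs, and the `search` function are all hypothetical, and the "index" is just a dictionary scanned linearly.

```python
from urllib.parse import urlparse

# Hypothetical pages (URL -> page text); a real engine would use a prebuilt index.
PAGES = {
    "http://city.example/CityCouncil/Minutes/2000-10-15.html":
        "council approved the rental ordinance",
    "http://city.example/CityCouncil/members.html":
        "council members and contact information",
    "http://city.example/Housing/rentals.html":
        "rental property listings",
}


def search(pages, keyword, restrict=None):
    """Return sorted URLs whose text contains `keyword` as a whole word.

    If `restrict` is given, only URLs whose path starts with that prefix
    are considered, which is what scopes the search to one topic area.
    """
    wanted = keyword.lower()
    hits = []
    for url, text in pages.items():
        path = urlparse(url).path
        if restrict is not None and not path.startswith(restrict):
            continue  # outside the requested section of the site
        if wanted in text.lower().split():
            hits.append(url)
    return sorted(hits)


site_wide = search(PAGES, "rental")
council_only = search(PAGES, "rental", restrict="/CityCouncil/")
```

With this layout, a site-wide search for "rental" matches both the council minutes and the housing page, while the restricted search matches only the page under `/CityCouncil/`. The scoping is only as good as the directory layout, which is the trade-off the next comment discusses.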
  • I wouldn't concentrate too much on what search engine software to use, since it didn't really sound like you'd built out much of a requirements list of what you wanted it to do. Think about how you want it to function and then look for the tool that matches accordingly. I would even go so far as to create a search page and a results page first. A good place to start would be searchtools.com. [searchtools.com]


    As for building out a Topical or Hierarchical structure like Yahoo's or DMOZ [dmoz.org], you need to have meta data about the document. You can pull meta data from a URL, as was suggested earlier, but I wouldn't advise that. In order for this to work, your URLs start to look like 'http://city.gov/CityCouncil/MinutesOfMeeting/2000/10/15/index.htm', which makes it difficult to move things around, and makes it difficult to return the string 'CityCouncil' properly formatted when you build the hierarchy on the fly. If you include the hierarchy information in the page, you can get around some of these problems. I think the standard way to do this would be using meta tags. View the source of this page [intelligen...rprise.com] for an example of this. The downside to this is that you have to structure your hierarchy in advance, which goes back to getting & building requirements early. You also have to convert old documents to include these tags, and you have to make sure that future documents will have these tags properly implemented. If you're using URL paths to qualify the documents, you may be able to take advantage of an already existing directory structure to build your hierarchy. I just don't like using document locations to figure out what a document is about--it's always seemed like a brittle solution to me. On the other hand, I've used the URL string to help make a first pass at placing meta tags into documents, usually using a Perl script. I still had to go back in and check the documents though.


    I've used both FreeWAIS-SF and Verity [verity.com] for implementing searches like this, as well as home-grown solutions. I wouldn't advise using Verity since I think that it's prohibitively expensive, unless it has some feature that you require and are willing to pay for. And I didn't think the home-grown solutions worked as well as the off-the-shelf products that were customized.
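
The two metadata sources discussed above can be sketched side by side in Python (the comment mentions Perl for the first-pass script; Python is used here only for illustration). Everything named below is an assumption: the `category` meta tag name, the `/`-separated content format, and the helper functions are hypothetical, not a standard or any product's convention.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse


class MetaCategoryParser(HTMLParser):
    """Collects the content of the first <meta name="category"> tag seen."""

    def __init__(self):
        super().__init__()
        self.category = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.category is not None:
            return
        attr = dict(attrs)
        # "category" is a hypothetical tag name chosen for this sketch.
        if (attr.get("name") or "").lower() == "category":
            self.category = attr.get("content")


def category_from_meta(html):
    """Return the hierarchy stored in the page itself, or None if absent."""
    parser = MetaCategoryParser()
    parser.feed(html)
    if not parser.category:
        return None
    return [part.strip() for part in parser.category.split("/") if part.strip()]


def category_from_url(url):
    """First-pass guess from the URL path; a person should still review it.

    Assumes the last path segment is a file name, skips purely numeric
    (date) segments, and splits CamelCase names into words.
    """
    segments = [s for s in urlparse(url).path.split("/") if s]
    names = [s for s in segments[:-1] if not s.isdigit()]
    return [re.sub(r"(?<=[a-z])(?=[A-Z])", " ", s) for s in names]


page = ('<html><head><meta name="category" '
        'content="City Government / City Council / Meeting Minutes">'
        '</head><body>...</body></html>')
from_meta = category_from_meta(page)
from_url = category_from_url(
    "http://city.gov/CityCouncil/MinutesOfMeeting/2000/10/15/index.htm")
```

Here the meta tag yields the hierarchy exactly as the author wrote it, while the URL guess produces labels such as "Minutes Of Meeting" that depend on how directories happen to be named. That difference is why the URL-derived value is better treated as a starting point for tagging than as the final answer.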

  • Your project sounds like a project for Thunderstone's Texis [thunderstone.com] system. I've worked with quite a few other search engines, and for my projects, Texis always beat the other players in performance/flexibility.

    1. Keyword/Phrase searching (and,or,must include,must not include,must include N or more, etc.)

    2. Prefix and suffix processing

    3. Search by regular expressions

    4. Concept searching with precompiled thesaurus as well as a user-defined thesaurus (you can customize it for your industry's jargon, for example).

    5. Integrated P-Code compiled scripting language

    6. Available for a bunch of platforms

    7. Goes like a bat out of hell on huge indexes

    8. Free [beer] version available for smaller projects

    Their client list is pretty impressive: eBay, NASA, ZDNet, etc., and lots of city/state sites. The downside is that it's not cheap: $5000 to $10000 or more just to start.

    My only connection to Thunderstone is my prior use of their product.

    PDHoss
  • There is a VERY good search engine out there (BUT NOT cheap)

    Look up Excalibur Technologies - You can get Keyword, Theme, and concept searches out of it

    Disclaimer: It's one of the search engines we use at work. Other than the fact that they bought me lunch once, I have no other connections.
