Understanding Search Engines?

Posted by Cliff on Saturday February 04, 2006 @07:30PM from the a-primer-on-the-technology-behind-them dept.

An anonymous reader asks: "I guess by now we can be fairly certain that search engines are here to stay, and hence I'm trying to understand how the technology works. I'm not so much looking for a particular 'best' technology or implementation, but rather an overview of the different approaches and their trade-offs. Something that would teach me: which approach works in a distributed vs a centralized infrastructure; how different algorithms will perform on complete search words vs arbitrary sub-strings; or how mass storage (hard disk vs. solid state) affects implementation choices. For most mature technologies there is a host of 'overview' books and papers for my questions -- but I couldn't find anything on search engines. Where should I look? Are there any good books or papers?"

This discussion has been archived. No new comments can be posted.

Start off by reading up about PageRank (Score:5, Informative)

by xmas2003 ( 739875 ) * writes: on Saturday February 04, 2006 @07:35PM (#14643548) Homepage

Google has a summary here [google.com] ... but start with the original Brin and Page paper - The Anatomy of a Large-Scale Hypertextual Web Search Engine. [stanford.edu]
Same basic concepts apply today ... although they probably didn't anticipate the rise of Black Hat SEO [watching-paint-dry.com] which attempts to "beat" the algorithms.
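
For a rough feel of the core idea in that paper, here is a minimal sketch of PageRank by repeated iteration. It is an illustration only, not Google's implementation: the function name `pagerank`, the toy link graph, the normalized form of the formula (ranks summing to 1), and the way pages with no outgoing links are handled are all assumptions made for this example. The damping factor of 0.85 is the value the paper suggests.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank for a small link graph.

    links maps each page to the list of pages it links to
    (hypothetical input format chosen for this sketch).
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with rank spread evenly over all pages.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small base share (the "random jump").
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outgoing links spreads its
                # rank evenly over all pages, so total rank stays at 1.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(toy_links)
```

In the toy graph, page "c" is linked from both "a" and "b", so it ends up ranked above "b", which is linked only from "a".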

Google founder's links to publications (Score:5, Informative)

by zoeblade ( 600058 ) writes: on Saturday February 04, 2006 @07:45PM (#14643590) Homepage

One of the founders [stanford.edu] of Google still has links to various publications (in PostScript format) about search engines, if that helps.

Easy Start (Score:3, Informative)

by MikeFM ( 12491 ) writes: on Saturday February 04, 2006 @07:49PM (#14643609) Homepage Journal

Try the web. I have a short intro to search engines [kavlon.org] on my website. Many others exist. The basics aren't hard and are very effective.

Managing Gigabytes (Score:5, Informative)

by cariaso1 ( 674515 ) writes: on Saturday February 04, 2006 @08:03PM (#14643658) Homepage

Managing Gigabytes (author site [mu.oz.au], Amazon [amazon.com]) is a spectacular book on most of the underlying technologies. Although I've only read the first edition, I don't recall it talking about spidering/webcrawling. Instead it starts with building a simple index, and builds through all the refinements (e.g. stemming, etc.) until you've built a serious workhorse for mining text documents. It's definitely at the core of what a search engine does.

Re:Related Question: (Score:3, Informative)

by Denyer ( 717613 ) writes: on Saturday February 04, 2006 @08:33PM (#14643731)

Write in the language of the users you expect to use your site, and look at the server logs to see what terms people were using when they found your site.

If you sell poultry medication, for example, there's no point in only labelling products as being for cockerels if your visitors are more likely to use rooster as a keyword. You might also want to refrain from putting "cock pills" in your meta tags...

Other than that, write semantically valid code (header tags, etc) and don't put large blocks of navigation links first in a document -- you want search engines to concentrate more on the unique content of a page.

All common sense stuff, really.

Look at Open Source projects (Score:5, Informative)

by X ( 1235 ) writes: <x@xman.org> on Saturday February 04, 2006 @09:19PM (#14643842) Homepage Journal

So, aside from reading books on Information Retrieval and Data Mining, the other easily available references are open source search engines. In particular, look at the Nutch project [apache.org], which is actually a pretty high quality search engine implementation. Even better: start contributing to the project.

some info... (Score:3, Informative)

by Pavel Stratil ( 950257 ) writes: on Saturday February 04, 2006 @11:11PM (#14644157) Homepage Journal

If you really want to get behind the theory, you will want to start here [psu.edu]. But be prepared to need some really good skills in math, statistics, computer science and system administration to understand the articles, as they are not intended for the general public.

A brief intro to how classical search engines work goes as follows:
Grabbing: A crawler visits pages which it considers important, downloads them and parses them.
Analysing: The document receives an identification string and is stored in an inverted ("reversed") index, which is simply a database table with columns such as "word", "document", "position", "importance". The "word" column is indexed and used for searching.
Searching: Say that you search for the phrase "ask slashdot". The search engine looks up the rows for the terms "ask" and "slashdot", looks into the "document" cell and selects only those documents that both terms occur in. Then it looks into the "position" cell, which carries all the positions of the searched word in each document, and discards all the documents that do not have "ask" and "slashdot" at successive positions. The resulting documents are then sorted according to the importance cells of the searched terms.
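
The Analysing and Searching steps above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how any particular engine does it: the names `build_index` and `phrase_search` are made up for the example, documents are given as an in-memory dict instead of being crawled, words are split on whitespace only, and "importance" is simplified to one optional score per document rather than one per word and document.

```python
from collections import defaultdict


def build_index(documents):
    """Build a positional inverted index: word -> {doc_id: [positions]}.

    documents maps a document id to its text (hypothetical input format).
    """
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        # Assumption: lowercase and split on whitespace; a real engine
        # would also handle punctuation, stemming and so on.
        for position, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(position)
    return index


def phrase_search(index, phrase, importance=None):
    """Return ids of documents containing the words of phrase in sequence."""
    words = phrase.lower().split()
    if not words:
        return []
    postings = [index.get(word, {}) for word in words]

    # Keep only the documents that contain every search term.
    candidates = set(postings[0])
    for posting in postings[1:]:
        candidates &= set(posting)

    # Discard documents where the terms are not at successive positions.
    matches = []
    for doc_id in candidates:
        for start in postings[0][doc_id]:
            if all(start + offset in postings[offset][doc_id]
                   for offset in range(1, len(words))):
                matches.append(doc_id)
                break

    # Sort by importance (highest first), then by id for a stable order.
    importance = importance or {}
    return sorted(matches, key=lambda d: (-importance.get(d, 0.0), d))


docs = {
    1: "Ask Slashdot about search",
    2: "slashdot readers ask questions",
    3: "please ask slashdot",
}
index = build_index(docs)
hits = phrase_search(index, "ask slashdot")
```

Document 2 contains both words but not next to each other, so the position check drops it. Passing something like `importance={3: 2.0, 1: 1.0}` changes the order of the remaining hits.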

This is basically how all search engines work. The only major difference is usually the math used to compute the importance. There are also some major optimisations done to speed up the responses. To discuss this would take too long. So if you have any questions feel free to ask. Currently I am part of a team developing a large scale search engine, so you have a chance to get some hot info here :)
