

Understanding Search Engines?
An anonymous reader asks: "I guess by now we can be fairly certain that search engines are here to stay, and hence I'm trying to understand how the technology works. I'm not so much looking for a particular 'best' technology or implementation, but rather an overview of the different approaches and their trade-offs. Something that would teach me: which approach works in a distributed vs a centralized infrastructure; how different algorithms will perform on complete search words vs arbitrary sub-strings; or how mass storage (hard disk vs. solid state) affects implementation choices. For most mature technologies there is a host of 'overview' books and papers for my questions -- but I couldn't find anything on search engines. Where should I look? Are there any good books or papers?"
Start off by reading up about Page Rank (Score:5, Informative)
Same basic concepts apply today ... although they probably didn't anticipate the rise of Black Hat SEO [watching-paint-dry.com] which attempts to "beat" the algorithms.
Foul Deeds... (Score:1)
Google are lying (Score:2)
Learn math (Score:5, Insightful)
Re:Learn math (Score:1)
Google founder's links to publications (Score:5, Informative)
One of the founders [stanford.edu] of Google still has links to various publications (in PostScript format) about search engines, if that helps.
Class (Score:3, Insightful)
Easy Start (Score:3, Informative)
Re:Easy Start (Score:2)
Re:Easy Start (Score:2)
Re:Easy Start (Score:2)
SEO can give you an insight into how particular search engines order the results they return. But it doesn't give you any idea how those search engi
Re:Easy Start (Score:2)
Managing Gigabytes (Score:5, Informative)
is a spectacular book on most of the underlying technologies. Although I've only read the first edition, I don't recall it talking about spidering/webcrawling. Instead it starts with building a simple index, and builds through all the refinements (i.e. stemming, etc.) until you've built a serious workhorse for mining text documents. It's definitely at the core of what a search engine does,
One interesting (and perhaps unexpected) place (Score:4, Interesting)
Comment removed (Score:4, Interesting)
Re:Related Question: (Score:3, Informative)
If you sell poultry medication, for example, there's no point in only labelling products as being for cockerels if your visitors are more likely to use rooster as a keyword. You might also want to refrain from putting "cock pills" in your meta tags...
Other than that, write semantically valid code (header tags, etc) and don't put large blocks of navigat
Re:Related Question: (Score:1)
It depends on what you want to achieve, though. GameSpy always uses "GameSpy is the most complete source for [insert game here] trailers, screenshots, cheats, walkthroughs, release dates, previews, reviews,
Re:Related Question: (Score:3, Insightful)
Web crawlers can't see the text in your images and weird HTML constructions can make it hard to parse the text back out. If your page content can be clearly expressed in plain text there's a good chance a search engine will know what you're talking about.
As an added bonus, if a web crawler can read your pages so can blind users.
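To make the point concrete, here is a rough sketch of what a very simple crawler might recover from a page. It uses Python's standard html.parser module; the class name VisibleText, the helper extract_text, and the sample page are made up for illustration, and treating the alt attribute as the only readable part of an image is an assumption, not a description of how any particular search engine works.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect the text a naive crawler could index (hypothetical helper)."""

    SKIP = {"script", "style"}  # contents of these tags are not page text

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "img":
            # Assumption: the pixels are opaque, only alt text is readable.
            alt = dict(attrs).get("alt")
            if alt:
                self.parts.append(alt)

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped tag.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    """Return the indexable text of an HTML string as one line."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


page = (
    '<h1>Poultry medication</h1>'
    '<img src="banner.png">'
    '<img src="logo.png" alt="Rooster care">'
    '<script>var x = 1;</script>'
    '<p>Pills for roosters.</p>'
)
text = extract_text(page)
```

The banner image contributes nothing to the extracted text, while the image with alt text does, which is roughly what a screen reader would get out of the page as well.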
Re:Don't Reinvent The Wheel (Score:2)
If the question was "I need to create a search engine for work, where should I start?" your answer would have been good advice. But the poster made no reference at all to creating his/her own search engine. Specifically: "...hence I'm trying to understand how th
Re:Don't Reinvent The Wheel (Score:2)
If you need a search tool, look around for a solution that someone else has already wasted years of their life on rather than have yourself do the same. Why recode, when you can download?
I have never in my professional life run into a nontrivial production business application which was perfect. They all have bugs. They all need more work. So it doesn't matter whether he downloads an open source system, or inherits his employer's legacy system -- he will need to learn the principles behind the techno
How to solve any technical problem made easy ..... (Score:1)
There are 3 basic approaches to solving any technical problem. Let's say, for the sake of argument, you want to cook a roast and have never done it.
Brute Force & Ignorance
Understanding Slashdot. (Score:1, Offtopic)
Ask Slashdot: Understanding Search Engines? 8 of 7 comments
It took a second glance to notice the subtle error in the wording.
Bug in slash?
Re:Understanding Slashdot. (Score:2)
Search Engines are Mature? (Score:1)
Look at Open Source projects (Score:5, Informative)
Re:Look at Open Source projects (Score:2)
Re:Look at Open Source projects (Score:2)
Mozdex (Score:2)
some info... (Score:3, Informative)
A brief intro of how classical search engines work goes as follows:
Grabbing: A crawler visits pages which it considers important, downloads them and parses them.
Analysing: The document receives an identification string and is stored in an inverted (reversed) index, which is simply a database table with columns such as "word", "document", "position", "importance". The "word" column is indexed and used for searching.
Searching: Say that you search for the phrase "ask slashdot". The search engine looks up the rows for the terms "ask" and "slashdot", looks into the "document" cell and selects only those documents that both terms occur in. Then it looks into the "position" cell, which carries all the positions of the searched word in each document, and discards all the documents in which "ask" and "slashdot" do not appear at successive positions. The resulting documents are then sorted according to the importance cells of the searched terms.
This is basically how all search engines work. The only major difference is usually in the math used to compute the importance. There are also some major optimisations done to speed up the responses. To discuss this would take too long. So if you have any questions feel free to ask. Currently I am part of a team developing a large scale search engine, so you have a chance to get some hot info here
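For anyone who prefers code to prose, the analysing and searching steps described above can be sketched in a few lines of Python. This is only a toy under stated assumptions: the index lives in an in-memory dict rather than a database table, the function names (tokenize, build_index, phrase_search) and the sample documents are made up, and "importance" is approximated by how often the phrase occurs in a document, which is certainly not what a real engine uses.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each word to {document id: [positions of the word]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            index[word].setdefault(doc_id, []).append(pos)
    return index


def phrase_search(index, phrase):
    """Return ids of documents containing the phrase, best match first."""
    words = tokenize(phrase)
    if not words or any(w not in index for w in words):
        return []
    # Step 1: keep only documents in which every term occurs.
    candidates = set(index[words[0]])
    for w in words[1:]:
        candidates &= set(index[w])
    # Step 2: keep documents where the terms sit at successive positions.
    hits = {}
    for doc_id in candidates:
        count = sum(
            all(start + i in index[w][doc_id] for i, w in enumerate(words))
            for start in index[words[0]][doc_id]
        )
        if count:
            hits[doc_id] = count
    # Step 3: sort by the toy importance score (assumption: phrase count).
    return sorted(hits, key=lambda d: (-hits[d], d))


docs = {
    "d1": "Ask Slashdot: ask slashdot anything",
    "d2": "Slashdot readers ask questions",
    "d3": "Please ask Slashdot",
}
index = build_index(docs)
results = phrase_search(index, "ask slashdot")
```

Document d2 contains both terms but never next to each other in that order, so it is dropped in step 2. Swapping the scoring in step 3 for something smarter is where the real engines differ.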
Search engine watch (Score:1)
I'm not so sure... (Score:2)
I'm not so sure I agree with you. What do people really search for nowadays (OK, other than "sex"/"porn")? I know where to go for my news, weather, sports, tech news, political discussion, coding tips, dining reviews, etc. The last time I *really* searched for something was a couple months back. I search within websites quite often, but I do not use large search engines like Google, Yahoo!, MSN, etc. more than a few times a year
Re:I'm not so sure... (Score:2)
Re:I'm not so sure... (Score:2)
I was talking more about the average Joe looking for sports scores or news. I think most people know where to go for their information, and search engines are just a last resort when they absolutely can't find what they need on the sites they know. I think Rupert Murdoch said something to that effect. Although I think Murdoch is wrong about nearly everything else, I have to agree with him on that small point.
Mining the Web (Score:1)
v7ndotcom elursrebmem Search Engine Competition (Score:2)
Some are joining to get prizes from the competition, like v7ndotcom elursrebmem: Blogging for Charity [yugatech.com]
Web Search for a Planet: The Google Cluster Arch.. (Score:2)
there is less info than you might think (Score:1)
Search Technology Resources (Score:1)
For print resources I would suggest:
Understanding Search Engines [amazon.com] by Michael Berry and Murray Browne
as well as
Modern Information Retrieval [berkeley.edu] by Ricardo Baeza-Yates and Berthier Ribeiro-Neto
For online resources I would of course direct you to the work of our Search Focus R&D Group [greenbuilt-research.com]