Search Engines - Does Obscurity Prevent Exploitation?
GeekLife.com asks: "Search engines refuse to release (and often change) the exact criteria that determine their ranked results, presumably both to prevent competitors from stealing their techniques and to stop (or at least make less successful) attempts at 'cheating' - optimizing a site to exploit these criteria so that it ranks higher than it deserves to. Is this an example where keeping the specifics secret actually improves the tool? Or would releasing all the rules result in enough feedback ('given enough eyeballs...') to hone the criteria towards unexploitable results?" Interesting thought. Can current systems be improved to give better results, or have we reached an 'accuracy limit' as far as keyword-based searching is concerned?
Unexploitable? .... -1 flamebait (Score:1)
C;mon, open it up! (Score:1)
Unfortunately not. (Score:3)
A friend of mine left the company I work for and started making porn pages for an Australia-based porn company.
He is supposed to make 400 pages per month, all somewhat different. He gets a bonus based on how many hits are generated, and a commission based on signups from his banner ads.
He's doing pretty well financially.
Search engines can -always- be improved (Score:3)
The only way to achieve true search engine accuracy is to have an actual person search for pages on request. Why no company has thought of this, I'm not sure, as this could certainly be an explosive business opportunity. The difficulty of finding trustworthy information on the Internet is legendary, and I'm sure plenty of clueless newbies would pay a monthly fee to get better search results.
criteria. (Score:1)
Many-eyeballs doesn't apply (Score:3)
If everyone exploited the criteria... (Score:1)
Exploit for Google? (Score:1)
Limits (Score:4)
However, if, say, Google knew that I'd done searches for "albini" and "shellac" in the past, it could probably surmise that when I did a search for "big black", I was actually looking for Steve Albini's first band, and not BIG BLACK BOOBS, et al...
I can't figure out how else something like that could be accomplished without a sacrifice of our hope for privacy...
New AltaVista is now a squirrel tech Google (Score:1)
Tweaking meta tags to ensure high placement in search results is dead, dead, dead. AltaVista [altavista.com] is doing a rank-by-number-of-links thing, just like Google. ("Raging Search" and "AltaVista" are the same information, just different layouts.)
I like inaccuracy in search engines (Score:1)
goooooooogle (Score:1)
"Last November, as reported in Google likes directory sites, I discovered that Google had the uncanny ability to sniff out high-quality, but little-known directory sites. As I discussed in that article, Google was able to do this because it ranks sites according to how many people make links to them, and smart people everywhere learn that directories are important, so they make many links to them."
Then one can wonder about that actual article (Google Propping Up Yahoo In Search Results?), which is not quite as good news. But it still rocks. Plain and simple.
Google is the only search engine I use. Well, except when I'm lazy and I get me a WebBITCH (webhelp) [webhelp.com] . . . hehe.
/nutt
I vote for obscurity... (Score:1)
The last thing I want is to always, 100% of the time, get the 'wrong' web pages no matter what criteria I use to search for them. I get enough of that as it is.
Fortunately, the search engines keep evolving to try to take care of this. Unfortunately, it also seems to be getting easier and easier to divert search engines because of holes in the evolution of the browsers or new options allowed in the code or script for web pages. Hopefully, the search engines will win this fight.
Kierthos
Re:Unexploitable? .... -1 flamebait (Score:1)
Didn't we debate this already today? (Score:1)
Note: This is my lame attempt to be funny. To answer the question, how about making a search engine that is free, and NOT advertising-based? Maybe if the bandwidth was free..
Ranked by referring pages (Score:2)
Re:New AltaVista is now a squirrel tech Google (Score:5)
Re:Unfortunately not. (Score:1)
A matter of time (Score:3)
Re:Search engines can -always- be improved (Score:2)
However, most of my family (a bunch of engineers) seem to think that the best form of internet searching is asking me to do it for them.
Two days ago, an associate of military persuasion requested I do some searching. After telling him about google, he still wanted me to do the search in my free time at work. He said I could be his 'intelligence analyst'.
My reply would be inappropriate for this forum.
What about a moderated search engine ? (Score:5)
The users could submit links in different subjects or categories with different keywords, adding them to the harvested ones, and, most important, registered users would be able to get x moderator points a week and vote down spam links or links that don't make much sense for the search one conducted.
Add a healthy dose of meta-moderation (maybe three levels) and some obvious anti-cheat prevention techniques and it should work much better than a normal search engine.
God knows how many times, even on Google, obviously poisonous sites come up in the search; it would be so nice to have a button to click to moderate down the page or the domain itself...
It's all about the benjamins... (Score:1)
It would be hard to cheat at those criteria.
Closed source has its uses... (Score:1)
A question within a question (Score:2)
Re:Exploit for Google? (Score:2)
I thought someone already did this, posting enough links on /. that goatse.cx became the number one response for searches on "natalie portman grits"
--
The scalloped tatters of the King in Yellow must cover
Yhtill forever. (R. W. Chambers, the King in Yellow)
It will never be unexploitable. (Score:1)
Like Seti@Home, whose developers chose to keep the source closed to prevent spoofing results. I'm all for open source of pretty much everything, but sometimes there's something that is better off hidden.
Re:Search engines can -always- be improved (Score:4)
Dear GOD, those people at my local library! They must be part of some top secret start-up R&D initiative! So helpful, and for FREE! I KNEW they were up to something!!!
--8<--
Rotating Criteria (Score:2)
So, at best, a company could be really popular on occasion, but suck the rest of the time.
---
Odd search results. (Score:1)
Google: The Criteria Aren't Exploitable (Score:2)
Sure, you could pay people to link to your site -- it's done all the time. But only pr0n sites and the Big Portals are gonna have enough coin to make that a factor in a generalized site. If your site is at all focussed or specialized in nature, you're not likely to have the funds or time to pay people to stay linked to your site.
And that brings me to another point: with sophisticated enough keyword-ranking algorithms, it'll become more of a pain in the ass to come up with spam that makes it through the filters than to simply put up a good site in the first place. 600 repetitions of "sex" are easy to pick out. And if HotGrammar 2.0 can pick through my dangling participles in a reasonable amount of time, then it can't be too much more difficult to point it at a website and say "Does the content match the keywords?"
At that point I think you really would have something reasonably close to a non-hackable search engine: The Google Algorithm to pick out the sites that people have blessed with links, and The Grammar Nazi Algorithm to make sure there's content to match.
Obscurity? Not here... (Score:2)
What I'd suggest, though, is a compromise. The basic popularity rating is based off of number of links, as with Google. However, people would be able to "rate" the effectiveness of a given site when it comes up in the list (nothing fancy, just "relevant," "neutral," or "irrelevant"; maybe a five-point scale rather than three-point). No individual vote could do very much to the system, but as votes add up trends start to show, and relevance ratings can be modified based on this. This would require a user registration system to keep track of moderations (though definitely not searches), but so long as registration isn't required to actually use the search engine itself, I fail to see the problem in it.
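A minimal sketch of that voting compromise. Everything here is invented for illustration (the function names, the 100-vote ramp, the 25% cap); the only idea taken from the comment is that no single vote moves much and trends only show as votes accumulate:

```python
from collections import defaultdict

# Per-(query, url) vote tallies; a real engine would persist these
# against registered accounts to keep track of moderations.
votes = defaultdict(lambda: {"relevant": 0, "neutral": 0, "irrelevant": 0})

def apply_vote(query, url, rating):
    """Record one registered user's vote: 'relevant', 'neutral',
    or 'irrelevant'."""
    votes[(query, url)][rating] += 1

def relevance_adjustment(query, url, cap=0.25):
    """Turn accumulated votes into a small multiplier on the base
    link-popularity score. The cap keeps even a flood of votes from
    overwhelming the link-based ranking."""
    v = votes[(query, url)]
    total = v["relevant"] + v["neutral"] + v["irrelevant"]
    if total == 0:
        return 1.0
    # Net sentiment in [-1, 1], weighted by vote volume so that
    # trends only emerge as votes add up.
    net = (v["relevant"] - v["irrelevant"]) / total
    weight = min(1.0, total / 100.0)
    return 1.0 + cap * net * weight
```

The multiplier stays near 1.0 until enough votes agree, which is one way to make any individual vote nearly powerless.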
----------
It seems like Google is kinda exploit-proof... (Score:2)
The biggest exploit I could think of for this, would be for a big company like Amazon to pay a lot of people to link to their site (oh wait, they already do). Thus they can use money to skew the results, without having to corrupt Google directly. Hmm.
Re: PS (Score:1)
I guess this would defeat the page-rank (page vote) idea that Google uses. Although, as no one's seen their algorithms, I doubt they haven't considered this.
The quality of results is the fault of users & UI! (Score:5)
Other than that, though, the interfaces that most search engines use are pretty bad. There is usually no way to filter through a set of results to eliminate things that are obviously not what the searcher wants. Just being able to eliminate a set of domains from the initial results would make a huge difference for me.
Also, most people have no clue how to effectively use search engines - and they're not all that interested in learning. I've been working in the web industry for quite a long time, and most of my colleagues seem to have no idea that changing the settings can yield better results. The 'phrase' setting, for instance, makes a HUGE difference much of the time - yet I've never seen a colleague change any default settings when doing a web search. If you're not willing to do so much as toggle an individual setting, you deserve the crappy results you get.
Oh, another thing - many of the links I get back are of dubious quality - even on the 'phrase' setting, many of the results that come back don't match what I specified. If you play by the rules and the results STILL don't match, I have little faith in ANY results, whether or not the web site operators are trying to subvert accuracy. This is aside from the very common result of '404 not found' pages.
Right now, the best search engine I know of is a meta search engine called 'ProFusion' - I've had much better luck with it than with Google. Not enough control over Google...I also like that the results with Profusion ( http://www.profusion.com [profusion.com] ) come back with an option next to each result to open in a new browser window - now THAT's a nice idea!
Another idea - 'demote' button (Score:2)
After all, more people like to complain about a bad link than to promote a good one. Let human nature work in our favor, for once.
---
Banner ads can teach us a lot (Score:2)
Ever since I can first remember seeing banner ads, I can also remember seeing "please click here to support this page" right below them. People often end up clicking ads in order to 'support' a site and generate money, rather than being interested in the content of the ads. If people can exploit something for money, they will.
Secondly, banner ads are what give people incentive to cheat on search engines. If they can get more hits per month, that directly translates to more clicks (or impressions) per month, and more $$$. In today's society, with the commercialization of the Internet and 'dot-commies', I would bet money that the resulting information would definitely be used to make somebody a quick buck.
That being said, I would wager that the final result would end up being more reliable search engines, but the potential problems may or may not be worth it.
Open up the criteria! (Score:5)
If you open up the criteria such that *everyone* exploits them, then there is no discrimination. When the criteria are closed, only those who have found the exploits can get increased exposure, making it inherently unfair.
Another issue is that what a search engine wants you to see is different than what you want the search engine to give you, in some cases.
We want the union of two criteria: the results that give the search engine the most use/reuse (usefulness of the search) and the results that give the search engine the most financial recompense (so that the search engine can grow, get better, get faster, etc.)
They may not be correlated, but they are both very important. The most useful pages may not give them the most money, and the pages that pay them the most may not generate enough repeat use for them either.
Perhaps the best search algorithm is two step:
Rank according to links (the more links to a page, the more useful the page)
Count repeat use (the more times a search has to be refined, the less useful the pages returned)
Ranking according to links already occurs at AltaVista and Google.
I don't know that anyone does the second.
Say you do a search on Google; if you hit the next button, then the pages that were generated get knocked a few points. If you hit Google again a few minutes later with a variant search, then knock a few points off *all* the pages that got listed in the previous search. If a user goes back and hits 'related' pages, increase the points for that page and all the related pages. Repeat the above algorithm for every hit to Google.
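The feedback loop sketched above fits in a few lines. This is a toy illustration of the comment's proposal, not anything any real engine is known to do; the event names and point deltas are made up:

```python
# url -> relevance points accumulated from observed user behaviour
scores = {}

def record_event(event, urls):
    """Adjust scores for the urls shown on a results page.

    event: 'next'    - user paged past these results (small penalty)
           'refine'  - user re-ran a variant search; penalize
                       everything listed in the previous search
           'follow'  - user clicked through to a result (reward)
           'related' - user asked for pages like this one (bigger reward)
    """
    delta = {"next": -1, "refine": -2, "follow": +2, "related": +3}[event]
    for url in urls:
        scores[url] = scores.get(url, 0) + delta
```

A production version would obviously need to rate-limit this per user or IP, since otherwise it is exactly the kind of criterion a spammer could script against.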
The nick is a joke! Really!
blah blah blah (Score:2)
So what we need is to work on developing an open, complex, and nigh-uncheatable algorithm while search engines continue to use their own closed methods.
Re:Google: The Criteria Aren't Exploitable (Score:2)
Nah, easy. Ever look into registering a domain name? Notice how there are all those bulk registration specials, like "only $9.95 each if you register 250 or more"? That's because there are a good many web companies out there with dozens or hundreds of domains. In the case of porn, maybe thousands. It's pretty easy to cross-link everything so you look more popular than you are.
Track IPs (Score:1)
If a user is still searching after a minute or two, he or she obviously didn't find what they were looking for.
---
I Asked Jeeves . . . (Score:2)
Re:criteria. (Score:1)
The chick from Golden Girls with little hexagons in her eyes, naked?
That's not old age, no sir, those are pollen sacs.
Re:Search engines can -always- be improved (Score:2)
I'd like to see a search engine which 'profiles' an individual user in an attempt to improve search results. And that segues nicely into the next paragraph:
Ideas like this exist to some extent, but unfortunately all the ones I've seen are of the targeted-advertisement variety (gross..). It seems any kind of profiling ability eventually ends up being used for ads. Are there any search engines out there that keep track of preferences like this, becoming more accurate over time, and without any pornographic vampire junk mail distributors lurking in the background?
It'd be nice to combine the personalization of intelligent agents with the vast databases of major search engines...
Re:Exploit for Google? (Score:1)
Re:Limits (Score:2)
Imagine a different criterion:
If you do a search and within some time limit hit the 'next' link at Google, the search engine should knock a few criteria points off the first page of returned links, under the concept that those were not accurate enough matches.
If you do a new search with a variant/refinement of the original search, then knock off points from all the pages that were browsed as irrelevant.
If you use the 'related pages' then add points to the page that was found.
Every link that is followed should have added points, under the concept that they were good enough, on inspection, to be useful.
At no time does the server ever need to place a cookie to keep track of you, other than session ID or something like that!
The nick is a joke! Really!
Re:C;mon, open it up! (Score:2)
ummm oh i see, as a keyword
There is a way to vote! (Score:1)
If you hit 'next', decrease the relevance points for all the pages returned to the user.
If you hit 'similar pages', then increase the relevance points even higher for this hit!
If the user refines the search, reduce the rp for all pages that were previously generated.
Of course, I don't see any search engines using these criteria, yet!
The nick is a joke! Really!
Re:Odd search results. (Score:1)
Probably an open source issue really (Score:1)
In 1994-96, I ran one of the first-gen search engines (the Open Text Index, r.i.p.), and made my living in a search & retrieval software company for years.
If you look inside the source code for any of the engines, you'll discover that the result rankings are an unholy stew of heuristics layered on linguistics layered on guesses. Among other things, the world isn't in English and there are lots of language-specific techniques. Furthermore, there are people who fine-tune this all the time. Furthermore, the code is shot through with special-case handling, and all sorts of boring tedious stuff to stave off word-spam and <meta>-spam and litigious organizations and so on.
There's no doubt that Google took a serious step forward when they started working the input-link count into the result ratings in a serious way. Works for me, anyhow.
I guess the upshot is that the search engine's source code is the only meaningful specification of how the rankings work. Which probably means that the info won't be in the public domain until they start going Open Source. Which would probably be a good idea, but their management and investors might not see it that way.
-Tim
Why such a complicated system? (Score:2)
Increase relevance points if a link is followed
Decrease relevance points if a link is ignored (next is hit, instead)
Decrease relevance points if a new search is defined (none of the prior search results were sufficient)
Increase relevance points if 'similar pages' is followed
It should behave something like what you propose, without additional cookies, work, voting, or otherwise. Other than normal behavior!
The nick is a joke! Really!
Re:Google: The Criteria Aren't Exploitable (Score:1)
Any ideas on how to thwart that?
NEWSTORY: that slash forgot and is too lame (Score:1)
So fuck you all, Slashdot won't report on this now that it's broken here before it.
Intelligent? I think not.... (Score:2)
If you're trying to make a search engine that isn't easily exploitable, how about identifying how current engines are exploited and designing around that?
One way is to include the keyword many, many times in a comment tag. A possible solution is to grade the keywords via their entropy relative to the words surrounding them, such as testing for repeating patterns in a comment tag. Hell, use the HTML tags to help you out by not grading anything in HTML comments. Another way is to do some syntactic analysis on the contents of certain tags and, if they are not unique, count them only once. Certainly people with bad grammar skills, and languages other than the one intended, will suffer, but you can add verification criteria for each language you want to index. If the trolls hunt me down and say this method is censorship (Americans love to say anything is censorship), then you should just stick to the simple useless search engines we have now. Plus, one can always implement this as a searching option on top of an existing broken wheel.
This isn't really that hard; it just requires a bit of cleverness and lots of preparatory work.
Or maybe I'm just on glue. :P
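The entropy-grading idea above is easy to sketch: repeated-keyword spam has a very low-entropy word distribution, while natural prose doesn't. The thresholds here (3 bits, 20 words) are illustrative guesses, not tuned values:

```python
import math
from collections import Counter

def word_entropy(text):
    """Shannon entropy (in bits) of the word-frequency distribution."""
    words = text.lower().split()
    if not words:
        return 0.0
    counts = Counter(words)
    n = len(words)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def looks_like_keyword_spam(text, min_bits=3.0, min_words=20):
    """Flag long, low-entropy word runs (e.g. a comment tag stuffed
    with the same keyword over and over) while leaving short,
    legitimate snippets alone."""
    words = text.split()
    return len(words) >= min_words and word_entropy(text) < min_bits
```

600 repetitions of one word score 0 bits; a run of 30 distinct words scores about log2(30) ≈ 4.9 bits, so it passes. Per the comment's caveat, thresholds would need per-language calibration.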
An AI-complete problem? (Score:1)
Re:A question within a question (Score:1)
Take a step back for more power (Score:2)
Oh, and just to be completely on topic, if the big guys change their search results, Search Rocket needs a quick patch. So in a way, they could prevent tools like this from working by changing the results a lot, but I would think they want to keep it standard due to cross licensing etc etc
Ok, enough rambling. Time to go home.
Re:Another idea - 'demote' button (Score:1)
previous posts [slashdot.org] on the idea.
The nick is a joke! Really!
Re:An AI-complete problem? (Score:1)
An interesting idea.
It would be great to have search engines which specialize in a specific topic, like medicine or economics or whatever topic they feel necessary. This way you can apply strict rules to everything that doesn't fit your topic criteria... of course this is very niche-based, but the countless hours I've searched for explanations of algorithms only to get shareware sites is mind-numbing. Try searching for "Berkley mbox lex grammer" .... and end up with nothing useful. (disclaimer: lex parses better than I do, so let it do the work)
Re:Track IPs (Score:1)
* Visit sites unrelated to the search, if I find it interesting
* Middle-click sites I want to see, so the search page is never reloaded
--------
"I already have all the latest software."
Re:Search engines can -always- be improved (Score:1)
The Bugaboo is Relevancy (Score:4)
The biggest problem with search engines is relevancy. When I do a search for a word like "magic", the search engine will return results based upon its algorithm, but trying to produce relevancy from a single search word is just about an impossible task. With a term like "magic" I could be looking for:
Or any of a large number of subjects that I could have in mind at the time of my search. The results from a search engine such as Google will rank pages which contain the word magic in the page title, multiple times in the body of the page, in the META tags, in or near HREF links, or which are linked to by many other sites, higher than those which do not meet these criteria. It differs from search engine to search engine, depending on criteria.
None of these criteria for ranking take into account the nature of my query - what I had in mind when I did the search. In other words they do not directly address the relevancy of the results. If a search engine offered me the opportunity to pick from results it returned and gradually refine the search to produce better results it would be addressing this situation. Some do with a "search again in this result set" or "more like this" type option on their results pages, but its still kinda mechanical, and not all that reliable.
I think it will take some sort of AI analysis of search requests based on user-feedback of some sort and with a learning capability to surpass the current crop of search engines. Until such time as we have some smart systems working behind the scenes on searching any improvements will no doubt be incremental rather than radical.
Now, as for keeping the specifics of how a page is ranked secret I think its absolutely necessary. There is a constant, quiet, war going on between the search engines and the folks who want to get their websites listed at the top of the page when a result set is produced. The people who regularly submit their sites to the various search engines, with each search engine receiving a specially made page generated just for its benefit to ensure that the website gets the best ranking possible etc, are not interested in how accurate the search engine is, they simply want to come up first. The folks at the search engine generally want the most relevant pages to be returned. There is an essential difference of purpose between the two camps.
On the side of the search engines, they have control over their ranking system, and change it periodically to prevent abuse of the system. The folks who are seriously trying to get to the top of the heap in the search engine results are constantly trying new methods to get ahead.
For instance, at one point some webmasters were creating their webpages with a lot of text at the bottom of the page that was the same font color as the background, so that the search engines would spider the contents of the page but users would never see those contents. This let them list all sorts of words that scored higher in the search engines returns, but had little or no relevancy to the page contents. The search engines got wise to this trick and now most will penalize you for using it.
Opening up the search engines ranking rules would only make the system easier to abuse more precisely. No matter how many eyeballs pour over the code, it will still not change the nature of the guy who will use any method at his disposal to get his porn page returned as Link #1 when you do a search for MP3 because its the hottest term currently being searched for.
Google has altered this battle somewhat by ranking pages higher in their results based on how many other webpages contain links to that page (and also based upon the nature of the linking page. They use a distinction between pages which contain a lot of links - like a web directory such as my own Omphalos [omphalos.net] - and those which are linked to by a lot of other pages. Both get points for different reasons and in different instances. I don't remember the details), but even this is open to abuse, although with a bit more effort required. I know of a website which has over 200 different URLs registered and operational, all of which contain pages which point back to the main URL they are promoting. When a search engine such as Google goes to analyze this website, it will rank it higher because it is linked to by so many separate domains and so many separate pages on those domains. It's harder to abuse, but it can be done.
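One hypothetical defense against that 200-domain trick is to cap how much any one second-level domain can contribute to a page's inbound-link count. This sketch is purely illustrative (the cap of 3 is arbitrary, and real code would use a public-suffix list rather than naive splitting):

```python
from collections import Counter
from urllib.parse import urlparse

def capped_link_score(inbound_urls, per_domain_cap=3):
    """Count inbound links, but let each second-level domain count at
    most `per_domain_cap` times, so a farm of mirror domains under one
    registrant adds little to the score."""
    domains = Counter()
    for url in inbound_urls:
        host = urlparse(url).hostname or ""
        # Crude second-level-domain extraction; breaks on ccTLDs like
        # .co.uk, which is why real engines would use a suffix list.
        sld = ".".join(host.split(".")[-2:])
        domains[sld] += 1
    return sum(min(c, per_domain_cap) for c in domains.values())
```

Ten subdomains of one link farm then count the same as three genuinely independent links, which raises the cost of the attack from registering domains to acquiring links from unrelated registrants.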
Of course, this is all basically irrelevant, since each of the search engine companies keeps their methodology and their source code highly protected. It is worth millions of dollars in revenue, and I cannot honestly see any of them deciding to release their software in this way.
If you have not noticed, practically every graduate student who devises a new and effective method of indexing and ranking search results ends up creating their own company once they have delivered their thesis and entered the real world. That is certainly how Google started, and I believe is also how Ask Jeeves got going. I am sure that most of the other main search engines have gotten going in the same or similar manner.
All that said, if you want to play with a true search engine that is GPLed and works quite well, although not on the scale of a Google or an AltaVista, try UDMSearch [mnogo.ru]. It runs just fine under Linux or FreeBSD (I have installed it on both in the past) and I am using it on my site under Solaris. It is still in an intense development cycle and new versions are released regularly, but it's worth exploring if you are interested in how a search engine works and want to get your hands dirty.
For more information on the big boys, check out Search Engine Watch [searchenginewatch.com], and finally, if you are simply interested in Space, Space Exploration or Space Science, check out SpaceRef [spaceref.com].
Re:The quality of results is the fault of users & (Score:1)
AltaVista [altavista.com] has a pretty good set of primitives in its advanced search, including matching against title, the whole URL, the hostname, what sites it links to, and the text in anchors; you can also use "*" to say "search for anything beginning with this". So you could do a search like:
I find that being able to search by title helps enormously, and being able to use "*" saves me from having to search on variations of the same term/prefix.
Suppose you were an idiot. And suppose that you were a member of Congress. But I repeat myself.
Re:<META `/usr/dict/words` (Score:2)
Also, -"credit card" and -"member"
usually filter out a lot of shit.
willis/
Re:Open up the criteria! (Score:5)
You seem to forget that the idea of search engine result scoring/ranking is not about being fair to all sites, it's about returning the best result possible.
If you open the criteria, the sites that make money from ads will all use them (the result is going to be "fair" between those sites), but the problem is that the not-for-profit websites (which are much more common) won't change their pages just to get more hits (they don't care). The result is that, though it ends up being fair to all the commercial sites, as a user you're less likely to find what you're looking for... which is the point of using a search engine.
If you just want to be fair, have the search engine return a random URL. Now *that* would be fair!
Re:Unexploitable? .... -1 flamebait (Score:2)
Release their assets? (Score:2)
The sorting mechanism used by the search engines is their way of creating a great search engine! It's relatively easy (given enough resources) to catalog web pages. The big problem is figuring out which one of those web pages somebody is looking for.
Asking a search engine to release its sorting mechanism is like asking Big Bob to release his secret barbecue sauce recipe in the name of improving barbecue sauce around the world.
Re:Search engines can -always- be improved (Score:2)
What about Web Position software? (Score:2)
Search result usefulness declines over time. (Score:1)
<META NAME="keyword"> tags were a good deal for a while, but that got corrupted pretty quickly. Notice how long some of the high ranked pages take to load? View source on one of those pages and see how many screenfuls of keywords appear in META tags. There are numerous other ways to crowbar the index, this is only one of the easiest -- if the search engine indexes comments for keywords as well, you can cram the same enormous list of keywords into comment lines. Neither is visible to the end user, so all they see is a long delay.
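A crude check for that kind of stuffing is to compare how much keyword text is hidden in META content attributes and comments against what the reader actually sees. This is an illustrative sketch only; the 3:1 ratio is an arbitrary assumption:

```python
from html.parser import HTMLParser

class StuffingDetector(HTMLParser):
    """Tally words visible to the user vs. words hidden in <meta>
    content attributes and HTML comments."""
    def __init__(self):
        super().__init__()
        self.visible_words = 0
        self.hidden_words = 0

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            for name, value in attrs:
                if name == "content" and value:
                    self.hidden_words += len(value.split())

    def handle_comment(self, data):
        # Comment lines crammed with keywords are invisible to readers.
        self.hidden_words += len(data.split())

    def handle_data(self, data):
        self.visible_words += len(data.split())

def is_stuffed(html, ratio=3.0):
    """True if hidden keyword text dwarfs what the user actually sees."""
    d = StuffingDetector()
    d.feed(html)
    return d.hidden_words > ratio * max(1, d.visible_words)
```

A page with screenfuls of META keywords and two words of content trips the check immediately; a normal page with a short, honest keyword list does not.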
Sorry, this has been a sore spot of mine for a LONG time, and relates as much to obnoxious unsolicited email as it does to pages and pages of results totally irrelevant to my search. (Google seems to filter a lot of this c**p out pretty well.)
OK, OK, done ranting, and I know I could probably force feed my own page to as many eyes as the big admongers do, but the point is I *don't*. Maybe it's because I have ethics, I don't know, but it just p***es me off when so many people don't seem to know any better. Maybe they think they're actually boosting their sales. I don't know. All I can say is they're not getting *my* business
About returning the best results possible... (Score:2)
If all the users get are Coca-Cola sites, no matter how they search, and these are not relevant sites, then they will get devalued to the point of non-existence.
So any gross abuse gets moderated by the user population.
Thus any useless site, no matter how hard they try, will not be able to maintain a high rank.
The nick is a joke! Really!
Re:Search engines can -always- be improved (Score:1)
Re:What about a moderated search engine ? (Score:2)
such a remarkable idea - USERS moderate how relevant a site is to a category, not a computer.
You can even download the data and make your own directory from it - www.dmoz.org [dmoz.org]
Re:Ranked by referring pages (Score:5)
Sites are grouped into two categories, known as authorities and hubs. A hub points to lots of different pages (Yahoo, for instance). An authority has lots of different places pointing to it.
Where the ranking comes into play is dependent on how good the hub or authority is. A hub is good (better than others) if it points to a number of good authorities. Likewise, an authority is good if it is linked to by a number of good hubs. Yes, this is a recursive process, and yes, it takes a number of passes to get the ranking to level out.
If a pr0n site wanted to exploit google to get a higher ranking, they would first need to create a LOT of dummy sites to link to it, and all those dummy sites would need to be found by google's robot.
However, just having a large number of dummy sites linking to the pr0n site is not sufficient. Those dummy sites would also have to link to a large number of other GOOD authority sites (on pr0n or whatever).
Now, throw another wrench into the works. Google doesn't search only on keywords ON the site itself, but on the sites that refer to it, and the other way around. That's why if you search for "more evil than satan himself" you end up with Microsoft as a prominent result, even though the words evil and satan probably don't appear anywhere on Microsoft's website (although maybe they should).
This way, if you were searching for pages about a certain topic, but the pages themselves don't actaully use the words you're looking for, you will still find that page as long as there are good hubs out there that refer to that page and use your search terms in close proximity to the links.
Now, if a hub points to a large number of authorities on a specific topic, words relevent to those topic will then become viable search terms to find the hub when searching, as the hub would also be a good source of information, even if it doesn't list the specific search terms. All of this affects the "ranking"
So, for a dummy hub to get a high ranking, it would need to point to a large number of high ranking authority pr0n sites (which would anti-productive when what you're trying to do is advertise your own site). This would raise the hub rating for certain terms (specific to pr0n sites), and therefore raise the bar on the site you're trying to promote.
Of course, trying to get a pr0n site to come up on a search for "teen" or even "sex" is not easy because while a pr0n site is generally fly by night, there are many legitamate sites which have been around for several years and have built themselves into the web structure well and therefore get catagorized correctly.
-Restil
New MegaSearch Search Engine (Score:2)
Come visit the new MegaSearch©®(tm) Engine, the only engine to find exactly what you want, every time, GUARANTEED!
Features Include:
Never coming from a startup
Search quality is hard (Score:1)
Re:Search engines can -always- be improved (Score:1)
A problem with the opening the rules (Score:2)
Pages I'm looking for are often not designed to exploit search engines anyway. If I'm looking for some technical documentation, it is often just stuck together by someone in a rush, or on a mailing list archive. These pages often don't have optimised META tags, etc, so some engines don't index them well.
Many people are suggesting that making the rules open is good because then everyone will exploit them equally. Perhaps, but that will just make it more difficult to find pages that don't try to get a highly rated search result.
On the other hand, I find Google works pretty well, and how it works is pretty well known.
Re:What about a moderated search engine ? (Score:2)
Yeah, and god also knows that system would never be exploited. Jeez, it'd be worse than it is now!
Shit, do you think Microsoft matches would EVER show up after the first few weeks of use?
Need we go on...
-thomas
"Extraordinary claims require extraordinary evidence."
Security by obscurity? (Score:3)
The arguments against "Security by obscurity" apply here.. so just insert those arguments [here] and I'll move on...
It works not by prevention so much as by "reduced body count", and I guess that's the best a search engine can hope for.
When someone thwarts security, that's it... you're dead...
When someone tricks a search into giving them top results, it's just a few websites... it CAN be overlooked.
So say... 1 person hacks AltaVista... it's down... 100 people hack AltaVista... it's still down... 1 cracker vs 1,000 crackers makes very little difference... it only takes one defect and one joker to ruin your day...
But with searches... a defect becomes known and you don't fix it in time... 1,000 jokers and you're screwed...
1 joker, however, isn't a problem... you're still online and USUALLY you still give good results... just one bad result...
You get bad results from random chance and user mistakes anyway... so big deal...
But you're expecting that the joker, once he's discovered his little trick, won't make it public...
Right now that doesn't happen...
But it's a lot to risk...
Recommendation... since obscurity is effective, but not perfect, give away the OLD system...
Provide a license that basically says "Any changes may be used by us at any time without notice... but only we may do this... all else is open source."
Many of the listings are paid (Score:2)
Re:The Bugaboo is Relevancy (Score:3)
Magic as in Magic the Gathering - a collectible card game I used to play.
Magic as in the occult.
Magic as in sleight-of-hand.
I know this will blow your mind, but no advanced AI is necessary.
Instead of typing "magic," you can add one or two more words to your query, and actually get the info you need! E.g. "Magic the Gathering."
Pretty neat, huh kids?
-thomas
"Extraordinary claims require extraordinary evidence."
Accuracy limit (Score:2)
We have not reached the accuracy limit. A search engine should be able to read my mind and infer the best sites to go to.
Re:About returning the best results possible... (Score:2)
User moderation is an extremely difficult problem. Look at /. - they have a fairly good moderation population, and yet the system still gets abused. All in all, the /. moderation system isn't anything to write home to Mom about, but it works after a fashion.
One of the reasons it does is that while moderators may have an axe to grind (pro-Linux, anti-MS, whatever), at least their particular stance is somewhere in the same ballpark as the majority of the readers. When you open up the moderation to the world at large, though, and anyone can moderate... sorry, that just doesn't work. You end up with too many people, with too many goals, and too many directions they want to pull a ranking in for it to work out.
Most moderation/rating schemes, even if they don't state so, assume that the moderator/rater is going to at least make an attempt to rate something honestly. The sophisticated ones try to account for the possibility of intentionally or unintentionally bad moderation or ratings. These types of systems succeed when the moderating population is somewhat cohesive, and at least shares the same fundamental outlook. The really sophisticated ones even try to deal with the l337 hax0rs trying to skew everything to show sheep pr0n just because they can... but none of them can deal with the idea that a web page or article or whatever can belong to an arbitrary number of groups simultaneously.
I know of (and worked for) one company that did try this, unsuccessfully; when you have a system this sophisticated, it's far too easy to throw a monkey wrench into the ratings and turn everything into dreck. I'm not saying that such a system couldn't be made to work; but I am saying that it just isn't worth the effort. By the time you had it working, you'd have developed a general-purpose expert system; and if you could do that, you sure as hell wouldn't be wasting your time using it to power a search engine.
Re:Open up the criteria! (Score:2)
When a search engine hits a 404 error... (Score:2)
Would that help a lot with these broken links? What do you think? Is this possible?
''Adversary'' view and randomness (Score:2)
The solution often involves randomness. If search engines built randomness into their weighting criteria, an adversary with the algorithm would still not be able to influence the random (or "pseudo-random") aspect of the weighting.
Other things beyond the control of the adversary are usable too (some people have mentioned excellent ideas such as external links, voting, etc.).
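As a rough illustration of the randomness idea (the scores, page names, and jitter size below are all invented for the example), an engine could perturb each page's base score by a small pseudo-random amount before sorting, so that even an adversary who knows the base formula can't predict the exact final order:

```python
import random

# Hypothetical base relevance scores computed by the (known) algorithm.
base_scores = {"page_a": 0.92, "page_b": 0.90, "page_c": 0.40}

def jittered_ranking(scores, jitter=0.05, seed=None):
    """Return page names sorted by score plus a small random perturbation."""
    rng = random.Random(seed)
    # perturb each score by up to +/- jitter, then sort descending
    noisy = {p: s + rng.uniform(-jitter, jitter) for p, s in scores.items()}
    return sorted(noisy, key=noisy.get, reverse=True)

ranking = jittered_ranking(base_scores, seed=42)
```

Pages with nearly equal base scores (page_a, page_b) may swap places from run to run, but the jitter is small enough that a clearly irrelevant page (page_c) still can't bluff its way to the top.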
Re:Open up the criteria! (Score:2)
There *already* is a random search. Uroulette [uroulette.com].
I've actually found some interesting sites with this.
The politics of search engines (Score:2)
Even "good" search methods have embedded social values. For example, Google's backlinking methodology tends to reinforce traditional power structures since heavily commercial sites tend to link to each other a lot.
Search engines are in the business of controlling what you become aware of. There are lots of things that become interesting just because lots of other people are also aware of it (e.g., Survivor, Big Brother, etc.).
Search engines don't really try to maximize relevancy; they try to be relevant enough so that you don't leave. That's why Yahoo uses Google search results as a placeholder, but that's just to create more space to promote its own stuff.
Proprietary SE techniques are a Bad Thing(tm) because they obscure the embedded social values in their design.
Yes, this is lots of random thoughts but I think this is an important topic.
Re:Search engines can -always- be improved (Score:2)
I mean, you get the "best" of both worlds.
First off, you get accurate results -- which is awesome.
Secondly, who wants to see an ad for viagra when you can see an ad for (Gateway | Dell | etc) which you might actually be interested in? It's like watching commercials on TV. Do you guys really want to see tampax commercials, or do you pay more attention to things like Frys/Comp USA/Best Buy/Computer stuff?
I personally would prefer to see ads tailored more to my interests than to
--
Profusion == No Linux (Score:2)
You can't view Profusion with the Linux version of Netscape (v4.74). Well, you can VIEW it, but you can't do anything after that because it messes up the browser (Linux Netscape users are used to this on sites with amateurish HTML). I'll be intrigued when they take pride in their work and hire authors who can write browser-nonspecific HTML. For now I'll just yawn and use the ever more annoying Google.
I like using Oingo [oingo.com]. Honestly, the hits it turns up are not always great, and it is tricky to use, but it is not a "stupid" search engine. It attempts some level of ontologizing (ontologization?)... and it's about time someone tried that. There are many times when I can't get a decent hit via one of the old-fashioned search engines like Google or AV, but Oingo will find plenty, mostly for searches that require more than just "keywords"... where I'm searching for a related concept or idea and not just a "token".
To answer the question that was asked, (Score:2)
This really is not related to the security-by-obscurity issue, I think. It involves security in the sense of "keep your passwords secret", not in the sense of "build a system with no bugs", which seems to be where security-by-obscurity fails. The issue here is that rules which people could know about and not be able to exploit might well be significantly inferior to rules which are exploitable.
I suppose that someone who wanted to find out an answer to this could try to get a grant to set up a search engine with a public rating system, and see what came of it. If we can come up with a reasonable metric for the signal-to-noise ratio which resulted from searches, we could find out what really works. By the way, I suspect that one problem with such a proposal would be that no one will bother gaming the system unless it is REALLY popular, and delivers loads of hits. That is, I don't think that this could be done on a small scale: go Google-size, or you won't get any data that applies to the big engines. But you could find out how alternative rules work, with a site that was below the radar screens of the home shopping network and the porn queens.
I have a question for you CS grad students: is there any academic literature on this issue, looking at how people react to this sort of structure, and how the structure must be designed to get them to react in a way which doesn't screw things up? This seems to relate to the information theory field of economics.
Re:The Bugaboo is Relevancy (Score:2)
Maybe the example was bad, but my point was simply that a simple text search cannot show the intention of the user. I am quite aware that you can use boolean searches, or specify additional terms, to get better relevancy. I work on three websites at the moment, and all of them run search services.
Specifying additional terms does indeed narrow down the results, but I am sure that if you think about it, you can come up with two or three text strings that might occur together but in different contexts.
Re:some SE's and web-pimps already do this (Score:2)
Check out the "cost to advertiser" link under the top 'N' results on GoTo.com...
usually it is just a couple of cents, but I've seen some sites pay more than a US dollar for your click-through.
Makes me wonder if they knew what they were doing when they submitted their bid...
hey--what's the difference between 2 and
---
Interested in the Colorado Lottery?
Re:Ranked by referring pages (Score:3)
The classic algorithm of this type is called HITS [cornell.edu], by J. Kleinberg.
IBM's 'Clever' [ibm.com] is an enhancement to 'HITS'.
Part of the success of these is that they can be mapped onto well-known matrix-solving problems... there's enough information in the documents above for you to work out how to write one.
One wrinkle Restil doesn't mention is that the technique is not purely based around link structure. You _seed_ the process with content-ranked pages (hoping the process 'crawls' to the best set independently of the seed), and subsequently you may select the most relevant 'communities' of pages by content ranking. So if you are already in the top 100, say, you may be able to content-mangle yourself up the list, but you need good linkages to get in first!
A further criterion used is response time (I strongly suspect Google uses this; I got hooked on it when I found that its sites _responded_ rather than hanging, as most AltaVista sites did at the time). Again, there are publications on this stuff: the shark search algorithm [scu.edu.au] is a spider with this feature.
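As a hedged sketch of how a response-time criterion might be folded into a ranking (the 0.3 weight and the 2-second cutoff are arbitrary assumptions for illustration, not anything Google or the shark-search authors have published):

```python
def combined_score(relevance, response_ms, latency_weight=0.3):
    """Blend a relevance score with a responsiveness score.

    The weight and the 2000 ms cutoff are invented for this sketch.
    """
    # Map latency onto a 0..1 "speed" score: 0 ms -> 1.0, >= 2000 ms -> 0.0.
    speed = max(0.0, 1.0 - response_ms / 2000.0)
    return (1.0 - latency_weight) * relevance + latency_weight * speed

# Two equally relevant pages: the snappy one outranks the one that hangs.
fast = combined_score(0.8, response_ms=100)
slow = combined_score(0.8, response_ms=1900)
```

The effect matches the anecdote above: with relevance held equal, a site that responds promptly floats above one that leaves the user waiting.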
Some bad things about dmoz.org (Score:2)
The primary reason why it sucks is exactly that it doesn't have moderation... :-) They have a problem that in some categories lots of spammers sign up only for self promotion, and their response is to reject 90% or so of those signing up. Instead, what they should do is to make sure that no individual has too much power, including the meta editors (who are not always awfully clued).
The reason why it never gets corrected is that writing a comment like I am doing now is considered "disloyal" to the directory.
I have a bunch of ideas for a better web directory. In the meantime, I'm thinking about a classification for skeptical resources [skepsis.no].
Re:Profusion == No Linux (Score:2)
Not at the local library any more... (Score:2)
Actually, I know a bunch of research librarians who have been snapped up by a new company out on Rt128 which sells their services for doing web hunts....
----------------------------------------------
Re:What about a moderated search engine ? (Score:3)
You don't want the search to return the "best" sites! That's not the point. You want the search to return the most appropriate sites.
If I search for "redhat reviews", I'll get both "redhat.com" and "joe's linux distro reviews" on GeoCities.
Now, redhat is obviously going to have been historically rated higher than Joe's site, because it gets more traffic and is probably on the whole a much more useful and informative site. More people will have been happy with redhat.com as a search result than they have been with Joe's.
So should redhat.com be at the top of the list, and Joe's site at the bottom? No! Because if I happen to be looking for an objective review, I don't care what redhat has to say -- I want to know what Joe thinks about the relative merits of redhat and debian. Redhat.com is NOT an appropriate site for a "redhat reviews" search even though it matches the terms and is a highly ranked site.
So search results must be a function of both the site and the search terms, and moderation has to be based on this. This is a very nasty can of worms, because interpreting what the user wanted when he typed in the search is subjective. Doing a simple moderation on the intersection of the search terms and the desired result probably isn't feasible either, since freeform searches aren't discrete enough, there are too many possible ways to phrase the search for distro reviews. Determining what he wants and returning the best results based on previous moderation amounts to full blown natural language parsing and artificial intelligence.
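One way to sketch the parent's point (all site names and votes below are hypothetical): store moderation keyed by the (query, result) pair rather than by the result alone, so a site's rating depends on what the searcher actually asked for:

```python
from collections import defaultdict

# (query, url) -> list of user votes (+1 relevant, -1 not relevant)
ratings = defaultdict(list)

def rate(query, url, vote):
    """Record one user's verdict on a result for a specific query."""
    ratings[(query.lower(), url)].append(vote)

def score(query, url):
    """Average vote for this result under this query; 0.0 if unrated."""
    votes = ratings[(query.lower(), url)]
    return sum(votes) / len(votes) if votes else 0.0

# Hypothetical votes echoing the example above:
rate("redhat reviews", "redhat.com", -1)             # marketing, not a review
rate("redhat reviews", "joes-reviews.example", +1)   # actually a review
rate("redhat", "redhat.com", +1)                     # fine for the plain query
```

Under this scheme redhat.com ranks well for "redhat" but poorly for "redhat reviews", which is the behavior the parent wants; the hard part the parent identifies, of course, is that freeform queries rarely repeat exactly, so real systems would have to generalize across phrasings.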
Re:Google: The Criteria Aren't Exploitable (Score:2)
--
Build a man a fire, and he's warm for a day.
Re:profusion isn't so good (Score:2)
Try searching for 'babelfish translator' with PHRASE mode, and it comes up as number one.