The Internet

Is the Internet Becoming Unsearchable? 313

Posted by Cliff
from the even-terrabyte-dBs-may-not-be-enough dept.
wergild asks: "With more and more sites going to a database driven design, and most search engines not indexing anything that contains a query string in it, we're missing a lot of content. I've also heard that some search engines won't index certain extensions like php3 or phtml. Is anything being done about this? How can you use dynamic, database driven content and still get it indexed into the major search engines?" Is keyword searching obsolete? Do you think it's time to index sites by the type of content they carry rather than the content itself? Will larger indexing databases (or a series of smaller, decentralized ones) help?
This discussion has been archived. No new comments can be posted.

  • by Anonymous Coward
    The one constant throughout the history of Internet technology is that directory entries (Open Directory, Yahoo) with structured, categorized results are and always will be superior to free text search for anything that isn't completely obscure.
  • I think we've actually hit another period, technologically, where we're advancing too fast for active standards on "how things should be done" to make things like searching pages/web databases/etc. an accessible, easy thing. It's probably going to take a while... it seems like every month they come out with a new way of doing things, a new "language that's going to change the world!", a new proprietary language/program for corps to use. Until that dwindles, for whatever reason, the web is going to continue to be behind in terms of searchability.

  • One idea would be to have a centralized authority direct individual machines to scan for sites, à la distributed.net, to expand existing databases. Would this be possible?
  • by Anonymous Coward on Tuesday December 14, 1999 @06:05AM (#1465996)
    ...to "force" search engines to search certain pages. Currently, you can only tell a searchbot to "piss off". There is no way to tell a searchbot "hey!!!! come look at this...."


  • by Anonymous Coward
    What if we just have a standard search interface that can be built in to any DB driven website....say it returns XML or WDDX or something. So now when the search engines hit a DB driven site, it goes ahead and creates an index through this interface. I guess like a DNS zone transfer.....hmmmm...
  • I think the effort many of us put in to make sure that the relevant site info is indexed by the engines makes up for it. Many of my sites include special pages that only the search engines get, to increase their "relevance" in the engines' databases. What does pose a problem is where people are totally abusing the indexing methods to get their site promoted in searches where it shouldn't be. I don't see that anything can be done about that (and efforts by some engines, including ignoring meta tags, etc., are quite annoying).
  • My company created lots of dynamic sites with dynamic content - without the use of different extensions or URLs that contain query strings. Apache is awesome (in case you haven't heard)! Almost all of our HTML files actually contain embedded TCL code, so the servers are configured to parse every *.html file - allowing us to use the *.html extension for files that have dynamic content. We also use things like mod_rewrite to send requests to a single file, passing along data that tells the file what to use and how to behave. We could have an entire range of sites served out by a single file... even making it look like they have their own directories, when in reality they don't exist.
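
A rough sketch of the idea behind that kind of rewriting, in Python rather than Apache configuration: a static-looking path is matched against a pattern and handed to a single handler along with the values pulled out of the path. The URL layout, the `/handler` name, and the parameter names are all assumptions made up for illustration; none of them come from the post.

```python
import re

# Hypothetical layout: /sites/<site>/<page>.html is really served by one handler.
# The pattern and the parameter names are illustrative only.
CLEAN_URL = re.compile(r"^/sites/(?P<site>[a-z0-9-]+)/(?P<page>[a-z0-9-]+)\.html$")


def rewrite(path):
    """Translate a static-looking path into (handler, params), or None if no rule matches."""
    match = CLEAN_URL.match(path)
    if match is None:
        return None
    return "/handler", {"site": match.group("site"), "page": match.group("page")}


print(rewrite("/sites/acme/products.html"))
```

A spider only ever sees the `.html` path with no query string, so the usual "skip dynamic URLs" filters do not apply to it.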

  • We've been running across problems related to this in my office (a web design/hosting/advert firm) and, while I'd like to see non-database driven searching of the Internet continue, I have to say that perhaps most people would rather have the database. The many web design clients who expect that once they have a web site they won't have to advertise in print ever again are driving the whole thing toward the database method... creating the problem they so love to bitch about.

    Perhaps doing away with keywords entirely, getting search engines to look at the content instead of the "false content" of meta tags... now that would be nice.
  • Once one site is found on a certain topic, it will often be linked to many others; you can find a lot of information this way. I only use search engines in extreme cases.
  • We had a client once who wanted keywords inserted dynamically into the metatags on his webpage based on query results because he read once that search engines index pages based on the tags. Nothing we could say would convince him what was wrong with that picture.

    Is it even possible to index dynamic pages? They don't really exist until the page is generated. Perhaps the best thing to do for sites that want to be indexed is to make sure they have a plain, vanilla index.html page that contains relevant keywords?

    Dana
  • First hit google. Then metacrawler. Then try it as a phrase, then add "-" terms to filter useless results. After that ask jeeves, then the imdb, ubl, mp3.com, amazo^H^H^H^H^Hbarnes&noble online, the manufacturer's sites, then give up and ask someone for it on alt.binaries.whatever.

    Short answer, yes. Long answer->I'll find it if given an afternoon or two.


    mcrandello@my-deja.com
    rschaar{at}pegasus.cc.ucf.edu if it's important.
  • by Anonymous Coward
    As is, search engines just index raw HTML, with no regard for the actual content of the pages. Perhaps as XML and related stuff begins to proliferate, the indexers of the future will begin to use the extra markup to deduce things about the data that are relevant to the searchers. Certainly it needs to be rethought, because as it is, it's crappier than even searching for text in Emacs. Think of the internet as the world's largest text file and you're trying to find things using a simple search in a myopic text editor that can only see 1/100 of the whole document anyway.
  • #1: it's easy to make apache run cgi scripts with any extension you want, so php3 and shtml being ignored shouldn't hurt any site that really wants to be indexed

    #2: technologies like XML may give a standard interface to databases, so that search engines can index databases directly.

    IMO, a much bigger threat to the "searchability" of the internet is the rapidly growing amount of information -- and with it, the amount of misinformation.
  • "How can you use dynamic, database driven content and still get it indexed into the major search engines?"

    One obvious possibility is to generate - using the database - a set of static pages as "targets" for the search engines. This could be done weekly or monthly, for example. Each target page would contain a prominent link to the dynamic database-driven front-end of the website, so that searchers could find the site and then quickly get directly to the main front end. Not particularly elegant, but it seems like a reasonable work-around for the time being. The real solution, in the long run, will involve more sophisticated searching and indexing paradigms.

    What do people think about this approach?
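
A minimal sketch of the static "target page" approach described above, assuming a hypothetical `articles` table with `id`, `title` and `body` columns and using SQLite purely for illustration. The table layout, file naming, and front-end URL are assumptions, not details from the post.

```python
import html
import sqlite3
import tempfile
from pathlib import Path


def export_static_pages(conn, out_dir, front_end_url):
    """Write one static HTML page per row of the (hypothetical) articles table."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    rows = conn.execute("SELECT id, title, body FROM articles ORDER BY id")
    for article_id, title, body in rows:
        page = (
            "<html><head><title>{t}</title></head><body>\n"
            '<p><a href="{u}">Go to the live, database-driven site</a></p>\n'
            "<h1>{t}</h1>\n<p>{b}</p>\n</body></html>\n"
        ).format(
            t=html.escape(title),
            u=html.escape(front_end_url, quote=True),
            b=html.escape(body),
        )
        target = out / "article-{}.html".format(article_id)
        target.write_text(page, encoding="utf-8")
        written.append(target)
    return written


def demo():
    """Build a throwaway database, export it, and return (file name, contents) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
    conn.execute(
        "INSERT INTO articles (title, body) VALUES (?, ?)",
        ("Hello & welcome", "First post."),
    )
    with tempfile.TemporaryDirectory() as tmp:
        paths = export_static_pages(conn, tmp, "http://example.com/index.cgi")
        return [(p.name, p.read_text(encoding="utf-8")) for p in paths]
```

Run from a weekly or monthly scheduled job, something like this would regenerate the targets; each page carries the prominent link back to the dynamic front end that the comment asks for.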

  • You can tweak Apache to parse documents ending in .html with PHP3. You could use .html for generated content and .htm for static pages.
  • by GaspodeTheWonderDog (40464) on Tuesday December 14, 1999 @06:09AM (#1466012)
    Yeah, give me a minute to back that statement up. :)

    Honestly though. With something that is inherently dynamic like the internet, it is already near impossible to catalogue and make it searchable. Just to illustrate this take any given news site. Today they might have articles about Clinton, tomorrow it might be news about a big fire. Search engines can't just direct you to those sites based on queries because who knows what data they have.

    Even if a search engine was able to validate the content on every site before it gave you the url it could still change by the time you actually got to see it.

    So quite literally there isn't even a clue of a way to catalogue a database generated web site. Now granted I know there are plenty of sites like Slashdot where the 'content' eventually settles down and becomes static. Still, how are you going to get some stupid program to verify and validate that for *every* dynamically generated web page? I don't think you can.

    The web was created to be open and dynamic and it will stay that way. I've heard people say that maybe there should be *more* interoperability between things like search engines and spiders. This in my mind would do more damage.

    Besides is it so bad that spiders don't get these pages? It probably isn't even reasonable because it would add that much more complexity to the search engine to catalogue what it finds. How do you rank content?

    Anyway... just my 2 cents or so...

  • Not only is multiple-site searching becoming more difficult, but single-site searches are as well.

    Now most content is stored in a SQL database. While it is fairly easy to search an SQL database, returning the information in usable form is not. This is especially true once you have many types of tables containing many different types of information.

    Currently, the search engine on the site I work on has its own built-in forms for information from each type of table, but this method takes a lot of maintenance.

    Another possible way is to point to the page (php3, asp, .pl, .cgi, etc.) which generated the information. But this only works if arguments are not required.

    It is about time someone developed some technology to do "smart searches" of sql data and return useful information without having to write a template for each and every type of data that might be queried.

    I might be off my rocker a little bit on this, but I cannot believe I am the only one experiencing these problems.

    -Pete
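
One way to avoid a hand-written template per table is to ask the database for its own schema and search every text column generically. The sketch below does this with SQLite, chosen only because it ships with Python; the table and column names in the demo are made up. It is a naive substring search, not the "smart search" the comment wishes for, and it assumes table and column names contain no double quotes.

```python
import sqlite3


def search_all_tables(conn, term):
    """Substring-search every text-like column of every table; return (table, column, rowid, value)."""
    hits = []
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
        )
    ]
    for table in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
        columns = conn.execute('PRAGMA table_info("{}")'.format(table)).fetchall()
        text_cols = [
            col[1] for col in columns
            if "CHAR" in col[2].upper() or "TEXT" in col[2].upper()
        ]
        for col in text_cols:
            query = 'SELECT rowid, "{c}" FROM "{t}" WHERE "{c}" LIKE ?'.format(c=col, t=table)
            for rowid, value in conn.execute(query, ("%" + term + "%",)):
                hits.append((table, col, rowid, value))
    return hits


def demo_database():
    """Two unrelated, hypothetical tables to search across."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
    conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name VARCHAR(50), price REAL)")
    conn.execute(
        "INSERT INTO articles (title, body) VALUES (?, ?)",
        ("Widget review", "The blue widget works well."),
    )
    conn.execute("INSERT INTO products (name, price) VALUES (?, ?)", ("Blue widget", 9.99))
    return conn
```

Each hit names the table and column it came from, so a single generic result page can say where a match was found even when the tables have nothing in common.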
  • Best thing to do is to create static versions of dynamic content that you want to index (like articles etc.) and use scripting to divert non-robots to a dynamic version.

    You can also make those static pages keyword and meta tag heavy without affecting the user experience.
  • I was beginning to think this as well - Yahoo, Infoseek, Hotbot and the like just don't seem to find the good stuff anymore. If there's content held in a database and a page is generated on demand by an active server page or CGI script for example then the page doesn't come into existence until the user requests the information.

    Perhaps it's time for search engines to search by topic and direct to a site related to the enquiry. The individual sites could then have their own search utilities to trawl through their databases? Not sure if this is feasible or not though.

    In terms of good search engines though - Google [google.com] and AllTheWeb.com [alltheweb.com] seem to find good content whenever I use them. The problem I guess is that you don't know what you're missing until you find it by some other means, and neither do the search engines.
  • META tags and PICS-like protocols should be used. However, WYSIWYG editors don't help much, since META tags aren't a visual part of the HTML document.

    I also find that self-registered index sites (like WebRings) can be useful. Maybe a search engine for WebRings (e.g. look up 'Elbereth' on a Tolkien WebRing) would be useful (I have to check whether there is one already).

    Personally, I use specialized index sites (like NewHoo, Linux Life or Freshmeat) when I'm looking for something. Those sites will just have more value in the future, IMHO.

  • One of my biggest gripes with the current scheme is that so many people abuse it. I once searched for "XOR gate" and pulled up porn sites. If sites could be indexed based on content type, perhaps this wouldn't be such a problem. Currently, these jokers dump the entire dictionary in a meta tag and waste everyone's time by throwing off keyword searches.

    The sheer volume of websites out there makes effective searches difficult. I imagine a search engine could be tuned for better results, but will people be willing to wait while it crunches through data longer than a shoddy counterpart?
  • ok how about this. one site sends out a bot. this bot hangs out on the remote machine, collecting info, or filling up its bag with data, then returns to report its findings. this way, the local remote site can then have rules on what is public info, info for a particular client, or not searchable at all, private. I've been thinking about this lately as a kind of shopping bot. there are lots of potentials, and lots of security issues...anyone for an apache mod?
  • by Anonymous Coward
    Since most dynamic sites provide their own internal search engines, it seems that a standard Search Engine to Search Engine protocol could help ease this problem.
  • It seems to me that if anything, the internet is MORE searchable than it used to be. I remember some statistic about how a couple of years ago the few search engines that were around only got some small percentage of the web covered anyway. These days it seems the search engines do a better job, and there are a zillion more search engines and also tools that let you search multiple search engines at once. That and the fact that there is just plain a lot more stuff on the net. Back a few years ago, if you searched for Cervantes, the author of Don Quixote, you might find a page or two on some college webpage somewhere, if you were lucky. These days there are enough pages out there that you're bound to find at least one of them that's halfway decent. Anyway, to summarize, keyword searching still seems to work for me. I think that the only way it will get considerably better is when true artificial intelligence is possible. That way, when you ask the computer to find something, it is actually smart and goes out and finds it like a real person. However, it seems to me that true artificial intelligence is a way off....
  • Today, there are two methods used when a site is added to a search engine database. The first relies on information submitted by the site, the second relies on information (e.g. keyword fields) found by crawlers. As more sites switch to dynamic content, the sites offer no easy way for a crawler to find information about content. This could be solved by developing some method for storage and retrieval of the data. For an example, look at how the "robots.txt" [w3.org] mechanism works. /Joakim Crafack
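
For reference, here is a small sketch of how a crawler consults robots.txt, using Python's standard `urllib.robotparser`. The file contents, the bot name `ExampleBot`, and the URLs are made up for illustration. In this example the file only lists paths to stay away from; nothing in it asks a crawler to come and index a page, which is the gap several comments in this thread point at.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: every directive here excludes something.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url):
    """True if a crawler calling itself ExampleBot may fetch the URL."""
    return parser.can_fetch("ExampleBot", url)


print(allowed("http://example.com/index.html"))
print(allowed("http://example.com/cgi-bin/search?q=x"))
```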
  • by jw3 (99683) on Tuesday December 14, 1999 @06:15AM (#1466023) Homepage
    For my own purposes there is no trouble finding information on the Net - Google, AltaVista and a few specialized databases are good enough for me, regardless of whether I'm looking for a pidgin-English dictionary or a protocol for AMV reverse transcriptase. At worst you find a link to an index page with "interesting links" or something. Basic IQ and knowledge of how the search engines work is enough.

    Still, I see a potential threat in information becoming unmanageable and, most of all, in ways of finding information being abused (like using unrelated keywords just to get some visitors). Stanislaw Lem, the Polish SF writer, described this situation in many of his books, starting in the 60s, when no one was even starting to think about such problems. Sooner or later we'll have a large branch of computer science dealing only with searching for information on the Internet; searching services are already available, but they are either incomplete or not evaluated. The latter is the key: Google is the first service I'm aware of that tries to automate evaluation (by counting links pointing to a specific page).

    There was a lot of talk about "Internet agents" a couple of years ago (I remember an article in Scientific American...) - could some good soul explain to me what the situation is now?

    Regards,

    January
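
The link counting mentioned in the comment above can be sketched in a few lines. This is plain inbound-link counting over a made-up link graph, offered only as an illustration of the idea as the commenter describes it; it is not a description of how Google actually ranks pages.

```python
from collections import Counter


def count_inbound_links(link_graph):
    """Map each page to the number of distinct other pages that link to it."""
    counts = Counter()
    for source, targets in link_graph.items():
        for target in set(targets):  # a page linking twice still counts once
            if target != source:     # ignore self-links
                counts[target] += 1
    return counts


# Hypothetical pages and links.
GRAPH = {
    "a.html": ["b.html", "c.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["c.html", "a.html"],
}
print(count_inbound_links(GRAPH).most_common())
```

Pages with more inbound links sort first, which gives a crude quality signal that does not depend on the keywords a page claims for itself.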

  • I used to make a decent living as an Information Broker - basically, a trained database searcher for hire. Along came the net, and suddenly everyone with a modem could search for themselves. So I wrapped my shingle up, and stored it away.

    These days, there is so much junk and bad indexing, that I may as well put the shingle back out. Almost any search will find mostly commercial sites, unrelated to the search, or completely useless garbage.

    You almost have to be in a bizarre frame of mind to create a good search term these days.

    Mark Edwards [mailto]
    Proof of Sanity Forged Upon Request
  • This is probably the simplest solution. Just use an 'insteadof' tag in robots.txt to redirect the 'bot to a meaningful page. As is, sites are using hidden pages gobbed full of metatags, relevant text, etc. Point the bot there! You could perhaps just use a space delimited section of the database, and include some standard for how the searchbot indexes the new 'insteadof'.
    'Karma, Karma, Karma Chameleon' -- Boy George
  • by slashdot-me (40891) on Tuesday December 14, 1999 @06:16AM (#1466026)
    I've done some work on a spider and these are the types of pages I spider:
    html htm asp php shtml php3
    I guess I'll add phtml :)

    Other extensions and urls with query strings are ignored. This is mainly for self defense. There are many, many infinite loops and blackholes on the web and they're hard to avoid. For instance, my spider once got stuck on a server that would return the contents of /index.html for any non-existent path. Also, all links on the homepage were relative (not a bad thing) and one was invalid. The call sequence is below.
    GET /index.html
    found foo/broken.html
    GET /foo/broken.html
    webserver couldn't find path, so returns /index.html
    GET /foo/foo/broken.html
    etc.

    What was the programmer thinking?

    This is just one example of the blackholes that lurk on the web. It was completely unexpected and pretty difficult to detect. What if someone wanted to write a search engine trap? I don't believe there is a simple solution to this problem.

    Ryan
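
The specific trap described above (every unknown path returns the same page, whose relative link then resolves one directory deeper each time) can be defused by remembering a hash of each page body and refusing to follow links from a body already seen. The sketch below simulates that server with stub functions; `fetch` and `extract_links` are hypothetical stand-ins for real HTTP and HTML parsing, and this guards against this one kind of loop, not against a trap that varies its content.

```python
import hashlib
from collections import deque
from urllib.parse import urljoin


def crawl(start_url, fetch, extract_links, max_pages=100):
    """Breadth-first crawl that does not follow links from a body it has already seen."""
    seen_urls = {start_url}
    seen_bodies = set()
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        body = fetch(url)
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if digest in seen_bodies:
            continue  # same content under a new URL: likely a loop, stop here
        seen_bodies.add(digest)
        visited.append(url)
        for href in extract_links(body):
            link = urljoin(url, href)  # resolve relative links against the current URL
            if link not in seen_urls:
                seen_urls.add(link)
                queue.append(link)
    return visited


def trap_fetch(url):
    """Simulated server: returns the index page no matter which path is requested."""
    return '<a href="foo/broken.html">broken</a>'


def trap_links(body):
    """Simulated link extraction for the page above."""
    return ["foo/broken.html"]


print(crawl("http://example.com/index.html", trap_fetch, trap_links))
```

The `max_pages` cap is a second line of defense for traps that the content hash does not catch.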
  • It's true. What we need is a new standard. The thing that bugs me most about search engines is getting multiple hits on a particular site. You could distribute the load of the search to the site itself (have a standard search db that could talk to the search engines) and simply index the general idea of the site through meta tags. Then, using a search engine (which would be more like a distributed database browser), you could browse the individual and up-to-date search dbs of each site.
  • Many companies, especially startups, are turning to using catchy domain names as the way to promote their site and products. Even many non-profits and research groups now register domain names that reflect what they do since many people just type in domain names into their browser - and ironically having a domain name actually helps in being indexed by some search engines; one may debate if this is good or bad, but it's a reality.

    Until there's a standard, the search engines will continue to miss more and more of the sites out there. XML may be the answer to indexing and exchanging data. However, on the bright side, the difficulty of finding data makes censorship much more difficult for the censors - and that's a good thing.
  • Just stop thinking that terabytes are the limit. Get more hardware and more computers. Create petabyte databases. In fact have millions of petabyte locations world wide and create a series of multipetabyte databases that one can use.
    Categories are nice but some (most) sites are personal sites and these sites change quite often in subject matter.
    While the categories are nice we should have a community planned and maintained categorical system along with a plain text search. Have identifier tags that go along with every web site and then have a standalone and a web based version of this program which will allow anyone to create a hierarchical listing of anything according to certain tastes and parameters.
  • I think we are already looking at a two-tiered structure: there are sites (that could be found through standard search engines) and then there are databases/archives inside those sites.

    It is getting more and more so that to find an answer to a somewhat obscure question, I need first to find major sites on the topic, and then do a search through their databases or mailing list archives. I believe this reflects a real-life structuring of the Web and will have to be taken into account by next-generation search engines.

    Kaa
  • I don't think the issue has to do with the ability of centralized search engines to index dynamic pages. I think there is a more fundamental flaw in that idiom.
    The list of problems that exist for centralized search engines goes on and on: dynamic pages (of course), missing/broken/changed links, getting to new pages, and so on.
    What I think could be done is to define a search protocol (perhaps through some kind of search://domain/search+terms method) that is standardized. The global search engines then search by determining the most likely sites to have information for you and querying those sites directly for information. This would fix the problem of broken/missing/changed links being reported, new pages would automatically be available (assuming sites updated their search engines quickly), and if the local search engines are integrated with dynamic page generators (which should be possible) then those pages could be searched too.
    I realize that a lot of work would need to be put into this in order for it to work. A protocol would need to be developed, as well as servers for the protocol. Search engines would have to learn to efficiently decide which sites to query to complete their searches, etc.
    Perhaps a combination of both approaches could yield something even better. All I know is that what is out there right now, well, fails miserably.
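
A toy sketch of the dispatch step in that proposal: the global engine keeps only a coarse list of topics per site, picks the sites whose topics overlap the query, and forwards the query to each site's own search. Everything here is hypothetical, including the site names, the topic sets, and the in-memory "local search"; a real version would need the standardized protocol the comment calls for.

```python
def make_site(name, topics, pages):
    """Build a stub site whose local search is a substring match over its page titles."""
    def local_search(query):
        needle = query.lower()
        return [title for title in pages if needle in title.lower()]
    return {"name": name, "topics": set(topics), "search": local_search}


def federated_search(query, sites):
    """Forward the query only to sites whose declared topics overlap the query terms."""
    terms = set(query.lower().split())
    results = []
    for site in sites:
        if terms & site["topics"]:
            results.extend((site["name"], hit) for hit in site["search"](query))
    return results


SITES = [
    make_site("hardware.example", {"motherboard", "bios"}, ["Motherboard BIOS update guide"]),
    make_site("cooking.example", {"noodles", "recipes"}, ["Quick noodles"]),
]
print(federated_search("motherboard", SITES))
```

Because results come from the site at query time, they reflect whatever the site's database holds at that moment, which is the property the comment is after.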
  • by sufi (39527) on Tuesday December 14, 1999 @06:18AM (#1466033) Homepage
    One way round the search engine missing query URLs is to write static pages for the purpose of submitting them to search engines. There are many clever ways of having truly dynamic sites without the need for long URLs; you just have to put some effort into it.

    Search engines not picking up on php3 is a bit worrying though, all my sites are written purely in php3, although I never seem to have any problems with getting listed.

    Gateway pages are a good way of making sure you get listed with the keywords you want, although they aren't very dynamic and unless you get really clever don't tend to reflect the contents of a regularly updated site... however it seems to me that you can only really hope for *a* listing these days, not an index of all of your site.

    Even Google has a 3 month disclaimer on its submit page, that's a mighty long time if you are looking for support on a brand new motherboard.

    LASE seems to be the way to go... subject-specific full text indexes which spider regularly and can index specialised data, keeping it up to date.

    However you would still need a search engine to find a LASE that will get you what you want, but at least it's a bit more structured!

    There are many ways round the search engine problems, and keeping on top of it is a full time job. Submit-it doesn't come close: it hasn't changed in the past 3 years, but search engines have!

    IMO a combination of all of the above will get you where you want. Keywords and Meta Tags still count, and you have to be persistent.
  • In my opinion, part of the fault lies with the browsers, which poorly handle caching dynamic content, regardless of whether it is on a remote webserver or a local drive. I for example am forced to add a useless query string to the end of local file URLs so that all browsers will work. Browsers are notorious for ignoring no-cache pragmas and expiration dates.

    The most common way though people find out about worthy dynamic content sites I think is word of mouth. We could use more forums and link referrals to share websites we have found useful. This has the very distinct advantage over search engines of providing a better filter of QUALITY of information. After reading someone's recommendation of slashdot or an article elsewhere, I won't have to hurdle 19 irrelevant hits to get there.
  • I suspect we'll see the emphasis shifting towards more specialised, manually or semi-manually, compiled indices of websites, complemented by robots searching these manually created indices for new/relevant/expired items of information/people/links.

    And alas, as far as the crap that accounts for 75% of the web goes, the cost of accommodating the vast quantities of crap is less than the cost of removing it or improving ways of avoiding it; and until that is no longer the case we're going to have to put up with ever increasing quantities of sewage.

  • The Open Directory Project [dmoz.org], managed by dmoz.org, is an open source effort to create an organized index of the internet through volunteer work. Currently there are 20,000+ volunteers working on the project. This is a way cool idea that we should all support.
  • by Matt2000 (29624)
    I read a while back that meta data for sites would eventually move to an XML based standard which would accurately describe the content of the site?

    Whatever happened to that? I don't mind all that much being taken to the front page of a site if I know that site has the information somewhere in there, I just hate having to hit seven sites to find that one.

    Hotnutz.com [hotnutz.com]
  • I have been thinking about the working of a search engine lately and this post just comes at the right time.

    Some of the challenges that searching the web will face in the future:

    1. Displaying matching URLs as well as links which match the type of content. This is important. If I search for "throat infection" on a search engine, then apart from the pages which mention "throat infection", the engine should give me a link to drkoop.com, webmd.com (AFAIK, these sites do not allow search engines to copy their content) and so on.
    Search engines will have to maintain huge databases linking words to categories. And with the proliferation of the internet, the number of sites carrying content and disallowing search engines is going to increase. Search engines need an intelligent way to get around this.

    2. Search engines will need to "help" users with their searches. For example, if I just search for "throat" the search engine should have a helper section where it can ask me more... whether I am searching for "throat infection" or "study of throat" and so on.

    3. Search assisted by humans. This is also one of the concepts picking up these days. Basically you submit a question and there will be some person searching the web, and you will get your answer in a few hours/days. Check out www.xpertsite.com.

    4. Tools for better maintenance of bookmarks. I for one usually bookmark all relevant stuff and then I spend a full weekend arranging them so that I can find the relevant stuff from the bookmarks quickly. The current bookmarking scheme is very primitive, causing a lot of users to "reinvent the wheel" (searching for URLs which are already bookmarked).

    Phew!

    I'll jot down more thoughts later. Gotta work now.

    CP
  • by pos (59949)
    I was just about to ASK SLASHDOT about XML [w3.org]. XML will solve the search problem (or at least help make it better). Working drafts of XML have been drawn up by the W3 Consortium [w3.org], and XLINK [brown.edu], XSL [w3.org], etc. are coming. There are almost no XML applications available yet though!!!!! Most of what is available is in Java. This is a field where Linux could be leading the pack, but is instead an example where I think we are lagging behind. (I hope someone can point me to a group that is bringing XML deep into the Linux OS.)

    I want to know if Linux is on top of this. Microsoft has an XML notepad available and I hear that it's going to be all over Win2000 (in the registry even). XML will be the foundation of the new internet and we don't want Microsoft to have a technology edge there, do we? Perl has XML modules, as I am sure other languages do too (Python). Let's get some apps written!

    What about Gnome and KDE? This could help make their projects easier. Especially KDE, with all of the object similarities between Corba and XML and Object RDBs. All config files could theoretically be stored in XML. We need to push this one, people!

    -pos


    The truth is more important than the facts.
  • XML, as wonderful and dynamic and potentially-world-saving as it is, is far from standardized. In addition to this, the only serious developments I've seen of any semblance of an XML search engine are being developed and marketed to be used by businesses in their intranets because, well, that's where the $$$ is.... So, in short, my opinion is that XML is the only possible savior of exponentially expanding undocumented, unsearchable content; it just needs more attention from the user/consumer community.
  • The problem with dynamic content is that you pretty much have to query the target web servers at the time the user enters the search request.

    One solution that attempts to address this is Apple's Sherlock [apple.com]. It uses XML to pass queries to web sites and return results. There are certainly some limitations: you have to choose which web sites you want to search (although this isn't always a bad thing), these web sites have to support Sherlock queries, and it only works on the MacOS. Currently lots of big name and Apple-specific sites support it.

    The dev info at Apple is pretty clear though. It wouldn't be difficult for others to create clones for Sherlock that either work over a web interface or on other OSes too. (dunno if Apple could...or would... make any claim against this)

    Scott
  • No technology is going to read your mind - you're limited by language, and that can be interpreted and misused in multiple ways. This includes search applications (e.g. keywords in porn sites). Word misuse will never stop (ask Plato or Burke) so we're just going to have to deal with it.

    Eventually, the *end user* has to do the information filtering, so you might as well take what you can get FAST so you can move on if you don't see what you need. Indexing every database or dynamic page on the web would slow down engines to a crawl. Do you honestly want Altavista bringing up books from Amazon, companies from the Thomas Register, and patents from the USPTO? There's no need for this. If you want specialized information, go to a specialized source.
  • by charlie (1328) <<gro.epopitna> <ta> <eilrahc>> on Tuesday December 14, 1999 @06:22AM (#1466043) Homepage Journal
    Many years ago (1994? 1993?) I wrote a web spider. (Crap back end, though, so I dropped it. The bones are on my website.)

    Some time later, it occurred to me to try and monitor the efficiency of web indexing tools using a spider trap.

    The methodology is like this:

    • Write a perl module (or equivalent) that generates realistic-looking text using Markov chaining based off a database. Text generated should be deterministic when seeded with a URL.
    • Write a CGI program that uses PATH_INFO to encode additional metainformation. Have it eat the output from the text generator and insert URLs that point back to itself, with additional pathname components appended.
    • If the spider follows a link it will be presented with another page generated by the CGI script, containing text generated by it in response to a hit that differs in a repeatable manner from the text in the original page.
    • Child pages should contain links that point inside the web site; you could do this by making the CGI program the root of your "document tree". Better yet, run multiple virtual servers and include URLs bouncing between the domains -- all of which are mapped onto the same script.
    • Stick this thing up on the web and wait for the crawlers to come. They will see a tree of realistic-looking HTML with internal links, digest, and index it.
    • You can now analyse your logs and monitor the robot's behaviour (e.g. by changing the type, frequency, and destination of links your text includes). You can also search the search engines for references back into your document tree and work up some metrics to measure just how accurately it's been indexed (e.g. by re-generating the text of a page and feeding it to the search engine and seeing what comes back -- which words are indexed and which are ignored).
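
    A minimal sketch of the first two steps, in Python rather than Perl and under stated assumptions: `CORPUS` is a tiny stand-in for the text database, the function names are made up for illustration, and the chain is a simple first-order word model. The point it shows is that seeding the generator from the request path makes every page, and every child link on it, repeatable. A real CGI script would read the path from PATH_INFO and prefix the links with its own script name.

```python
import hashlib
import random

# Tiny stand-in corpus; a real trap would train on a much larger text database.
CORPUS = (
    "the spider follows every link it finds and the index grows "
    "with every page the spider reads and the page links back to the tree"
).split()


def build_chain(words):
    """Map each word to the list of words that follow it in the corpus."""
    chain = {}
    for current, following in zip(words, words[1:]):
        chain.setdefault(current, []).append(following)
    return chain


def seeded_rng(path_info):
    """Derive a repeatable random generator from the request path."""
    digest = hashlib.sha256(path_info.encode("utf-8")).hexdigest()
    return random.Random(int(digest, 16))


def generate_page(path_info, chain, n_words=40, n_links=3):
    """Return HTML whose text and child links depend only on path_info."""
    rng = seeded_rng(path_info)
    word = rng.choice(sorted(chain))
    words = [word]
    for _ in range(n_words - 1):
        # Restart from a random word when the current one has no successor.
        successors = chain.get(word) or sorted(chain)
        word = rng.choice(successors)
        words.append(word)
    base = path_info.rstrip("/")
    # Child links extend the current path, so they lead back into the trap.
    links = [
        f'<a href="{base}/{rng.randrange(10**6)}">more</a>'
        for _ in range(n_links)
    ]
    body = "<p>" + " ".join(words) + "</p>" + " ".join(links)
    return "<html><body>" + body + "</body></html>"
```

    Because the output depends only on the path, the text of any page a spider fetched can be regenerated later and compared against what the search engine actually indexed.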

    Anyone done this? I'm particularly interested in knowing how spiders handle large websites -- have been ever since I was doing a contract job on Hampshire County Council's Hantsweb site a few years ago and caught AltaVista's spider scanning through a 250,000 document web that at the time had only a 64K connection to the outside world. (Do the math! :)

  • I think that, for the most part, the databases are doing their job rather well.

    Where do you find the most dynamic content? News sites. Slashdot, Freshmeat, Linuxtoday, Yahoo! News, etc. These are the sites that need dynamic content.

    Ironically, these are the exact sites that search engines are pretty much not interested in indexing, anyway. Even assuming that a database can update all its sites once per day, that means that the information is a day old-- centuries, in Slashdot time! People don't go to AltaVista to search for the story over at ABCNews.com. They go to AltaVista to find information about international child custody laws (to name a random hot issue of late).

    Most of your general information stuff is pretty much static. This is what the search engines look for anyway-- this is the stuff that doesn't change often, so it's good stuff to record. Why would anybody bother to make a page about Cup 'O Noodles that's generated through a Perl script? It's too tough, and can be a huge pain in the ass to change it.

    Why index the pages that are constantly changing, when the stuff you're looking for (by definition) doesn't change much? Sure, there's overlap (small sites that generate the exact same content every time). But it's such a small segment that hardly anybody would miss it (yes, it may be important, but not important enough to totally revamp the indexing procedure).
  • Is it even possible to index dynamic pages? They don't really exist until the page is generated.

    Yes, for a very large category of dynamic pages, it is. For example, in an online shop, the actual number of a particular product in stock at the moment may vary from minute to minute, the price of that product in the user's preferred currency may change from week to week, but the product itself doesn't change much over months or years. It makes perfect sense to index the product page, because although some of the contained data may be transient, a great deal more is not.

    Or take another example: the weather forecast for a particular area. The forecast itself may change regularly, but the page always contains a current forecast and that fact is worth indexing. The best technology available for this sort of thing is probably RDF [w3.org] and the Dublin Core [purl.org] metadata specification. Of course, the search engines still have to be persuaded to take heed of this...
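
    As a rough illustration of the metadata idea, here is a sketch that emits Dublin Core elements as HTML `<meta>` tags using the common `DC.` name prefix. This is a simpler stand-in for full RDF, not the RDF syntax itself, and the function name and sample values are assumptions made up for the example.

```python
from html import escape


def dublin_core_meta(fields):
    """Render a dict of Dublin Core element names and values as <meta> tags."""
    lines = []
    for name, value in fields.items():
        # Escape both parts so quotes in the data cannot break the attribute.
        lines.append(
            f'<meta name="DC.{escape(name)}" content="{escape(value)}">'
        )
    return "\n".join(lines)


print(dublin_core_meta({
    "title": "Current weather forecast for an example region",
    "subject": "weather; forecast",
    "type": "Text",
}))
```

    The transient forecast text changes, but these tags would stay the same from day to day, which is the part worth indexing.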

  • There is nothing to stop the web server from sending different pages, depending on whether it's a regular user or a web crawler.

    Therefore, it would be entirely feasible to have a system in which regular users saw regular pages and web crawlers saw a "static" index page, all at the same URL.

    This would allow web crawlers to index according to genuinely useful keywords, rather than by how the crawler's writer decided to determine them.

    An alternative approach would be to distribute the keyword database. Since all the web servers have the pages in databases of one sort or another, it should be possible to do a "live" distributed query across all of them, to see what URLs are turned up.

    This would be a lot more computer-intensive, and would seriously bog down a lot of networks & web servers, but you'd never run into the "dead link" syndrome, either, where a search engine turns up references to pages which have long since ceased to be.

  • That's the problem. Obscure is all I ever deal with. You are right about hierarchies, however. miningco [miningco.com] is a pretty nice one as well. Oh, and what became of newhoo [dmoz.org] also... Just thought I'd point them out because I had almost completely forgotten about dmoz, and I keep stumbling across miningco pages from other searches, and they seem worth mentioning...


    mcrandello@my-deja.com
    rschaar{at}pegasus.cc.ucf.edu if it's important.
  • by Zigg (64962) on Tuesday December 14, 1999 @06:23AM (#1466048)

    I think the real problem with searching really isn't that the Internet is growing too large. The central problem with it being too hard to find information is due to the unfortunately ever-changing nature of HTML. (Yes, I know there are much better solutions out there -- I work with some of them on a daily basis. However, we seem to presently be stuck with HTML and its variants.)

    It's a self-feeding monster, whose typical cycle goes as follows: SearchEngineInc (a division of ConHugeCo) creates a new technology that really impresses people with its ability to find what they want more quickly. (Right now SearchEngineInc is probably Google [google.com], at least in my view.)

    Once the new technology takes root, content authors (well, maybe not the authors so much as their PHBs) note that SearchEngineInc doesn't bring their business (which sells soybean derivatives) to the top of the search list (when people type ``food'' into the search engine). Said PHBs make the techies work around this ``problem'', and all of a sudden SearchEngineInc's technology isn't so great anymore because the HTML landscape it maps has changed.

    A similar situation occurs when PHBs think their site doesn't ``look'' quite as good as others. (Insert my usual rant about content vs. presentation here.) Whether via a hideous HTML-abusing web authoring program, or via all sorts of hacks that God never intended to appear in anything resembling SGML, the HTML landscape is changed there as well, and SearchEngineInc's product becomes less effective.

    What's the solution to this? I'm not quite sure. Obviously there are better technologies out there that are at least immune to PHBs' sense of ``aesthetics'' but I would wager few of them are immune from hackery. I'd say that search engine authors are doomed for all time to stay just one step ahead of the web wranglers. At least it assures them that their market segment won't go away any time soon. :-)

  • I have to say, yes. I believe that with the way the internet is growing, it's difficult to keep up with new pages and new technology. I know there have been several times I have done searches only to turn up nothing when I KNOW it's there or to turn up too much which pertains to nothing I'm looking for. Most of the more mainstream search engines have become obsolete, I'm afraid. Many of them use methods that just simply aren't practical like searching for certain words in the text of a page. When you search for things like that your searches will not be accurate and often you'll get information you don't really want or need.

    So, I believe the internet is outgrowing the current search engine technology.
    -- Shadowcat
  • I think it is just the plain and simple truth that the searching algorithms all of the search engines use currently are not suitable for the task. I will perform searches on what I think are pretty obscure terms and return >10,000 hits on some of these search engines. Of course, none of them mean anything to me.

    I'm not saying that this problem won't be figured out at some point. It's going to take a little more technology than we have right now, but no doubt it's on its way even as we speak. (Any AI experts out there? :)

    Until then, indexing by hand seems to be the only 100% solution. Humans are fallible, but much less than the machines are at this present stage. Plus, directories geared towards specific topics would help narrow down your search before you even start searching.
  • There is no excuse for having a purely database-driven website that does not appear to be straight HTML pages. If you have ?s everywhere then you're just lazy.

    Firstly, even though you might pull everything out of a database, a large per cent of all such content is not really all that dynamic, which means you're probably better off precompiling the base down into static HTML, and recompile the page only when its content changes.

    Secondly, if you have a script with a messy query string you can turn it into something that doesn't look like a script at all, e.g., /cgi-bin/script.cgi?foo=bar&this=that could be presented as /snap/foo/bar/this/that.

    With Apache, you would just define and pass it off to a handler, that would pick up the parameters in the PATH_INFO environment variable. If people tried URL surgery, you could just return a 404 if the args made no sense.
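
    The handler side of that might look something like the sketch below. It assumes the path alternates key/value segments as in the /snap/foo/bar/this/that example; `ALLOWED_KEYS` and the function name are hypothetical, and the web server wiring is left out.

```python
from urllib.parse import unquote

ALLOWED_KEYS = {"foo", "this"}  # hypothetical parameter names for this script


def params_from_path(path_info):
    """Turn '/foo/bar/this/that' into {'foo': 'bar', 'this': 'that'}.

    Returns None when the path cannot be split into known key/value pairs,
    which the caller would translate into a 404 response.
    """
    # Split first, then decode, so an encoded slash cannot add segments.
    parts = [unquote(p) for p in path_info.strip("/").split("/") if p]
    if not parts or len(parts) % 2 != 0:
        return None
    params = dict(zip(parts[0::2], parts[1::2]))
    # Reject repeated keys and keys the script does not know about.
    if len(params) != len(parts) // 2 or not set(params) <= ALLOWED_KEYS:
        return None
    return params
```

    Anything that fails the checks is treated as URL surgery and gets the 404.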

    Search engines are your best (and probably only) hope of getting people in to visit your site. It's up to you to make sure your URIs are search-engine friendly. If they can't be bothered to index what looks like a CGI script, well that is your problem. There are more than enough pages elsewhere for them to crawl over and index without bothering with yours.

  • So then you put invisible content in the page instead. Same result.

    There will always be a way to "fool" indexing robots if you're creative enough.

  • by Lord Kano (13027) on Tuesday December 14, 1999 @06:24AM (#1466053) Homepage Journal
    It disturbs me that so many pron sites have hidden in their html code (and sometimes not even hidden) huge lists of adult film stars just to get hits from search engines.

    If you do a search for Cortknee or Lotta Top you'll get a bazillion hits and 90%+ of them are "Click here to see young virgins having sex for the first time on their 18th birthday!"

    As we all know, but nobody likes to admit, pron is the fuel that makes the net go 'round.

    Many other sites have taken hints from the pron people. I'm sure that it was a deal of some sort, but every time I do a search on metacrawler for anything, I get a link to search a certain bookstore for books on the same topic.

    Commercialism and shady practices are what are making the net so hard to search.

    LK
  • The distributed idea is interesting, but I think the problem lies not in the actual power and bandwidth of the search, but more in what exactly we are searching for. Much like the article says, I've noticed myself that many of the search engines today don't find exactly what I'm looking for and I myself am still stuck sifting through their results.

    I think the search engine community needs a paradigm shift in their way of approaching searches now, with the curve dynamic information has thrown at them. I don't know how well standards would work in this situation. It's up to the search engines to come up with a new way of sorting the huge amounts of data they collect in an orderly fashion so they can serve us searchers with exactly what we ask for. Ok, so "exactly" is probably stretching it a bit, but I'll settle for pretty damn close.
  • Shouldn't you index all files with MIME type text/html or text/plain, regardless of file extension?

    Just curious.
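
    For what it's worth, the check being asked about is short. A sketch, assuming the crawler already has the Content-Type response header as a string (the function and constant names are made up):

```python
INDEXABLE_TYPES = {"text/html", "text/plain"}


def is_indexable(content_type_header):
    """Decide from the Content-Type header alone, ignoring the URL extension."""
    # Drop parameters such as "; charset=..." and normalise the case.
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type in INDEXABLE_TYPES
```

    With this, a .php3 or .phtml URL that answers with text/html is treated exactly like a .html one.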
  • by Pollux (102520) <speter@tedat[ ]et.eg ['a.n' in gap]> on Tuesday December 14, 1999 @06:30AM (#1466058) Journal
    Okay, I just got done with my research paper for college last week, and although I can pull a paper out of some orifice of my body, researching is always a pain.

    Our library has a wonderful online database where you can type in keywords and search for them, but the keywords only look as far as the Title, Author, or abstract of the book. If you want to look up some narrow topic, you can't expect that there are books written exactly on that topic, but there are always bound to be a few books out there that have a few pages dedicated to that subject (which isn't listed in the abstract). So, what do you do? You have to get your hands dirty.

    My topic: Holy Wisdom (I won't bore you with details, but just stick with the subject). Looking in the online database, I find that there are zero books on the subject. Darn. Let's do some lookin...

    After I read in a few Religion Dictionaries, I find that Holy Wisdom is also called "Sophia." I go back to the catalog, type in "Sophia," and I get one book. I skim this one book, and find that Sophia has sometimes been associated with the Holy Trinity. So, I go back to the catalog, enter "Holy Trinity," and BOOM, I get back 400 results (anyone seeing a similarity here...). Let's limit them...we'll search within the results for "History of," and I get back about 11 results. I read the abstracts, find a few books of interest, and start skimmin...

    ...Well, whadda know, there's a page in one book that talks about Sophia, and half a chapter in another book that talks about Sophia as well. There's a few more sources for the paper!

    Now, for those of you who just don't understand what I'm trying to say here, just read from here on, cause here's my point: Computers aren't smart enough yet to "guess" at what we want, and personally, I don't think they ever will. Internet keyword searches are just like asking someone to help you who has no idea what your topic is...they can only search for what you ask them to search for.

    Internet keyword searches are a hassle, and many times the first few returns won't be anything CLOSE to what you want (search for "Computer Science," you get back porn, search for "Linux," you get back porn, search for "White House,"...). But if you learn how to dig, like the people who lived fifty years ago WITHOUT Boolean Searches, you'll find what you're looking for. Sometimes, it's just like searching for a topic...you might not find anything directly, but you can't sum up an entire book in just a paragraph either!

    Try some links, look around, and it'll be there!
  • Many of my sites are database-driven sites that run on PHP and MySQL. No problem with indexing, and no problem with the file extensions.

    If you can get beyond the backend concept of a dynamic page, most pages really appear to be quite static, from an indexing perspective. An HTTP-based indexing system (as opposed to filesystem-level) can't tell that pages are dynamic, and doesn't care.

    I've never had a problem with search engines failing to index pages just because they had a convoluted URL. If some engines do that, it's a bloody shame.
  • take a look at http://xml.apache.org


  • Dynamic pages don't exist until you click on them in your browser either. Search engines *will* follow links to dynamically generated pages.

    The point is there has to be a link there in the first place. They will not be able to index a dynamic page if it is only accessible through a "form" post.

    The way you can get around this is to have a hidden (to users) page on your site with hardcoded (or database generated) links into the dynamic content that you'd like visible from search engines.

    For example, if you have a whole heap of news articles on your site, with one per page, you can make a dynamic page called "newslinks" which, when requested by a crawler, queries the database and writes links to every news article in the site.
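
    A sketch of such a "newslinks" page, with the database query replaced by a plain list of (id, title) pairs. The /news/<id> URL scheme and the function name are assumptions for illustration only.

```python
from html import escape


def render_newslinks(articles):
    """Build a plain HTML page linking to every article.

    `articles` is an iterable of (article_id, title) pairs, standing in
    for the rows a real database query would return.
    """
    items = [
        f'<li><a href="/news/{int(article_id)}">{escape(title)}</a></li>'
        for article_id, title in articles
    ]
    return "<html><body><ul>\n" + "\n".join(items) + "\n</ul></body></html>"
```

    Every article then has at least one ordinary link pointing at it, which is all a crawler needs to find it.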

    cheers, j.

  • IMHO we're already seeing the advent of meta search engines that do their own search and then do a simultaneous search using other engines. (Yahoo does this, I think, as does lycos/hotbot) That's a great kludge for these engines to extend their reach, but not a real solution.

    I think we'll see more topic-specific search engines (I use trade rag sites exclusively for really good info on tech news, for example) linked together through the big search engines. The main engine (Google, or whatever) will check the search term to see if that term has been pre-linked by the engine managers to generate a search on a more topic-specific engine (for example a search on "market size" may cause the engine to do a lookup on the northpoint search engine) or engines, and then combine the results of its own search with that of the topic-specific engine for relevant results.

    It's the whole idea of vertical portals taken to the next level. The vertical portals provide topic-specific searching capabilities over the 'Net to the behemoth engines and portals for a fee, or something.

    Remember, the user will not get smarter, but will rather look for the faster and easier solution.

    IMHO.
  • by TrueJim (107565) on Tuesday December 14, 1999 @06:30AM (#1466063) Homepage
    I'm going to say a naughty word: artificial intelligence. I'm hoping we soon ( 5 years) get good enough at this "indexing" stuff to create semantic models of Web content rather than purely syntactic models. (Google is a small step in the right direction.) If so, then perhaps dynamic pages can be indexed according to their location (role?) in an "ontology" rather than via the frequency of essentially meaningless character strings. That may sound farfetched, but it seems to me that the Web finally provides a real _financial_ incentive with near-term payoff for that kind of research. Hitherto, the quest has been purely academic. And where there's the lure of a real payoff, stuff often happens quickly (usually -- batteries and flat-screen technologies being notable exceptions).
  • Every Web/FTP server should have a standard, live query engine. Every week or so, indexing sites would query these engines and update their databases, but only at the site level; an end user who wants a specific page would then query the site itself in a second-phase search. [Buen español, bad English]
  • Look up in the sky.. it's a bird, it's a plane, it's web sites dumping their information from hard to index databases to easy to read XML!

    Wasn't this sort of thing what XML and RDF were originally designed for?
  • I think that this is going to force the search engines to focus on sites rather than pages, as a site can be described by keywords even if its subsequent pages are database driven. I like searching by site usually anyway - provided that the site has a nice search engine :)
  • by Simon Brooke (45012) <stillyet@googlemail.com> on Tuesday December 14, 1999 @06:38AM (#1466083) Homepage Journal
    This is a field where Linux could be leading the pack, but is instead an example where I think we are lagging behind. (I hope someone can point me to a group that is bringing XML deep into the linux os)

    Not so, fortunately. A certain very large telco (which I'm not yet allowed to name) is now running its Intranet directory on an XML/XSL application which I've written. The application was developed on Linux and is currently running on Linux, although the customer intends to move it to Solaris.

    My XML intro course is online [weft.co.uk]; it's a little out of date at the moment but will be updated over the next few months.

    XML and particularly RDF [w3.org] do have a lot to offer for search engines - see my other note further up this thread.

  • The answer to all this isn't going to come from making existing engines better, nor is it going to come from bigger, badder, faster database engines powered by your friendly clustering technologies!

    The answer is simple: More specialized search engines. You're looking for technical stuff? Then you should be able to search a technical database. Like, if I'm looking for source code to model fluid flows - that's pretty specific already. There's no reason that I should have to wade through all the references to "bodily fluids" that I'll get on altavista for instance!

    Search engine people, take note of this. Classify your URLs into categories - like Yahoo - but come up with some way to do it automatically. Or even better yet, let the users do it, a la NewHoo [newhoo.com].

    End of internet predicted. Film at 11. We've heard it before, and we'll hear it again. Just need someone with a little VC money to throw it towards an idea that supports more specialization in search engine tech.

    Kudos..

  • I use a free site statistics service to keep track of hits to my web site, where I keep some software that I've written. Looking at the referrer statistics to my site, the vast majority of hits are generated from explicit, categorized links to my site (e.g. bookmark pages and surprisingly Lycos which has a categorized database), and rarely ever from general search engines like Altavista. The questioner may be right - from the perspective of a web site owner, general search engines aren't very effective at bringing visitors to my site.
  • by debrain (29228)
    IIRC, XML was designed to help alleviate this sort of thing. Unfortunately, XML has not been exploited enough to have any significant ramification on the way the internet is sorted.
  • by johnnyb (4816) <jonathan@bartlettpublishing.com> on Tuesday December 14, 1999 @06:42AM (#1466090) Homepage
    Why doesn't anyone use the ScriptAlias directive? It does the same thing as query strings, but makes it look nicer, like the rest of the web. You can "say" you're looking at a directory or a .html file, but in reality you are viewing a single script. For an example go to http://store.wolfram.com/ [wolfram.com]. There are no directories on the server side, it's all served off of one script. Yet, to the user, it appears as a hierarchical directory structure, complete with .html files. The only query string is your session id, which is appended to the URL in case your browser doesn't support cookies (however, these are not there if a robot views the site). Anyway, a simple directive like ScriptAlias can save everyone a lot of trouble. If anyone has questions about its usage, send me an email [mailto].

    Jon
  • by evilpenguin (18720) on Tuesday December 14, 1999 @06:46AM (#1466095)
    I hate to do an "amen to that, brother" post, but I'm going to do so.

    Any reasonable search term is likely to present results like "Search returned 417,373 hits. Hits 1-10 displayed." You have to then winnow by adding include and exclude words until you get it down to a manageable 7,422 hits, then you browse them.

    The truth is, I turn to wide searches quite rarely. I tend to find and "bookmark" authoritative sites I find on a given topic and return to those over and over again. It is only when a site grows noticeably stale or I have to research a new topic that I turn, reluctantly, to search engines. As for indexing database sites, I like the idea of extending the robot hack. Slightly less appealing would be to have a new HTML tag to include "bot content" in any page, including dynamic pages. An XML solution is a good idea, but I wonder how long before every extant site gets XML-aware? That plus XML is almost too flexible, making it likely that a hundred competing methods for indexing dynamic pages will appear and no one will know which one to cling to.
  • Hmm. How about this for an idea:
    when a webbot sees a dynamic page, it changes the query to ?Webbot - and expects to get back a specially formatted page starting <H1>Webbot index</H1> and followed by a set of comma separated keywords, a break, a URL, a paragraph, then the next set? The webbots would be happy, as they don't have to waste bandwidth and CPU time spidering over the site; the server should be happy, as it doesn't have to support the webbot's spidering; and the site owners should be happy, as they can specify what keywords each result will be indexed under. Obviously, just reformatting the index to the product database could generate this page for an ecommerce site, and more static sites could just use a static statement of what their site carries...
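
    One possible reading of that page layout, sketched in Python. The exact markup (`<BR>` for the break, `<P>` for the paragraph) is an interpretation of the description above, not an existing standard, and the function name is made up.

```python
from html import escape


def render_webbot_index(entries):
    """Render (keywords, url) pairs in the proposed '?Webbot' layout."""
    parts = ["<H1>Webbot index</H1>"]
    for keywords, url in entries:
        keyword_line = ", ".join(escape(k) for k in keywords)
        # One set: keywords, a break, the URL, then a paragraph mark.
        parts.append(f"{keyword_line}<BR>{escape(url)}<P>")
    return "\n".join(parts)
```

    For an ecommerce site, `entries` could come straight from the product database index.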
    --
  • by GC (19160)
    For me the best way to search the internet is to go to a site dealing with the context of your query and search that site with its own search engine (which most major sites have).

    It would be nice if a generic search engine worked in the following way:

    1. User searches for say "Cisco VPN Routing"
    2. The search engine identifies sites www.cisco.com and other sites which are related to the search query string.
    3. Rather than trying to index these sites itself, it calls on the search engine at the site matching the context and queries that.
    4. Returns the results of the search at cisco.com to the user.

    It's kind of like a distributedSearch, where the actual search is done by the holder of the data, all that the search engine actually does is try to find a context for the Search Query and find sites with their own search engines that match that context.
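
    The four steps above can be sketched as follows. Everything here is hypothetical: `site_topics` stands in for whatever the central engine knows about each site's context, and `site_engines` holds callables standing in for each site's own search interface.

```python
def federated_search(query, site_engines, site_topics):
    """Route a query to the search engines of sites whose topics match it.

    site_topics maps a site name to a set of lowercase topic words;
    site_engines maps a site name to a callable that runs that site's
    own search and returns a list of result URLs.
    """
    terms = set(query.lower().split())
    results = {}
    for site, topics in site_topics.items():
        # Step 2: a site matches when it shares a word with the query.
        if terms & topics and site in site_engines:
            # Steps 3 and 4: delegate to the site and collect its answer.
            results[site] = site_engines[site](query)
    return results
```

    The central engine only has to be good at matching queries to contexts; the actual search happens where the data lives.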

    So in answer to your question: My answer is No, the Internet isn't unsearchable, we just haven't implemented a reasonable standard for searching, which can be as important as routing when it comes to a network of the size of the Internet.

  • Only really bad db driven web sites use changing content on static URLs - almost every site that runs off a db (take those news sites you mention: cnn, news.com, cnbc, etc) redirects you to a content page whose URL carries either a unique identifier or an embedded date. I see no reason why that can't be indexed - and in fact it is indexed.

    And yes, even slashdot uses this scheme :)
  • I like a snappy search engine response time as much as the next guy, especially when I'm looking for something fairly current or mainstream. But how can you tell a search engine to tread farther off the beaten path?

    For example, a few days ago I was looking for the dip switch settings on an old 14.4k modem. Now I *knew* the info was out there on the web somewhere. I also thought it was highly unlikely to be in any of the major search engines' in-RAM indexes. I would have been quite happy to submit a boolean or reg-ex query to a search engine and then check back an hour later to get the results.

    In my mind, instant gratification search engines are useful and have their place, but I see a whole segment which just doesn't seem to be addressed. Is anybody even thinking about working on this?

    -matt
  • Give anyone the ability to talk directly to search engines and you'll see what has been happening with those damn porn sites on a large scale - do a query for anything, and it'll come up with a totally unrelated porn site for you.


    People figured out how to abuse keywords real quick, and this would just make it worse. Which is why I wonder about the continued existence of search engines. I use /. as my search engine - I use it to index my way into the web every day. I think that's the way of the future.

    PS I hate the G3 keyboards. They're tiny! It's like carpal tunnel syndrome x 5!!!
  • I've had good luck with my search terms. I always use +"this format" for my searches and make sure the phrase is unique enough to bypass keyword-abusers. Google automatically adds the AND operation. I've found altavista and google to give me the best results, but generally the +"technique" has served me well on any search site.
  • A site-specific search service built around the newly-opened E-speak? Damn good idea.. Not only would it provide an easy interface for searchbots, but in the future it could provide information for user-agents and other client-side searches. I'd imagine it wouldn't inflict as much server overhead as the current system.

    I'll be flipping through the 'E-speak tutorial for the rest of the afternoon!

  • Librarians have been indexing things for a long time, and for decades researchers have been trying to make computers do indexing. Applying AI technologies for indexing is nothing new. The challenge is making a computer understand a topic and text well enough to properly index.

    Browse a few relevant papers and find some keywords to search for more of the part of the field in which you are interested:

  • Unfortunately, even Google is fast becoming useless - practically every search I do on it results in thousands of mirrors of some page describing some RPM that I care nothing about.

    Search engines act like Lem's Demon of the Second Order right now - returning lots of information, but very little of it of any use or relevance. I've thought a bit about ways to improve it - say, a perl script that queries half a dozen search engines, and uses the pages that appear in a majority of the results, applying simple rules to them, based on markup (Like giving keywords that appear in a heading element higher priority than those that show up in a paragraph) and the number of other pages in the search area that link to them... Not AI, just a bunch of heuristics. Adding hierarchy-based rules (Ie, page B is only found via a link on page A, and A and B have similar URLs, so B might be a sub-page of A, and shouldn't be considered if A passes everything, because you'll get to it from A anyways) is an interesting possibility if I ever get around to writing something like this. I think the same rules could apply to static and dynamic pages, though. No need to treat them differently, aside from result caching.
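
    The "majority of the results" part of that idea is easy to sketch. This assumes each engine's results have already been fetched and reduced to a list of URLs; the function name is made up, and the markup and link-count heuristics are left out.

```python
from collections import Counter


def majority_results(result_lists):
    """Return URLs that appear in more than half of the engines' results.

    result_lists holds one list of URLs per search engine. URLs are
    ordered by how many engines returned them, then alphabetically.
    """
    votes = Counter()
    for urls in result_lists:
        votes.update(set(urls))  # count each engine at most once per URL
    threshold = len(result_lists) / 2
    winners = [url for url, count in votes.items() if count > threshold]
    return sorted(winners, key=lambda url: (-votes[url], url))
```

    The other rules (heading weight, inbound links, sub-page detection) would then re-rank this smaller list rather than the raw results.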

  • Is it even possible to index dynamic pages? They don't really exist until the page is generated. Perhaps the best thing to do for sites that want to be indexed is to make sure they have a plain, vanilla index.html page that contains relevant keywords?

    It depends on what technology you're using to generate the pages.


    Zope [zope.org] sites for instance, are totally dynamically generated, even those pages that would normally be static. But the entire content of the site that's stored in the ODB is traversable via 'normal' URLs. This means that search engines can easily index your entire site.


    Note, however, that this only works if you've taken care to expose your content via links. If you've deliberately hidden your content behind a search interface (and you can still do this with Zope), then your site will be no more indexable than any other dynamic site.


    --
  • It seems to me that having URLs with extensions of:

    .asp, .php, .phtml, .shtml, .pl, etc.

    is incorrect. What is being served is not an ASP script, nor is it a PHP script, nor is it a Perl program. It is, however, an HTML file (or a GIF, or a PDF, etc.), and should be labelled as such.

    If your server isn't smart enough to figure out how to generate the requested resource, and needs the generating program explicitly mentioned in the URL, then you need a smarter server. And if you aren't smart enough to figure out how to do this correctly, well...=)

    Remember, kids, a URL != a file. All the /. end user cares about is getting an article with the comments formatted appropriately. They don't care[1] if it's stored as a text file, or generated by Perl, or..

    [1] Well, they might care in a geek sense, but not in the way needed to read comments.

  • The web is certainly becoming significantly more difficult to search, especially for informational content. Just -try- searching for information on a musician or an author... you'll get links to the likes of music.com, amazon.com, whatever-your-topic-is.com, with a little one-or-two paragraph blurb about the person, if you're lucky. Hundreds of links like this to every little virtually-hosted e-tailer out there. Somewhere, buried in all this, will be the informational content hosted on a personal webpage or at some non-profit organization. Anyway, so, that's the problem, or an aspect of it, we already know this.

    Good news! The solution is coming. Maybe the solution is here. google.com has their unique approach to web-indexing. Another method that's probably going to be tried sometime soon is to look at all the natural-language-processing technology that has been researched in the past twenty years, take the most efficient heuristics, and index pages by apparent-topic instead of by keyword.

    Then there are places like anipike.com - if it's a web page about Anime, it's on anipike, or it may as well not exist. I would -never- search the web for anything anime-related; I go through anipike.

    I'm really, really hoping that linux.com will become that useful to the linux community, but I don't think they're quite there yet. They may never be. Anipike is generally very fast to load, especially compared to linux.com ... probably 'cause it's mostly static pages and there are not so many anime fans as there are linux users. But that isn't really relevant; if linux.com is going to become the search-engine alternative for linux-resources, they need to respond quickly at all times of the day and night, otherwise 'Joe's Linux Links' is a better option.
    (Apologies to any Joe out there who is proud of his links page. :))

    Anyway, currently I still use search engines for Linux-stuff, but as I keep getting more and more hits on rpm files cluttering up the informational content, that may change soon. (Especially since I'm a debian user! I'm looking for information when I search the web, I know where my package is. :))

    --Parity
  • Actually you can do something similar already. Just sniff for the robot's user-agent and display different content for the robot, like a very structural, correctly coded HTML index of your site.
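
    A minimal sketch of that user-agent check, in Python. The token list and the two page names are made up for illustration; real robots identify themselves in all sorts of ways, so treat this as a starting point rather than a complete detector.

```python
# Hypothetical fragments of crawler names; this list is an assumption,
# not an authoritative registry of robots.
KNOWN_ROBOT_TOKENS = ("googlebot", "scooter", "slurp", "crawler", "spider")


def is_robot(user_agent):
    """Return True if the User-Agent string contains a known robot token."""
    ua = user_agent.lower()
    return any(token in ua for token in KNOWN_ROBOT_TOKENS)


def choose_page(user_agent):
    """Pick the plain structural index for robots, the normal page otherwise."""
    # Both file names are placeholders for whatever the site actually serves.
    if is_robot(user_agent):
        return "site_index.html"
    return "home.html"
```

    In a CGI script the string to test would come from the HTTP_USER_AGENT environment variable.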
  • by sxxw (124344)
    You're thinking of RDF [w3.org], the W3C's language for embedding metadata into XML (and by extension XHTML) content. This is great for page-specific information (such as Dublin Core metadata), and can also be used to provide metadata about collections (such as a set of web pages, or an entire site).

    However (there's always a however) there's the metadata catch. If you divorce metadata from content, then it becomes easy for site admins to lie in their metadata in order to attract visitors. Remember the keyword spamming that used to occur? Now, imagine if that's extended to being able to lie completely about the content of an entire site. Unless you're in an environment where you can trust the providers of your metadata, by and large you're in trouble.

    Cheers,

    Simon.

  • Do search-engine spiders avoid you because your page addresses end in "forbidden" extensions like .cgi, .php3, etc? Do they ignore anything with a ? or & in the URL?

    The solution is easy. Don't use them in your URLs.

    Do not use GET args in dynamically built links, but hide your args in a longer plain ole URL. For example, a script at http://www/x/y can actually interpret http://www/x/y/z/ just fine and you can then parse off z as an argument.

    First, alias a directory that runs your CGIs, PHPs, etc. Like you would cgi-bin but don't call it that!

    Then, plant your cgi program(s) in there. The "arguments" further down would be in the PATH_INFO variable (which you'd have to parse out manually).

    So, in the case of http://www/aa/xx/yy/zz/ the script is in the aliased /aa directory. The script is named xx and the PATH_INFO passed to it, in the above example, would be /yy/zz/

    This works with Apache. Don't have Apache? Upgrade today at www.apache.org [apache.org] :-)
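
    For what it's worth, here is roughly what the script side could look like as a Python CGI. This is only a sketch: the function name path_args is made up, and it assumes the server sets PATH_INFO the way described above.

```python
import os


def path_args(path_info):
    """Split a PATH_INFO string such as '/yy/zz/' into ['yy', 'zz']."""
    # Empty pieces come from the leading and trailing slashes; drop them.
    return [part for part in path_info.split("/") if part]


def main():
    # PATH_INFO is set by the web server for CGI programs; it may be absent.
    args = path_args(os.environ.get("PATH_INFO", ""))
    print("Content-Type: text/plain")
    print()
    print("arguments:", args)


if __name__ == "__main__":
    main()
```

    With the script installed as xx in the aliased /aa directory, a request for /aa/xx/yy/zz/ would hand it the two arguments yy and zz, with no ? or & anywhere in the URL.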

  • The client shouldn't infer the type of the object based on an "extension" in the URL at all ... that is what the Content-Type header is for!

  • Think Google.

    Google works on the idea that pages that have a lot of incoming links are authorities on what they discuss, so they should be ranked highly.

    A modification of this is to not only rank a site's authoritativeness (eh?) this way, but also what kind of content it has. So if 10K geeks all have homepages that include the words "geek" and "computer" and also point to /., it is reasonable to assume that regardless of actual content today, /. typically is a good result to return for the search "geek sites".

    Of course, some of those homepages will also have the words "tennis" and "knitting", that will be spuriously attributed to /., but the idea is that they will be outliers, and drown in the noise.

    This basically is keyword indexing, but the keywords are dynamically determined, rather than using the broken meta tags.

    The big problem with this approach is implementation; the association tables are likely to be huge.

    Also, you assume a large sample size, so that the outliers will cancel.
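
    To make the idea concrete, here is a toy version in Python. It is not how Google actually does it, just the scheme described above: collect the words from every page that links to a target, and keep only the words that enough of those pages share. The function name, the data layout, and the min_share cutoff are all my own assumptions.

```python
from collections import Counter


def keywords_for_target(pages, target, min_share=0.5):
    """Guess keywords for `target` from the pages that link to it.

    pages maps a page URL to a pair: (set of words on the page,
    set of URLs the page links to). A word is kept only if at least
    `min_share` of the linking pages contain it, so rare words
    ("tennis", "knitting") drown in the noise.
    """
    linking = [words for words, links in pages.values() if target in links]
    if not linking:
        return set()
    counts = Counter(word for words in linking for word in words)
    threshold = min_share * len(linking)
    return {word for word, n in counts.items() if n >= threshold}


# Made-up sample data: three geek homepages and one unrelated page.
SAMPLE = {
    "a": ({"geek", "computer", "tennis"}, {"slashdot.org"}),
    "b": ({"geek", "computer"}, {"slashdot.org"}),
    "c": ({"geek", "computer", "knitting"}, {"slashdot.org"}),
    "d": ({"knitting"}, {"yarn.example"}),
}
```

    On the sample data, keywords_for_target(SAMPLE, "slashdot.org") keeps "geek" and "computer" and drops the outliers. The in-memory dict is exactly the part that won't scale; the real association tables would need to live in something much bigger.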

    Johan
  • I have a few (very pertinent) meta-tags on the information page for a mailing list that I run. The tags are designed to get hits from people looking for my list. But, it seems that the meta tags don't work in some of the major search engines. Perhaps the engines have caught on to the practice of embedding superfluous tags in order to get hits on engines. I think I'll have to rework my page to make sure that the key phrases that I'm trying to get hits on actually appear in the text.
  • Yes, but many file systems, which may be the destination of the results of the HTTP request, *do* make use of extensions to determine file type. Though, perhaps, storing MIME-type meta-information would be better, we're stuck with what we've got.

    Also, I mean the URL to also be used as a user-interface. For example:

    http://slashdot.org/99/12/14/1154243/comments

    would generate your browser's preferred format, whereas requests to:

    http://slashdot.org/99/12/14/1154243/comments.pdf
    and
    http://slashdot.org/99/12/14/1154243/comments.scml

    would return the PDF and the Slashdot Comment Markup Language (an XML app) respectively. This could be done with content-type markers, but the interface is much poorer than simply using file extensions.
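
    A rough sketch of how a server might resolve such URLs, in Python. The FORMATS table and the resolve function are made up for illustration, and the MIME type given for .scml is a guess, since that format is invented in the example above.

```python
import posixpath

# Hypothetical extension-to-type table; extend as needed.
FORMATS = {
    ".pdf": "application/pdf",
    ".scml": "text/xml",
    ".html": "text/html",
}


def resolve(url_path, preferred="text/html"):
    """Return (resource, content_type) for a requested URL path.

    A known extension picks the format explicitly and is stripped from
    the resource name. Otherwise the path is left alone and the
    browser's preferred type is used.
    """
    base, ext = posixpath.splitext(url_path)
    if ext in FORMATS:
        return base, FORMATS[ext]
    return url_path, preferred
```

    The resource name stays the same in every case; only the representation changes, and the server still announces it in the Content-Type header.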

  • You missed the best stuff, though! You forgot to mention:

    • {Hn} tags to denote headers instead of using stupid workarounds like {font size=n}. The former denotes structure and importance.
    • Use of {strong} and {em} instead of {b} and {i}, again denoting structure.

    Take a look at my site, theFYI. [thefyi.com] Still a work in progress as the backend isn't done (yet). Dig through the source and see how it's built. I would have loved to use CSS for element layout but, hey, the browser support just is not there yet. Stuck with tables for a few more years. BUT take a look at the structure around each article. The header is denoted with an {h1} tag, its appearance changed with CSS-1. The paragraphs are marked with paragraph tags and, well hell, the linked URLs are surrounded with {cite} tags. That's how you code indexable HTML.

    Used a lot of the same tricks on another site, http://www.ptrm.org/ [ptrm.org] and the site does well in the search engines. Specifically, check out the page on the PTRM's paleontology field tours [ptrm.org]. It does well in the engines simply because it's got 'dinosaur' in the page title and in a header tag.

    (Yes, I know that curly brackets don't go around HTML tags. I just didn't want to escape the angle brackets every time I used an example of HTML)

  • I think you've mentioned the key here. The solution is the use of specialised search engines. I find music with http://search.mp3.de, I find hacks with http://astalavista.box.sk, etc.

    I look for most of my information with HotBot [hotbot.com] just because its advanced search option lets me really weed out the bad hits.
  • I'm involved in Dizz-Net [dizz.net] and I was thinking of a way to separate commercial websites and websites that may or may not be somewhat commercial, yet contain useful information (Slashdot is commercial, yet contains useful information).

    Example: Suppose you're looking for information on a Zip drive. You already have the drive but are having trouble with it (problems with zip drives? really? :-)* ). You do a search and you get a million sites that "guarantee lowest prices!" You could go to the manufacturer's site but they may downplay and misinform about *ahem* "certain problems" (1st zip drive gets click death. Get replacement. Feel ripped off because replacement is refurb. Vow to never buy anything Iomega ever again when 2nd drive starts to click. Use harddrive to backup zip drive. Hope that it falls into some black hole in the bottom of closet).

    Of course I don't even bother, I just go straight for the LDP, but Windows users don't have that option.

    It would be interesting to be able to search only engineering sites for engineering information. I once did a search for "wheatstone bridge" and got tons of $cientology links. If the engine was able to determine if a site was, in fact, an engineering information site, that wouldn't have happened.

    How about a "no pr0n" checkbox. That would be sweet.

    Of course that would require a herculean effort in changing the standards and getting site owners to be honest.

    But maybe not. Here are some ideas I was thinking of:

    • How about manually specifying the categories for some major sites and letting the search engine crawl the links from them and categorising them as Linux sites. If you can get to a link within a reasonable number of links from a bona fide Linux site, it will also be a Linux site. It may also be a gardening site, and can be reached from gardening sites as well. Tada, two categories. etc.
    • Search result moderation. Registered users of the search engine can moderate search results. If you're searching recipe sites for a recipe (using that category thing above, remember?) and a pr0n site comes up, you can select "irrelevant, it's a porn site" and press the moderate button at the bottom.

    It's not perfect, but it's gotta be better than the garbage we put up with now.
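
    The first idea is basically a breadth-first crawl with a hop limit. Here is a minimal Python sketch over an in-memory link graph; the function name, the max_hops default, and the example domains are all hypothetical.

```python
from collections import deque


def propagate_category(links, seeds, max_hops=2):
    """Return every site reachable from the seed sites within max_hops links.

    links maps a site to the sites it links to. The seeds are the
    manually categorised sites; everything returned inherits their
    category.
    """
    hops = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        site = queue.popleft()
        if hops[site] == max_hops:
            continue  # too far from a seed to pass the category on
        for neighbour in links.get(site, ()):
            if neighbour not in hops:
                hops[neighbour] = hops[site] + 1
                queue.append(neighbour)
    return set(hops)


# Made-up link graph: a chain of four sites.
EXAMPLE_LINKS = {
    "linux.example": ["kernel.example"],
    "kernel.example": ["drivers.example"],
    "drivers.example": ["far.example"],
}
```

    Run it once per category with that category's seed list; a site that turns up in both the Linux run and the gardening run simply gets both labels.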
  • How clever. When somebody on that site requests robots.txt, it adds the IP to a robots list and the next time that you get the home page, it returns it to you without a session ID. In robots.txt there is a link that you can go to to remove yourself from this list.



    I think that would qualify for a patent. Go for it. It's a great idea.



    I just made the entire site unusable for my entire company by viewing the robots.txt. How proxy server friendly. I hope nobody tries to look at the robots.txt file through an AOL connection.

  • If ebay has their way, indexing data is equivalent to cracking into another's system illegally

    I think what you meant to say was "If ebay has their way, accessing a copyrighted database and publishing information from it after being explicitly told not to is equivalent to cracking into another's system illegally."

    I guess that means that we should do away with all search engines entirely...
    I'm afraid you're right. We're pretty close to a time when most web pages will be served up programmatically from what amount to copyrighted databases. Indexing such sites without explicit permission from the content owners would be legally risky.

  • Standard disclaimer: IANALG (I am not a Linux geek.) Rather, I'm a web design geek. So please, be nice.

    From what I understand, what a lot of the OpenSource movement is about is doing it yourself if you don't like how it's being done now. Don't like commercial Unix, Linus? Make your own fscking Linux and let everyone contribute. Oh yeah: Give it away free to really piss people off.

    In this discussion, there are a ton of excellent ideas for how search engines should operate. Yet no one, to my knowledge, has put forward the next logical step: Build our own search engine. Google is a good start but hey, I know you guys could build it better. Worried about hardware and bandwidth costs? Venture capital.

    As I said, do it yourself. :-)

  • by DHartung (13689) on Tuesday December 14, 1999 @03:20PM (#1466270) Homepage
    Don't overlook XML-RPC [xmlrpc.com], which builds on the XML spec to provide a way of serving data over the web to remote clients.

    Then there's RSS [webreview.com], which is a way of serving up a news channel or other changing data. These applications are here and in use. Together, these XML-based technologies will someday provide the data layer for the software agents of the future. Read lately about that new "price-checker" technology? Imagine being the one business that doesn't serve up your product list and pricing to that agent.

    An interface from XML to these "hidden" databases is only a matter of time. We're just caught right now at a moment between technologies: the authoring tools don't really exist.
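
    To give a feel for how small an XML-RPC call is, here is a sketch using Python's standard xmlrpc.client module. It only builds and parses the request in memory, with no network involved. The method name catalog.getPrice and the product id are invented for the example; a real price-checking agent would define its own.

```python
import xmlrpc.client


def build_price_request(product_id):
    """Marshal a call to a hypothetical 'catalog.getPrice' method as XML."""
    return xmlrpc.client.dumps((product_id,), methodname="catalog.getPrice")


def parse_request(payload):
    """Unmarshal an XML-RPC request into (method name, parameter tuple)."""
    params, method = xmlrpc.client.loads(payload)
    return method, params
```

    The string returned by build_price_request is what would be POSTed to the server, and parse_request is the server's first step before looking the product up in its "hidden" database.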
    ----
  • PHP Builder ran an article [phpbuilder.com] describing how you can have Apache Webserver [apache.org] treat a certain "directory" as a script, using the Location [apache.org] directive. So if I had a script called www.mydomain.com/foo then I could access www.mydomain.com/foo/param1/param2 and the foo script would run, and could use environment variables to find the "path" foo/param1/param2. I tried it, and it works quite well. This hides GET parameters as "paths" so that search engines don't think the pages are dynamic (this is how Amazon.com works).
  • ...but you made an elaborate site entirely dependent on Microsoft Active Server Pages, and you're expecting it to work with _any_ web standards, much less be indexable and spiderable as if it was proper HTML? I'm afraid that you stepped right into that one. Look on the bright side- were it not for this Ask Slashdot article, you might never have known you weren't indexable, as this seems to be a little known fact! That alone is rather shocking.
