The Internet

Internet Searching Using Regular Expressions?

Posted by Cliff
from the incomprehensible-search-but--patters-better-results dept.
/[Aa]non([aiy]mo?u?s)? [KkCc]ow[ae]rd/ asks: "Remarkably few people have a working understanding of regular expressions. But, those that do know how useful they can be for searching text. Has anyone out there seen a large search engine (like Google) that will take regular expressions for queries? How about a newsgroup search engine?" Aside from the fact that many regular expressions read like snippets of line noise, they are the best thing I've seen for searches, and it's a lot easier than -adding +alot +of +search -terms.
This discussion has been archived. No new comments can be posted.

  • However, doing the preliminary match using the regexp would definitely be resource-prohibitive. In the above example, you would have to read the text of each file in to do the regexp. Not to mention the cost of keeping the text around.
    You do it by looking for the words in the regex, then you search for those (the words can be indexed -- the regex can't), then you apply the regex to the results to further narrow them down. This is how glimpse/webglimpse (www.webglimpse.net [webglimpse.net]) works.

    However, this all falls apart when the keywords are all very common, because nearly every page everywhere will contain them, and so the actual regular expression search will have to search thousands of pages. But for uncommon words, it works fairly well.

    glimpse does do this, but it has problems with memory usage and being slow at times -- probably exactly because it does this.

    Sorry, but I don't know of any full-Internet search engine that allows this. Your best bet is probably to write something that looks for the keywords in a regex, feeds them to Google, downloads every page that matches, and then runs the regex on your own computer to further narrow down the results. Depending on how common your keywords are, it may work well, or it may try to download half the Internet.
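
A rough sketch of that keyword-then-regex idea, in Python. This is not how glimpse is actually implemented; the function names (`literal_keywords`, `build_word_index`, `regex_search`) are made up for illustration, and the keyword extraction is deliberately naive. It assumes the literal runs in the pattern are whole words, so a pattern like `[Gg]old` (which yields the fragment "old") or one using `|` or `?` would be filtered incorrectly.

```python
import re

def literal_keywords(pattern, min_len=3):
    """Naively pull plain-letter runs out of a regex (hypothetical helper)."""
    stripped = re.sub(r"\\.", " ", pattern)           # drop escapes such as \s
    stripped = re.sub(r"\[[^\]]*\]", " ", stripped)   # drop character classes
    stripped = re.sub(r"\{[^}]*\}", " ", stripped)    # drop {m,n} quantifiers
    words = re.findall(r"[A-Za-z]+", stripped)
    return [w.lower() for w in words if len(w) >= min_len]

def build_word_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in set(re.findall(r"[a-z]+", text.lower())):
            index.setdefault(word, set()).add(doc_id)
    return index

def regex_search(pattern, docs, index):
    """Narrow by indexed keywords first, then run the real regex."""
    candidates = set(docs)
    for word in literal_keywords(pattern):
        # Whole-word lookup: this is the part the index makes cheap.
        candidates &= index.get(word, set())
    # With no usable keywords, candidates is still every document,
    # which is the "download half the Internet" case.
    rx = re.compile(pattern)
    return sorted(d for d in candidates if rx.search(docs[d]))
```

With common keywords the candidate set barely shrinks, so nearly all the work lands on the final regex pass.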

  • Try some of these... http://www.leidenuniv.nl/ub/biv/specials.htm
  • Some sites are indexed for quick searching using Webglimpse [webglimpse.net], which supports modified RE searching. There's a short list of such sites here [webglimpse.org].

    Unfortunately I don't know of any Web-wide RE-capable database. Here's hoping someone downthread does...

    --

  • I should have been more specific. This method would work well in an example like the one above, but what if you are talking about a search like:

    [JjFf][ae][nb]\s*[0-9]{1,2},{0,1}\s*[0-9]{2,4}

    This gets really messy really fast and exactly *HOW* are you going to do a query on a keyed database using this?
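
To make the problem concrete, here is a small Python check of what that pattern actually accepts. The sample strings and the `find_dates` helper are invented for illustration. Note that the pattern contains no literal word at all, so there is nothing to look up in a word index, and it also accepts non-dates such as "Fan 12 345".

```python
import re

# The pattern from the comment above, unchanged.
DATE_RX = re.compile(r"[JjFf][ae][nb]\s*[0-9]{1,2},{0,1}\s*[0-9]{2,4}")

def find_dates(text):
    """Return every substring of text that the pattern matches."""
    return [m.group(0) for m in DATE_RX.finditer(text)]

samples = [
    "Posted Jan 5, 2001 and feb 14 01",  # two date-like matches
    "Fan 12 345",                        # false positive: not a month
    "March 16, 2001",                    # no match: only Jan/Feb-like prefixes
]
for s in samples:
    print(s, "->", find_dates(s))
```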

  • Here is how I would try to do this:

    1. Get an insanely fast (and probably expensive) database.
    2. In this database, I would store the location of every word in each document. (i.e., somewhere is words 17 and 87, over is word 45, there is words 98, 109, and 1)
    3. When somebody searches for 'somewhere*over*there' I would query the database for a list of the locations of 'somewhere', 'over', and 'there', sorted by the document they originated from.
    4. Now it should be easy to find the documents which have the words in the right order. Simply look for documents where the location of somewhere is less than the location of over, which is less than the location of there.

      This would probably require a very fast, large database to accomplish, but it could be done.
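
The four steps above could be sketched in Python roughly like this, with an in-memory dict standing in for the fast database. The names `build_positional_index` and `docs_with_words_in_order` are hypothetical, and the sketch only handles whole words separated by `*`, not general regular expressions.

```python
import re

def build_positional_index(docs):
    """Step 2: map word -> {doc_id: [word positions in increasing order]}."""
    index = {}
    for doc_id, text in docs.items():
        for pos, word in enumerate(re.findall(r"[a-z]+", text.lower())):
            index.setdefault(word, {}).setdefault(doc_id, []).append(pos)
    return index

def docs_with_words_in_order(index, words):
    """Steps 3 and 4: find documents where the words occur in the given order."""
    postings = [index.get(w, {}) for w in words]
    if not postings:
        return []
    # Only documents containing every word can match.
    candidates = set(postings[0])
    for p in postings[1:]:
        candidates &= set(p)
    hits = []
    for doc_id in sorted(candidates):
        prev = -1
        for p in postings:
            # Take the earliest occurrence after the previous word's position.
            nxt = next((pos for pos in p[doc_id] if pos > prev), None)
            if nxt is None:
                break
            prev = nxt
        else:
            hits.append(doc_id)
    return hits
```

Taking the earliest usable position for each word is enough here, because an earlier position never rules out a later word that a later position would allow.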

  • I don't know about the performance of it, but PostgreSQL supports regexps, something like this as I recall:

    SELECT whatever FROM somewhere WHERE something ~ 'somewhere.*over.*there';

  • I'm trying to envision how you could do this in a reasonably fast manner on a very large database. From what I know about regular expressions, they can be VERY CPU-intensive even on a small file. For instance, how would you match:

    somewhere.*over.*there

    Across an entire internet sized search engine? I guess you could pre-select documents containing somewhere and over and there and then proceed with screening them through a "standard" regexp search.

    However, doing the preliminary match using the regexp would definitely be resource-prohibitive. In the above example, you would have to read the text of each file in to do the regexp. Not to mention the cost of keeping the text around.

    That said, I can see how you could implement a regexp-like front end to a search tool if you had some restrictions as to what you could do with the regular expressions. However, I suspect the idea was more to be able to do advanced conditionals and other funky stuff within the regular expressions, and limiting this would probably limit the usefulness of the product.

    So, maybe to summarize my rambling, the initial hurdle would be to re-invent the way normal regexps work in order to be efficient in a multi-gigabyte database.

  • by human bean (222811) on Friday March 16, 2001 @01:47PM (#359927)
    If you read Don Knuth's volume two, and do some elementary math, you can get a feel for what this would take.

    Best to just download search results with a spider and hit them with grep, if you've got the time. [sigh].

"Our reruns are better than theirs." -- Nick at Nite
