Internet Searching Using Regular Expressions?

Posted by Cliff on Friday March 16, 2001 @06:02AM from the incomprehensible-search-but--patters-better-results dept.

/[Aa]non([aiy]mo?u?s)? [KkCc]ow[ae]rd/ asks: "Remarkably few people have a working understanding of regular expressions. But, those that do know how useful they can be for searching text. Has anyone out there seen a large search engine (like Google) that will take regular expressions for queries? How about a newsgroup search engine?" Aside from the fact that many regular expressions read like snippets of line noise, they are the best thing I've seen for searches, and it's a lot easier than -adding +alot +of +search -terms.

Re:Performance? (Score:1)

by dougmc ( 70836 ) writes:

However, doing the preliminary match using the regexp would definitely be resource-prohibitive. In the above example, you would have to read in the text of each file to apply the regexp. Not to mention the cost of keeping the text around.

You do it by looking for the words in the regex, then you search for those (the words can be indexed -- the regex can't), then you apply the regex to the results to further narrow them down. This is how glimpse/webglimpse (www.webglimpse.net [webglimpse.net]) works.
However, this all falls apart when the keywords are all very common, because nearly every page everywhere will contain them, and so the actual regular expression search will have to search thousands of pages. But for uncommon words, it works fairly well.
glimpse does do this, but it has problems with memory usage and being slow at times -- probably exactly because it does this.
Sorry, but I don't know of any full-Internet search engine that allows this. Your best bet is probably to write something that looks for the keywords in a regex, feeds them to Google, downloads every page that matches, and then runs the regex on your own computer to further narrow down the results. Depending on how common your keywords are, it may work well, or it may try to download half the Internet.
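A minimal sketch of the keyword-prefilter idea described above, assuming a tiny in-memory document set. The names `literal_words`, `build_index`, `regex_search`, and `DOCS` are made up for illustration and are not how glimpse itself is implemented. The keyword extraction is deliberately crude: it is only safe for simple patterns with no alternation or optional groups, where every literal word must appear in any match.

```python
import re
from collections import defaultdict

def literal_words(pattern, min_len=3):
    """Pull plain words out of a regex to use as index keywords (hypothetical helper)."""
    # Blank out character classes, escapes, and {m,n} quantifiers first,
    # so their letters are not mistaken for literal text.
    stripped = re.sub(r"\[[^\]]*\]|\\.|\{[^}]*\}", " ", pattern)
    return [w.lower() for w in re.findall(r"[A-Za-z]+", stripped) if len(w) >= min_len]

def build_index(docs):
    """Inverted index: word -> set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(doc_id)
    return index

def regex_search(pattern, docs, index):
    """Narrow by indexed keywords, then run the real regex on the survivors."""
    words = literal_words(pattern)
    if words:
        candidates = set.intersection(*(index.get(w, set()) for w in words))
    else:
        candidates = set(docs)  # no usable keywords: fall back to a full scan
    rx = re.compile(pattern)
    return sorted(d for d in candidates if rx.search(docs[d]))

DOCS = {
    "a": "somewhere out there, over the hill",         # all three words, wrong order
    "b": "somewhere over the rainbow there is a pot",  # all three words, right order
    "c": "nothing relevant here",
}
HITS = regex_search("somewhere.*over.*there", DOCS, build_index(DOCS))
print(HITS)
```

Documents "a" and "b" both survive the index lookup, and only the final regex pass separates them. That second pass is where the cost goes when the keywords are common.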
Re:FP RegEx (Score:1)

by Doctor Dark ( 87531 ) writes:

Try some of these... http://www.leidenuniv.nl/ub/biv/specials.htm
Glimpse / Webglimpse (Score:1)

by polymath69 ( 94161 ) writes:

Some sites are indexed for quick searching using Webglimpse [webglimpse.net], which supports modified RE searching. There's a short list of such sites here [webglimpse.org].
Unfortunately I don't know of any Web-wide RE-capable database. Here's hoping someone downthread does...

Re:Change the Storage Method [Was: Re:Performance? (Score:1)

by fwc ( 168330 ) writes:

I should have been more specific. This method would work well in an example like the one above, but what if you are talking about a search something like:
[JjFf][ae][nb]\s*[0-9]{1,2},{0,1}\s*[0-9]{2,4}
This gets really messy really fast and exactly *HOW* are you going to do a query on a keyed database using this?
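To make the problem concrete, here is a small sketch that runs the pattern above with Python's `re` module. The sample strings are invented for illustration. The pattern contains no literal word at all, only character classes, so there is nothing obvious to look up in a word index; the best you could do is enumerate the three-letter prefixes it allows.

```python
import re
from itertools import product

DATE_RX = re.compile(r"[JjFf][ae][nb]\s*[0-9]{1,2},{0,1}\s*[0-9]{2,4}")

# Invented sample strings, purely for illustration.
samples = ["Jan 16, 2001", "feb 3 99", "Feb16,2001", "March 16, 2001", "Fan 1 2000"]
matches = [s for s in samples if DATE_RX.search(s)]

# Every three-letter prefix the first three character classes can produce.
prefixes = ["".join(p) for p in product("JjFf", "ae", "nb")]

print(matches)
print(len(prefixes), prefixes)
```

Note that "Fan 1 2000" matches too, and that "Feb16,2001" has no whitespace between the month and the day, so an index keyed on whole words would not even contain a "feb" entry for it.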
Change the Storage Method [Was: Re:Performance?] (Score:1)

by SagSaw ( 219314 ) writes:
Here is how I would try to do this:
1. Get an insanely fast (and probably expensive) database.
2. In this database, I would store the location of every word in each document (i.e. 'somewhere' is words 17 and 87, 'over' is word 45, 'there' is words 98, 109, and 1).
3. When somebody searches for 'somewhere*over*there' I would query the database for a list of the locations of 'somewhere', 'over', and 'there', sorted by the document they originated from.
4. Now it should be easy to find the documents which have the words in the right order. Simply look for documents where the location of 'somewhere' is less than the location of 'over', which is less than the location of 'there'.

This would probably require a very fast, large database to accomplish, but it could be done.
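The steps above can be sketched with an in-memory positional index standing in for the database. This is only an illustration under that assumption; `build_positional_index`, `docs_with_words_in_order`, and `DOCS` are hypothetical names. It works at the word level, so it approximates 'somewhere*over*there' rather than implementing full regular expressions.

```python
import re
from bisect import bisect_right
from collections import defaultdict

def build_positional_index(docs):
    """word -> {doc_id: [word positions in ascending order]}"""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(re.findall(r"[a-z]+", text.lower())):
            index[word][doc_id].append(pos)
    return index

def docs_with_words_in_order(index, words):
    """Ids of documents where the words occur in the given order."""
    postings = [index.get(w, {}) for w in words]
    if not postings:
        return []
    # Only documents containing every word are worth checking.
    common = set(postings[0]).intersection(*postings[1:])
    hits = []
    for doc_id in sorted(common):
        last = -1
        for posting in postings:
            positions = posting[doc_id]
            i = bisect_right(positions, last)  # first position after `last`
            if i == len(positions):
                break                          # this word never appears late enough
            last = positions[i]
        else:
            hits.append(doc_id)
    return hits

DOCS = {
    "a": "somewhere out there, over the hill",
    "b": "somewhere over the rainbow there is a pot",
}
INDEX = build_positional_index(DOCS)
RESULT = docs_with_words_in_order(INDEX, ["somewhere", "over", "there"])
print(RESULT)
```

Taking the earliest usable position for each word in turn is enough here: if any in-order arrangement exists, the greedy choice finds one.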
Re:Performance? (Score:1)

by Carl Drougge ( 222479 ) writes:

I don't know about the performance of it, but PostgreSQL supports regexps, something like this as I recall (the pattern goes in an ordinary quoted string):
SELECT whatever FROM somewhere WHERE something ~ 'somewhere.*over.*there';
Performance? (Score:2)

by fwc ( 168330 ) writes:

I'm trying to envision how you could do this in a reasonably fast manner on a very large database. From what I know about regular expressions, they can be VERY CPU-intensive even on a small file. For instance, how would you match:
somewhere.*over.*there
Across an entire Internet-sized search engine? I guess you could pre-select documents containing 'somewhere' and 'over' and 'there' and then proceed with screening them through a "standard" regexp search.
However, doing the preliminary match using the regexp would definitely be resource-prohibitive. In the above example, you would have to read in the text of each file to apply the regexp. Not to mention the cost of keeping the text around.
That said, I can see how you could implement a regexp-like front end to a search tool if you had some restrictions as to what you could do with the regular expressions. However, I suspect the idea was more to be able to do advanced conditionals and other funky stuff within the regular expressions, and limiting this would probably limit the usefulness of the product.
So, maybe to summarize my rambling, the initial hurdle would be to re-invent the way normal regexps work in order to be efficient in a multi-gigabyte database.
^I love regular expressions, but$ (Score:3)

by human bean ( 222811 ) writes: on Friday March 16, 2001 @02:47PM (#359927)

If you read Don Knuth's volume two, and do some elementary math, you can get a feel for what this would take.
Best to just download search results with a spider and hit them with grep, if you've got the time. [sigh].
