Any Interest in a Regexp-Based Web Search Engine? 51
K-Man asks: "From time to time, I've seen people comment that they would be interested in searching the web with regular expressions, but I've seen very little research in this area. Over many months (as part of a project I call 'grepple'), I've gradually assembled some background on the idea (also some work-in-progress not noted in the link), and the idea seems to be approaching the realm of technical possibility. However, my expertise is not in marketing, so I have no idea whether anybody would use this capability. So I ask, if you could search the web for any regular pattern, including html, partial words or wildcards, long phrases, or anything you might grep out of an html file, would you do it? What types of searches would you do?"
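For readers trying to picture it: stripped of crawling and ranking, the capability being asked about boils down to something like this toy Python sketch (function names and sample pages are illustrative; nothing here reflects grepple's actual internals).

```python
import re

# Toy model of a regexp web search: apply an arbitrary pattern to raw page
# source, not to a pre-tokenized word index. The pages dict stands in for a
# crawled corpus; a real engine obviously cannot rescan everything per query.
def grep_pages(pattern, pages, flags=0):
    """Return {url: matches} for every page whose raw HTML matches."""
    rx = re.compile(pattern, flags)
    return {url: rx.findall(html) for url, html in pages.items() if rx.search(html)}

pages = {
    "http://example.com/a": '<a href="pics/cowboyneal.jpg">pic</a>',
    "http://example.com/b": "<p>nothing to see here</p>",
}
hits = grep_pages(r'href="[^"]*\.jpg"', pages)
```

The performance question the thread keeps returning to is exactly this: the dict comprehension above touches every page for every query.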
I'm a spammer - I'd use the following query: (Score:5, Insightful)
+1 Funny/insightful on the MQR standard (Score:5, Informative)
You have a point, but I have no mod points at the moment, save the ones I coin myself. Any new ability will invite new abuses (or, at least, new forms of old ones).
-- MarkusQ
P.S. For the regexp challenged, the parent poster was showing how easy it would be to use a rexular expression search engine to harvest e-mail addresses which the Bad Guy could then send spam to.
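(For the regexp-curious, the harvesting pattern needs nothing exotic; a rough Python illustration, with a deliberately simplified address pattern rather than anything RFC-accurate:)

```python
import re

# A crude e-mail pattern: good enough to show why a regexp search engine
# would be a spam harvester's dream, nowhere near a full RFC 2822 matcher.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

page = 'Contact <a href="mailto:bob@example.com">Bob</a> or alice@example.org.'
addresses = EMAIL.findall(page)
```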
Re:+1 Funny/insightful on the MQR standard (Score:2)
Re:+1 Funny/insightful on the MQR standard (Score:1)
Thanks Dmitry. And keep defending folks like him, EFF.
Re:+1 Funny/insightful on the MQR standard (Score:2)
I would worry less about finding addresses and more about how the e-mail is being sent. The new virus hitting MS XP, W2K, and NT systems is far more of a problem than this is. If people would shut down the few open Unix boxes and the many Exchange servers, then the spammers would be in trouble.
Re:+1 Funny/insightful on the MQR standard (Score:1)
OK I give up. What is it?
Re:+1 Funny/insightful on the MQR standard (Score:2)
Rhapsody in blue (try it on a piano some time).
-- MarkusQ
Re:I'm a spammer - I'd use the following query: (Score:1)
Good example (Score:2)
There are a lot of HTML tag fragments (e.g. img tags with just part of the src URL, hrefs to a given domain or subsection of a website, etc.) that might be handy to find, at least for technical people.
I share most people's skepticism about regexp's ever becoming mainstream, but it might be a good foundation for value-added services, like finding web pages by color, font, number of images, etc.
Re:Good example (Score:1)
So you found the whole Internet ... (Score:2)
The original post is sounding more and more like FUD to me.
Interest? Sure. (Score:4, Insightful)
But it seems that you'd have a huge performance problem you'd have to work around. Search engines work by indexing the words as-is. Since you can't do that with a regexp search, I can't see any way that you could have a regexp search engine that didn't have to scan every page for every new search.
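The mismatch is easy to see in miniature; this Python sketch (toy data, illustrative names) contrasts the one-dict-lookup an inverted index gives exact words with the key scan a regexp forces:

```python
import re

# Build a tiny inverted index: word -> set of document ids.
docs = {1: "the killer struck", 2: "killing time", 3: "a kind word"}
index = {}
for doc_id, text in docs.items():
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)

# Exact word: a single dict lookup.
exact_hits = index["killing"]

# Regexp: no direct lookup exists, so every index key must be tested -- and
# that is the cheap case; patterns spanning word boundaries force full scans.
rx = re.compile(r"kill\w+")
regex_hits = set().union(*(ids for word, ids in index.items() if rx.fullmatch(word)))
```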
Full text indexing (Score:4, Interesting)
The penalty is a bit of space for indexing, but methods for compressed indexing have been found which use only about 40% of the source text size to hold both the index and the source text.
IMHO, much of the performance problem has already been solved, so the question is really whether people would use a tool if it were developed.
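As a toy illustration of what a full-text index buys you: a plain suffix array (the uncompressed ancestor of the compressed schemes mentioned above) already turns any substring query into a binary search rather than a scan. Everything below is illustrative Python, not an efficient implementation.

```python
from bisect import bisect_left, bisect_right

def suffix_array(text):
    # Indices of all suffixes, sorted by suffix text. (Naive construction;
    # real builders are near-linear.)
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_all(text, sa, pattern):
    # Suffixes that start with the pattern form one contiguous run in the
    # sorted order, so two binary searches bound every occurrence.
    keys = [text[i:i + len(pattern)] for i in sa]  # built per-query here only for clarity
    lo = bisect_left(keys, pattern)
    hi = bisect_right(keys, pattern)
    return sorted(sa[lo:hi])

text = "abracadabra"
sa = suffix_array(text)
positions = find_all(text, sa, "abra")
```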
Fallible memory, etc (Score:5, Insightful)
- the obvious one is 'stem*' to get all words that begin with a certain string, but sometimes I might want the opposite '*ending' as well
- if I'm unsure of the spelling, 'start?end' could come in handy
- most search-engines are useless for specifying punctuation or capitalization
- I'd like to be able to search for ranges of dates using '18??' or the equivalent
- phrases with gaps or alternate forms ("All your [x] belong to [y]")
My recommendation would be to start with strong-content sites (Project Gutenberg, Wired, etc) and see how computationally expensive it becomes, one step at a time.
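The wish-list above translates into regular expressions almost one-for-one. A Python sketch (a real engine would define its own query syntax; "gr?y" stands in for the single-unknown-character case):

```python
import re

# Each shorthand on the left maps to a plain re pattern on the right.
examples = {
    "stem*":   r"\bstem\w*",     # all words starting with a string
    "*ending": r"\w*ending\b",   # the opposite: words ending with it
    "gr?y":    r"\bgr.y\b",      # one uncertain character in the spelling
    "18??":    r"\b18\d\d\b",    # a range of dates
}

text = "In 1847 a grey stemware catalogue appeared, stemming a trend ending in 1850."
hits = {glob: re.findall(rx, text) for glob, rx in examples.items()}
```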
Re:Fallible memory, etc (Score:1)
http://www.google.com/search?q=%22all+your+*+ar
Re:Fallible memory, etc (Score:2)
Very useful to look for a file or a set of files.
regex:/href=".*cowboyneal\.jpg"/i
Re:Fallible memory, etc (Score:1)
Manually desensitizing ([Hh][Rr][Ee][Ff]) doesn't have a performance penalty to speak of, so that would have to be done behind the scenes.
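The transform itself is mechanical, which is exactly why it could be hidden from the user. A naive Python sketch (it ignores escape sequences and existing character classes, so it's illustrative only):

```python
def desensitize(pattern):
    # Rewrite each letter as a two-character class so a case-sensitive
    # matcher behaves case-insensitively. Naive: a real version must skip
    # escape sequences and the insides of existing [...] classes.
    out = []
    for ch in pattern:
        if ch.isalpha():
            out.append("[%s%s]" % (ch.upper(), ch.lower()))
        else:
            out.append(ch)
    return "".join(out)

expanded = desensitize("href")
```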
Re:Fallible memory, etc (Score:3, Interesting)
I'd be interested, mostly to exclude search hits that were not related to the topic of interest by anything other than an accident of vocabulary.
For example, if I wanted to search for the use of "Star Wars" in relation to the "Space Defense Initiative" and am not interested in the movie "Star Wars", I would very much like to have a search of "Star Wars !movie". I don't think Google can do this very well, although I haven't tried much either. Another example would be multiple operators, eg +(Apple AND/OR
Tip: How to exclude a word in Google (Score:2)
if I wanted to search for the use of "Star Wars" in relation to the "Space Defense Initiative" and am not interested in the movie "Star Wars", I would very much like to have a search of "Star Wars !movie". I don't think Google can do this very well
Google has an exclude function ("star wars" -movie [google.com]) but because it isn't artificially intelligent, it doesn't exclude movie merchandise such as action figures, computer games, card games, etc. Better: "star wars" -movie -lucas -lucasfilm -trilogy [google.com] whose tenth r
Re:Fallible memory, etc (Score:1)
So for example, "Fred* Bloggs" would search for all pages containing "Bloggs" and then run the regex for the Fred part.
To do a date search, e.g. find events on your birthday (29/02/????), first find all pages containing "29/02", then match 29/02/???? against those.
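That two-stage idea (cheap literal prefilter, then the full pattern over the survivors) looks roughly like this in Python; the pages and patterns are made up for illustration:

```python
import re

pages = {
    "a": "Born 29/02/1848 in Dublin.",
    "b": "Mentions 29/02 with no year.",
    "c": "Nothing relevant here.",
}

literal = "29/02"                # indexable fragment of the query
rx = re.compile(r"29/02/\d{4}")  # full pattern, run only on candidates

# Stage 1: ordinary keyword-style filtering. Stage 2: regex on survivors.
candidates = {k: v for k, v in pages.items() if literal in v}
matches = {k: rx.findall(v) for k, v in candidates.items() if rx.search(v)}
```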
not per
too highly factorized (Score:3, Interesting)
The only way I can see it happening is if an associated list of popular searches is entered into the db store and regularly updated. Sadly you're going up in factors, depending on how many expressions you have, so it'd be a huge db pull.
Maybe... it's a cute idea. I'm sure something client-side would be easier, with the advent of broadband in most homes.
Matt
Probably not... (Score:5, Insightful)
The two that come to mind are word stemming approaches and things trying to take advantage of processing that's closer to (though of course not necessarily reaching) natural language processing. Both of those improvements are really useful, and are at least possible to implement, though not easy.
Word stemming approaches eliminate the whole class of "I want every form of kill: kill(|ed|er|ing)" queries; plus you don't want a human to have to enumerate that.
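Generating that alternation automatically is exactly what a stemming layer would do; a toy Python version (the suffix list is made up, and real stemmers such as Porter's work in the opposite direction, reducing words to stems):

```python
import re

def stem_pattern(base, suffixes=("", "s", "ed", "er", "ing")):
    # Expand a base word into the alternation a user would otherwise type.
    return r"\b%s(?:%s)\b" % (re.escape(base), "|".join(suffixes))

rx = re.compile(stem_pattern("kill"))
found = rx.findall("The killer kept killing; three were killed.")
```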
Phrase alternation is already handled by existing syntax: "All your (base OR chili) is belong to (us OR them)." You don't need regexp for that.
Most of the rest of the examples of where a regexp might be useful are almost certainly toys that sound like a cool hack but won't actually be useful.
Note that a counterexample requires not yet another probably-silly hack, but a plausible use case: an example of something you were really searching for, that a regexp engine might have been able to find, and that there was no good way of finding currently. In my experience the only searches I can't do are the ones where no search term will uniquely identify what I'm looking for out of a sea of pages related to that term but not to my question. One example I recall was looking for how poisonous a philodendron is to a cat; if the info is out there, it's swamped by pages saying simply that it is poisonous, with no indication of how much.
That's an example where a hypothetical search engine with better NLP might have helped me: I could have asked it for only pages that included "how much" information about the poison level, and not its mere existence.
On the one hand, I'd take this with a grain of salt as I'm just a random Internet yahoo, and you'll always find someone who says "X won't work." You can't let that be a stopper. On the other hand, you might want to mull this over and be sure you are not being overoptimistic about the usefulness of this before committing many resources to it. In particular, I recommend scrutinizing your own usage of real search engines over the next few weeks, and ideally the usage of others, to make sure your approach can beat Google in at least some useful domain. Overoptimistic assessment of one's own program is a very real danger of being a programmer, and it has scuttled more than one project.
My overall view on the subject.. (Score:2)
Anything that displays data should allow you to search
Basically, absolutely everything should support regexp search, even if it doesn't make any fucking sense to do so.
Problem: the way regular expressions work isn't anywhere near standard from program to program. Even a minor syntax difference like "in this one, you need to put a slash before parens in order to make the parens special" vs. "in this one, you need to put a slash before parens or else they are treated as special" is enough to cause confusion.
Been there, done that, won't work (Score:5, Insightful)
(2) it is too easy to create a query that matches half the words in the index, bogging your search down to a crawl
(3) in all likelihood what you want is a stemmer and something that allows typos, not a full fledged regular expression matcher
(4) the main problem with search engines is that they return too many results, not too few. Regex search capabilities further increase the size of the result set.
(5) let me repeat point (3). Regular expressions are not a natural operation when searching natural language.
Re:Been there, done that, won't work (Score:4, Interesting)
At the technical level, one indexing method I'm currently looking at (the FM index [unipmn.it]) has a couple of advantages. First, it is incremental, extending a match one character at a time, and allows backtracking etc. to probe different legs of a regexp. It's also very quick at counting hits at each step, making realtime pruning of query results very easy.
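The counting-and-pruning behaviour can be mimicked with a plain sorted-suffix list, which is enough to show the idea even though the FM index achieves it in compressed space (toy Python, not the actual FM-index machinery):

```python
from bisect import bisect_left, bisect_right

text = "abracadabra"
suffixes = sorted(text[i:] for i in range(len(text)))

def count(prefix):
    # Number of occurrences of prefix in text: the width of the run of
    # sorted suffixes that start with it.
    return bisect_right(suffixes, prefix + "\uffff") - bisect_left(suffixes, prefix)

# Probing the pattern a(b|x)r one character at a time:
leg_abr = [count("a"), count("ab"), count("abr")]  # this leg stays alive
leg_ax = [count("a"), count("ax")]                 # prune once the count hits 0
```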
hmmm (Score:2)
Anyway, I think it would be very cool and very useful, if it could be done without scaling problems. I am not an expert on REs, but I've always been told that they are slower than indexed lookups and don't scale to massive quantities of info.
And if it can't be done without scaling problems... it could be done for a subset of the net, like finding all matching entries of Regex1 within all URLs matching Regex2.
I find t
Re:hmmm - Bookmarklets! (Score:2)
One of them is here [squarefree.com], spawn your favorite search engine and look for bookmarklets, there are plenty.
Bookmarklets and smart bookmarks (not available in IE) can make magic and turn your browser into a very powerful process
Re:hmmm - Bookmarklets! (Score:1)
Thanks, this is all good and is almost exactly what I want.
Would work on binaries too (Score:2)
URL's are a good example of difficult-to-parse search targets. At one time I was looking at parsing urls into components and searching those, but even then it was too hard to search with just a fragment.
It won't work (Score:3, Informative)
If you want to do searching of a small intranet, you might be able to get away with it. You might be able to do globbing, but currently using regexes won't work.
The main regex-related features I suspect people might want are:
* Phrases. Google and almost all other search engines can already do this, with quotes.
* NEAR. foo NEAR bar in the document requests documents where foo occurs "near" bar. This is of somewhat more dubious utility, but there are some searches that it's convenient for.
* Boolean NOT. Google and almost all other search engines can already do this.
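NEAR has no regexp equivalent, but over a positional index it is a few lines. A toy Python sketch (whole-word, symmetric, within k tokens; real engines tune all three choices):

```python
import re

def near(text, a, b, k=3):
    # True if words a and b occur within k tokens of each other.
    toks = re.findall(r"\w+", text.lower())
    pos_a = [i for i, w in enumerate(toks) if w == a]
    pos_b = [i for i, w in enumerate(toks) if w == b]
    return any(abs(i - j) <= k for i in pos_a for j in pos_b)

doc = "The foo module wraps bar cleanly."
```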
NEAR in Google (Score:1)
NEAR. foo NEAR bar in the document requests documents where foo occurs "near" bar. This is of somewhat more dubious utility, but there are some searches that it's convenient for.
Google already does this to an extent, using NEARness of your search terms as one of the terms in the ranking equation.
I'd buy that for a dollar... (Score:1)
Well, I can't honestly say I'd pay for such a service, but even being able to do simple regex stuff like "There.*gun and gunshot.*who shot who" would be nice. I find that most of my regexp searches, even in grep, are just looking for parts of a sentence or code block using .* .
However, the above comment that most people wouldn't be using regexes in your engine is a valid one. You'd probably want to pass off non-regex searches to a better-suited engine (i.e. Google), while handling the real searches yourself.
Re:I'd buy that for a dollar... (Score:2)
The spam thing is valid, although it's already hazardous to post an email address on the web. Try typing your phone number into Google - that's another surprise that most people aren't aware of.
Possibly... (Score:3, Informative)
Google has good ranking (Score:2)
Google doesn't have a mind-bendingly better selection system (it's a lot like any other search engine), but their ranking is, of course, their main advantage.
The issue for a search engine like google would be to cut over from a keyword-based inverted index to something a bit more flexible, while maintaining continuity with the current system.
I think it's possible. We have the technology...(cut to six million dollar man intro).
To the "It won't work" folks (Score:3, Interesting)
The most promising method for supporting this idea is a full-text index: one which allows any byte sequence in the source to be looked up quickly. That way, a regexp can be matched by probing the index directly rather than scanning pages.
It's also important to know when no more matches exist: for instance, if "ab" has no hits, then "abi" doesn't need to be checked.
The big leap which makes this seem possible is Ferragina and Manzini's FM index [unipmn.it]. This method takes the size of a full-text index from somewhere around 10 times the source, to around 40%, including the text as well as the indexing. Their algorithm is described in relation to fixed-length patterns, but it's a trivial extension to handle regexp-generated sets of patterns as well.
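For regexps restricted to literals, character classes, and alternation, that extension is just enumerating the finite set of strings the pattern can generate and looking each one up (unbounded operators like * need the incremental probing instead). A Python sketch with an illustrative encoding:

```python
from itertools import product

def expand(parts):
    # parts is a list of alternative-lists, e.g. the regexp gr[ae]y becomes
    # [["gr"], ["a", "e"], ["y"]]. The cross product enumerates every string
    # the pattern can match, each of which is a plain fixed-pattern lookup.
    return {"".join(choice) for choice in product(*parts)}

patterns = expand([["gr"], ["a", "e"], ["y"]])
```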
In the past few weeks I've been working on an implementation of a similar algorithm with possible performance improvements.
So, the short answer is "yes, it's possible". There are a few hitches here and there, but in comparison to what I knew a year ago, it's much more workable than I would have guessed.
Re:To the "It won't work" folks (Score:2, Interesting)
This [mit.edu] gives a worst-case linear lower bound on the size of an index structure for substring search, which is obviously necessary for "full" regexp power. Of course, I doubt anyone really wants full regexps; the challenge you face is constructing a powerful enough subset that is easy to implement.
Personally, like other posters have mentioned, I am only really interested in stem searches such as stem*.
Why not start from Google (Score:1)
Or a simple perl script that searches the results given back by the web site?
Re:Why not start from Google (Score:1)
Firstly, the Google API will only return 10 results at a time, IIRC, meaning that it wouldn't be possible to meaningfully rank the sites unless you entered a loop to get all the search results from Google, and there could be lots.
Secondly, it means that you need to be able to search Google first before you can pass a regex filter over it - and what string would you use to search Google with?
Even if you could get Google to return likely pages for the regex, you'd still
Soundex or even better Daitch-Mokotoff (Score:1)
Daitch-Mokotoff is able to handle many languages, compared to the almost exclusively English Soundex, so I would rather use this algorithm.
And I don't think it would hinder performance that much since you can cache results just as you can with normal queries.
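For reference, classic Soundex fits in a dozen lines; the sketch below simplifies (it skips the h/w separator rule), and Daitch-Mokotoff has the same shape but with multi-letter, multi-code rules, which is what buys the language coverage:

```python
# Letter -> digit groups; index 0 is the "ignored" vowel/h/w/y group.
CODES = {c: str(d) for d, letters in enumerate(
    ["aeiouyhw", "bfpv", "cgjkqsxz", "dt", "l", "mn", "r"]) for c in letters}

def soundex(word):
    # Simplified Soundex: keep the first letter, encode the rest, collapse
    # adjacent duplicate codes, drop the vowel group, pad/trim to 4 chars.
    word = word.lower()
    digits = [CODES[c] for c in word if c in CODES]
    out = [word[0].upper()]
    prev = digits[0] if digits else ""
    for d in digits[1:]:
        if d != prev and d != "0":
            out.append(d)
        prev = d
    return ("".join(out) + "000")[:4]
```

Similar-sounding names collapse to the same code, which is what makes cached lookups work.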
intelligent searching (Score:1)
Thunderstone Texis... (Score:2, Informative)
Check it out: http://www.thunderstone.com/ [thunderstone.com]
Yes! (Score:2, Interesting)
Actually, I'd be pretty satisfied if Google supported the advanced boolean search that Altavista has. When Altavista had one of, if not the, best databases I regularly used it. Take a look at:
Altavista Special Search Terms [altavista.com]
I find that a combination of wildcards, AND, OR, NEAR, NOT, grouping via parentheses and being able to search specifically for a
google api (Score:1, Interesting)
The results are fairly good. It's on sourceforge if anyone wants to use it. They seem to be down right now, or I'd give the url.
s/? (Score:1)