Ask Slashdot: Speeding Up Personal Anti-Spam Filters?
New submitter hmilz writes "I've been using procmail for years to filter my incoming mail, and over time I have built up a long list of spam patterns. The good thing about the patterns is that there are practically no false positives and practically no false negatives, i.e. I see each new spam exactly once and lose no legit mail. This works by keeping an external spam-patterns file, containing one pattern per line, and running an 'egrep -F' against it. As simple as this is, with a long pattern list it becomes rather slow and CPU-intensive. An average mail currently takes about 15 seconds to grep. In other words, this has become quite clumsy over time, and I would like to replace it with a more efficient method (in CPU, and hence energy, terms). I was thinking about a small indexed database or something. What would you recommend and use if you were me? Is sqlite something to look at?"
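For readers who want to picture the setup being described, here is a minimal Python sketch of the same idea: a list of literal patterns, one per line, checked against a message. It is only an illustration, not the submitter's procmail recipe. The names `load_patterns` and `first_match` and the sample data are made up for the example.

```python
def load_patterns(lines):
    """Return the non-empty patterns, one per input line (blank lines skipped)."""
    return [line.rstrip("\n") for line in lines if line.strip()]


def first_match(message, patterns):
    """Return the first pattern found as a literal substring, else None."""
    for pattern in patterns:
        if pattern in message:
            return pattern
    return None


# Hypothetical sample data; a real setup would read the spam-patterns file.
patterns = load_patterns(["cheap pills\n", "\n", "wire transfer fee\n"])
message = "Subject: hello\n\nPlease pay the wire transfer fee today."
print(first_match(message, patterns))
```

This naive loop scans the message once per pattern, so its cost grows with the length of the pattern list. That is the kind of scaling the multi-pattern approaches suggested in the comments are meant to avoid.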
You could speed up your current solution (Score:5, Interesting)
ragel (Score:2, Interesting)
Try compiling your patterns using Ragel: http://www.complang.org/ragel/
Union them all together into a single machine and you could see an order-of-magnitude or better speedup (e.g. 10x-100x) over other regular expression engines. GNU grep already uses Aho–Corasick with the -F switch, though, so you're likely to see less of an improvement there.
Many people use re2c, but it has nowhere near the performance or capabilities of Ragel. Ragel has a steep learning curve, but it's well worth the effort to master. It's well maintained, and has been for years.
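To make the "union them all together" idea concrete, here is a rough Python sketch of an Aho–Corasick style automaton for literal patterns. It is not Ragel output and not GNU grep's implementation, just a small illustration of matching every pattern in one pass over the text. The function names and the sample patterns are assumptions made for the example.

```python
from collections import deque


def build_automaton(patterns):
    """Build trie transitions, failure links and output sets for the patterns."""
    goto = [{}]      # goto[state][char] -> next state
    fail = [0]       # fail[state] -> state for the longest proper suffix
    out = [set()]    # out[state] -> patterns that end at this state

    for pattern in patterns:
        if not pattern:
            continue  # an empty pattern would match everywhere, so skip it
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(pattern)

    # Breadth-first pass: a state's failure link depends on shallower states.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out


def find_matches(text, automaton):
    """Return the set of patterns that occur anywhere in text."""
    goto, fail, out = automaton
    state = 0
    found = set()
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        found |= out[state]
    return found


automaton = build_automaton(["he", "she", "hers", "his"])
print(sorted(find_matches("ushers", automaton)))  # ['he', 'hers', 'she']
```

The automaton is built once from the whole pattern list, after which each message is scanned a single time regardless of how many patterns there are. A spam filter would only need to know whether `find_matches` returns anything, so it could stop at the first hit.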
Re:Or... (Score:4, Interesting)
Is it? I've never had a false positive in all the years I've been using GMail.
That you noticed. There's a fairly strong bias inherent there: you would only spot a false positive if it hit a message that was both important enough to miss and one you knew was coming.
Re:Problem spotted. (Score:4, Interesting)
Re:Or... (Score:5, Interesting)
Email from people at one site I look after used to vanish into a black hole at gmail until I convinced them to replace the GIF of their corporate logo attached to all their emails with a PNG version. That's some real mail lost due to gmail's filtering.
IMHO it's better to do the filtering somewhere where you have access to the stuff that is discarded. False positives may be rare now but they still happen. That's why I like stuff such as MailScanner (open source wrapper for spamassassin+your choice of commercial antivirus and/or clamav+other open source stuff+distributed updating rulesets) run on site. There are plenty of others that give you this function, including some of the commercial "appliances" and outsourced email filtering.
It used to be the case that one IP address I have a mail server on would get blocked for a couple of days every year because some idiot at a blacklist would load in an obsolete list of dynamic IP addresses from what is now a decade ago. As IPv4 addresses run out, expect the lists of dynamic addresses to become outdated very quickly.