Software Spam Databases IT

Ask Slashdot: Speeding Up Personal Anti-Spam Filters? 190

New submitter hmilz writes "I've been using procmail for years to filter my incoming mail, and over time a long list of spam patterns was created. The good thing about the patterns is, there are practically no false positives, and practically no false negatives, i.e. I see each new spam exactly once, and lose no legit mail. This works by using an external spam-patterns file, containing one pattern per line, and running an 'egrep -F' against it. As simple as this is, with a long pattern list this becomes rather slow and CPU consuming. An average mail currently needs about 15 seconds to be grepped. In other words, this has become quite clumsy over time, and I would like to replace it by a more (CPU, hence energy) efficient method. I was thinking about a small indexed database or something. What would you recommend and use if you were me? Is sqlite something to look at?"
This discussion has been archived. No new comments can be posted.

  • by russotto ( 537200 ) on Friday August 30, 2013 @09:00PM (#44721419) Journal
    Write something that uses a regular expression library (RE2 would be ideal, if your expressions are actually regular), and keeps the compiled patterns resident. Most of your time is likely spent parsing the patterns.
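A rough sketch of the "keep the compiled patterns resident" idea, not russotto's code. It uses Python's standard `re` module in place of RE2, and the pattern strings and the `load_patterns` / `is_spam` names are made up for illustration. The point is only that each pattern is parsed once and then reused for every message; to get the benefit across separate mails, this would have to live in a long-running process rather than being started fresh by procmail each time.

```python
import re

def load_patterns(lines):
    """Compile every non-empty pattern line exactly once."""
    stripped = (line.strip() for line in lines)
    return [re.compile(line) for line in stripped if line]

def is_spam(compiled, message):
    """True if any precompiled pattern occurs anywhere in the message."""
    return any(p.search(message) for p in compiled)

# Hypothetical patterns; in practice these would be read from the patterns file.
PATTERNS = load_patterns(["cheap pills", r"win \$\d+ now", ""])

# The same compiled list is reused for every message.
for mail in ["You can win $500 now!", "Lunch at noon?"]:
    print(is_spam(PATTERNS, mail), mail)
```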
  • ragel (Score:2, Interesting)

    by Anonymous Coward on Friday August 30, 2013 @09:14PM (#44721471)

    Try compiling your patterns using Ragel: http://www.complang.org/ragel/

    Union them all together and you'll see orders of magnitude improvement in performance (e.g. 10x - 100x) over other regular expression engines, although GNU grep is using Aho–Corasick with the -F switch, so you're likely to see less of an improvement.

    Many people use re2c, but it has nowhere near the performance or capabilities of Ragel. Ragel has a steep learning curve, but it's well worth the effort to master. It's well maintained, and has been for years.
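For plain fixed strings, the Aho-Corasick automaton mentioned above can be sketched in a few dozen lines. This is a minimal illustration in Python, not what Ragel or grep actually generate; the function names and sample patterns are invented. All patterns go into one trie with failure links, so a message is scanned once no matter how many patterns there are.

```python
from collections import deque

def build_automaton(patterns):
    """Build trie transitions, failure links and per-state outputs."""
    goto, fail, out = [{}], [0], [[]]
    for pat in patterns:
        if not pat:
            continue  # skip empty lines in the pattern list
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)
    # Breadth-first pass: depth-1 states keep fail = 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit patterns that end at the failure target.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_matches(automaton, text):
    """Return the set of patterns occurring in text, in one pass."""
    goto, fail, out = automaton
    state, found = 0, set()
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        found.update(out[state])
    return found

# Hypothetical spam strings for illustration only.
AUTOMATON = build_automaton(["viagra", "agra", "free money"])
print(find_matches(AUTOMATON, "buy viagra today"))
```

Scan time grows with the length of the message plus the number of matches, not with the length of the pattern list, which is the property that matters when the list keeps growing.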

  • Re:Or... (Score:4, Interesting)

    by Count Fenring ( 669457 ) on Friday August 30, 2013 @10:08PM (#44721711) Homepage Journal

    "Is it? I've never had a false positive in all the years I've been using GMail."

    That you noticed. There's a fairly high bias inherent there; it just has to not have hit something that was both noticeable and that you knew was incoming.

  • Re:Problem spotted. (Score:4, Interesting)

    by complete loony ( 663508 ) <Jeremy@Lakeman.gmail@com> on Friday August 30, 2013 @10:31PM (#44721793)
    If you have sufficient programming experience, I'd recommend basing this solution on redgrep [google.com]. It's an LLVM-based expression compiler that should be able to combine multiple expressions into a single machine-code state machine, assuming it doesn't run out of memory in the process. With a bit of effort you could output all of your compiled expressions into a single executable, so you'd only pay the compilation time when you add more filters.
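To illustrate only the "combine multiple expressions into one" part: a small Python sketch that merges the patterns into a single alternation and compiles it once. This is not redgrep. Python's `re` is a backtracking engine, so it does not produce a single DFA or machine code, and the `combine` / `first_match` names and the sample patterns are assumptions. It also assumes the individual patterns contain no named groups or numbered backreferences, since wrapping them would clash with or renumber those.

```python
import re

def combine(patterns):
    """Wrap each pattern in a named group (p0, p1, ...) and join with '|'."""
    parts = [f"(?P<p{i}>{pat})" for i, pat in enumerate(patterns)]
    return re.compile("|".join(parts))

def first_match(combined, patterns, text):
    """Return the source pattern of the first hit in text, or None."""
    m = combined.search(text)
    if m is None:
        return None
    # lastgroup names the outermost group that matched, e.g. "p1".
    return patterns[int(m.lastgroup[1:])]

# Hypothetical patterns for illustration only.
SPAM = [r"cheap\s+pills", r"win \$\d+", r"unsubscribe here"]
COMBINED = combine(SPAM)

print(first_match(COMBINED, SPAM, "Click to win $100 today"))
print(first_match(COMBINED, SPAM, "Lunch at noon?"))
```

Knowing which pattern fired is handy for pruning rules that never match anything.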
  • Re:Or... (Score:5, Interesting)

    by dbIII ( 701233 ) on Friday August 30, 2013 @10:37PM (#44721821)

    "and I haven't lost any real mail due to Gmail's filtering."

    Email from people at one site I look after used to vanish into a black hole at gmail until I convinced them to replace the GIF of their corporate logo attached to all their emails with a PNG version. That's some real mail lost due to gmail's filtering.

    IMHO it's better to do the filtering somewhere where you have access to the stuff that is discarded. False positives may be rare now but they still happen. That's why I like stuff such as MailScanner (an open source wrapper for SpamAssassin + your choice of commercial antivirus and/or ClamAV + other open source stuff + distributed updating rulesets) run on site. There are plenty of others that give you this function, including some of the commercial "appliances" and outsourced email filtering.

    "Also blocking mail from dynamic IPs is a good idea."

    It used to be the case that one IP address I have a mail server on would get blocked for a couple of days every year because some idiot at a blacklist would load in an obsolete list of dynamic IP addresses from what is now a decade ago. As IPv4 addresses diminish expect the lists of dynamic addresses to become outdated very quickly.
