Software Spam Databases IT

Ask Slashdot: Speeding Up Personal Anti-Spam Filters? 190

New submitter hmilz writes "I've been using procmail for years to filter my incoming mail, and over time a long list of spam patterns was created. The good thing about the patterns is, there are practically no false positives, and practically no false negatives, i.e. I see each new spam exactly once, and lose no legit mail. This works by using an external spam-patterns file, containing one pattern per line, and running an 'egrep -f' against it. As simple as this is, with a long pattern list this becomes rather slow and CPU consuming. An average mail currently needs about 15 seconds to be grepped. In other words, this has become quite clumsy over time, and I would like to replace it with a more (CPU, hence energy) efficient method. I was thinking about a small indexed database or something. What would you recommend and use if you were me? Is sqlite something to look at?"

  • bogofilter (Score:5, Informative)

    by jon787 ( 512497 ) on Friday August 30, 2013 @09:08PM (#44721441) Homepage Journal

    http://bogofilter.sourceforge.net/ [sourceforge.net]

    I haven't timed it to see how well it's been doing in the 6 years I've had it though.

  • by Anonymous Coward on Friday August 30, 2013 @09:11PM (#44721455)

    Sorry, couldn't resist the pun.

    Your problem (besides not using existing Bayesian tools...) is that every single egrep is a fork. As others have pointed out, you should rewrite your script in something like Python and use the native regex libraries. Even if you have to read and 'compile' the regex list every time, you're saving a *massive* amount of OS-level overhead.
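
    A minimal sketch of the Python approach suggested above, assuming the spam-patterns file holds one regular expression per line with no backreferences or inline flags (otherwise joining them into one alternation can change their meaning). The function names are illustrative, not part of any existing tool. Whether this beats a single grep run depends on the patterns, so it is worth timing on real mail.

```python
import re


def load_patterns(lines):
    """Return the non-empty patterns from an iterable of lines."""
    patterns = []
    for line in lines:
        line = line.rstrip("\n")
        if line:
            patterns.append(line)
    return patterns


def compile_patterns(patterns):
    """Join all patterns into one alternation so each message is scanned once."""
    if not patterns:
        # "(?!)" is an empty negative lookahead, which can never match.
        return re.compile(r"(?!)")
    return re.compile("|".join("(?:%s)" % p for p in patterns))


def is_spam(message_text, matcher):
    """True if any pattern matches anywhere in the message."""
    return matcher.search(message_text) is not None
```

    In a procmail recipe this could be wrapped in a small script that reads the message from standard input and exits with 0 on a match, the way grep does. If the patterns are really fixed strings, passing each one through re.escape() first would be the safer choice.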

  • by PetiePooo ( 606423 ) on Friday August 30, 2013 @09:30PM (#44721541)

    ...Most of your time is likely spent parsing the patterns.

    I second that. And as your rules have built up, there are likely some that have never been used beyond when they were first put in. I'd instrument your next solution to identify outliers and cull them over time so your parser doesn't have to work so hard.
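
    One way to do that instrumentation, sketched in Python under the assumption that the patterns are regular expressions and that a corpus of already-classified spam is available as strings. The function names are hypothetical.

```python
import re
from collections import Counter


def count_hits(patterns, messages):
    """Count, per pattern, how many messages it matched."""
    compiled = [(p, re.compile(p)) for p in patterns]
    hits = Counter({p: 0 for p in patterns})
    for text in messages:
        for source, regex in compiled:
            if regex.search(text):
                hits[source] += 1
    return hits


def unused_patterns(hits):
    """Patterns that never matched: candidates for review, not automatic deletion."""
    return sorted(p for p, n in hits.items() if n == 0)
```

    Running this over a few months of saved spam would show which rules still earn their keep before anything is removed.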

  • Re:Or... (Score:5, Informative)

    by bmo ( 77928 ) on Friday August 30, 2013 @09:35PM (#44721573)

    Well, the OP wasn't exactly clear if it was just his personal account or whether it's a corporate server. My "least amount of work" thing is to forward every email address I have to gmail, and pull mail from there via imap. I get a few hundred spams a day just on one mail account, and I haven't lost any real mail due to Gmail's filtering.

    >postini

    You do realize that is being EOLed, yes?

    http://postini-transition.googleapps.com/ [googleapps.com]

    >why gmail filters better than postini

    Probably separate spam databases. Stuff like that happens. Gmail probably gets orders of magnitude more spam to "teach" the system.

    YMMV.

    Using grep and procmail is the stone-knives-and-bearskins approach to filtering. There are a lot of other filtering systems that will be much more efficient on Unix systems. He can begin by using greylisting to filter out the non-compliant "fire and forget" spambots and then filter the winnowed pile o' crap. At least greylisting's not server intensive (it throws the load back to the sender) since 5xx and 4xx errors are cheap.

    Also blocking mail from dynamic IPs is a good idea.

    At this point he can then run the mail through a series of weighted RBLs. Reach a certain score and it's tossed. That's the processor intensive bit, but it's at the end and non-intensive filtering has already happened.

    --
    BMO
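
    The weighted-RBL step described above could look roughly like this in Python. It is a sketch for IPv4 only: the weights, the threshold, and the zone dnsbl.example.org are made-up placeholders, and the resolver is passed in as a parameter so the scoring logic can be tried without network access.

```python
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNSBL lookup name for an IPv4 address: reversed octets + zone."""
    return ".".join(reversed(ip.split("."))) + "." + zone


def is_listed(ip, zone, resolve=socket.gethostbyname):
    """Treat the address as listed if the query name resolves at all.

    Simplification: some lists use particular return addresses to signal
    errors rather than listings, so check the list's documentation.
    """
    try:
        resolve(dnsbl_query_name(ip, zone))
        return True
    except OSError:  # socket.gaierror (no such name) is a subclass of OSError
        return False


def rbl_score(ip, weighted_zones, resolve=socket.gethostbyname):
    """Sum the weights of every zone that lists the address."""
    return sum(w for zone, w in weighted_zones.items() if is_listed(ip, zone, resolve))


# Hypothetical weights and threshold, for illustration only.
ZONES = {"zen.spamhaus.org": 3.0, "dnsbl.example.org": 1.5}
REJECT_AT = 3.0
```

    A message would be tossed when rbl_score(client_ip, ZONES) reaches REJECT_AT.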

  • Re:spamassassin (Score:5, Informative)

    by dbIII ( 701233 ) on Friday August 30, 2013 @10:20PM (#44721757)
    There is still stuff going on in the dev version with an svn commit listed on August 30 2013.
    http://spamassassin.markmail.org/search/?q=#query:%20list%3Aorg.apache.spamassassin.commits+page:1+state:facets
  • by Arrogant-Bastard ( 141720 ) on Friday August 30, 2013 @10:33PM (#44721801)
    If spam has made it far enough that it's actually reached your personal instance of procmail, then there's been a problem earlier in the chain. Procmail rulesets should be a last resort, and they should only be asked to deal with minor issues that aren't dealt with via earlier rulesets.

    The first line of defense is your perimeter routers. They should implement BCP 38, they should block bogons, and they should bidirectionally deny all traffic to/from the Spamhaus DROP list. In addition, they should block inbound port 25 traffic from everywhere on the planet that you don't need email from. In other words, the fact that someone in country X wants to email you is unimportant unless you actually wish to receive mail from them. Yes, this is a reversal of default-permit, for a simple reason: default-permit for SMTP stopped being reasonable around 2000. Use http://www.ipdeny.com/ [ipdeny.com] to pick up the ranges per-country and only permit what you need. (Obviously a major research university can't do this. But Joe's Furniture, which does not have customers in Peru or Pakistan or Greece, can.)
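
    The default-deny logic itself is simple. Here is a Python sketch of it, assuming the per-country range files contain one CIDR block per line. In practice this belongs in router or firewall rules rather than a script, and the function names are hypothetical.

```python
import ipaddress


def load_networks(lines):
    """Parse one CIDR range per line, skipping blank lines."""
    return [ipaddress.ip_network(line.strip()) for line in lines if line.strip()]


def is_permitted(client_ip, permitted_networks):
    """Default-deny: accept SMTP only from addresses inside a permitted range."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in permitted_networks)
```

    Anything for which is_permitted() returns False never gets to open a connection on port 25.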

    Then use blacklists, the best defense against spam we've ever developed. (Source: 30+ years of email experience) Spamhaus's Zen blacklist is a good one with a low FP rate and a tolerable FN rate. Augment these with local blacklists based on domains and network allocations. Augment those with as much blocking of generic hostnames and dynamic IP space as possible: real mail servers have real hostnames and are on static addresses.

    Then enforce RFC requirements: sending host must have rDNS, that PTR must resolve, what it resolves to should be the sending host's IP. Sending host must HELO as FQDN or bracketed dotted-quad; if FQDN, must resolve. Sending host must not send traffic pre-greeting. And so on. Enforcing these DOES mean occasionally you block mail sent by non-spamming entities: but since they are incompetent non-spamming entities, why would you want mail from them?
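
    The rDNS part of that check is often called forward-confirmed reverse DNS. A Python sketch, assuming IPv4 and using the standard socket resolver calls; the two lookup functions are parameters only so the logic can be exercised without DNS.

```python
import socket


def forward_confirmed_rdns(ip, reverse=socket.gethostbyaddr,
                           forward=socket.gethostbyname_ex):
    """True if ip has a PTR record whose name resolves back to the same ip."""
    try:
        hostname, _aliases, _addrs = reverse(ip)        # PTR lookup
        _name, _aliases, addresses = forward(hostname)  # A lookup on the PTR name
    except OSError:  # socket.herror and socket.gaierror are both OSError subclasses
        return False
    return ip in addresses
```

    A mail server would normally apply this during the SMTP session and refuse or defer the connection when it fails.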

    Add greylisting. It'll handle a lot of annoying hosts that haven't learned to retry yet.
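
    For anyone unfamiliar with the mechanism, greylisting can be sketched in a few lines of Python. The 300-second delay is an arbitrary illustrative value, the state lives in memory only, and old entries are never expired, so treat this as a model of the idea rather than something to deploy.

```python
import time


class Greylist:
    """Defer the first attempt from each (client IP, sender, recipient) triplet."""

    def __init__(self, min_delay=300, clock=time.time):
        self.min_delay = min_delay  # seconds a sender must wait before retrying
        self.clock = clock
        self.first_seen = {}

    def check(self, client_ip, sender, recipient):
        """Return an SMTP-style reply code: 451 to defer, 250 to accept."""
        key = (client_ip, sender.lower(), recipient.lower())
        now = self.clock()
        first = self.first_seen.setdefault(key, now)
        if now - first < self.min_delay:
            return 451
        return 250
```

    A compliant mail server queues the message and retries later, at which point it passes; fire-and-forget spamware usually never comes back.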

    Rate-limit based on normative values for your site. For example: if analysis of a year's worth of mail logs shows that during that time you never received more than 10 messages a day from ANY host, then rate-limit at 30 or 40. You'll never hit it in normal practice; but if you get hammered by a fast-sending host, you'll blunt the attack. Note that these don't have to be perfect to work: provided you send deferrals (SMTP response codes 4xx) instead of refusals (5xx) the worst that happens is that you will mistakenly impose a delay.
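
    As a rough Python illustration of that per-host limit: the class name is hypothetical, the default of 40 is taken from the example figures above, and the counters are kept in memory with no cleanup of old days.

```python
from collections import defaultdict


class DailyRateLimiter:
    """Defer (4xx) rather than refuse (5xx) once a host exceeds its daily limit."""

    def __init__(self, limit=40):
        self.limit = limit
        self.counts = defaultdict(int)

    def check(self, client_ip, day):
        """Count one message from client_ip on the given day.

        Returns 250 to accept or 451 to defer. `day` can be any hashable
        value that identifies the date, such as a datetime.date.
        """
        self.counts[(client_ip, day)] += 1
        if self.counts[(client_ip, day)] > self.limit:
            return 451
        return 250
```

    Because the answer over the limit is a deferral, a legitimate host that trips it by accident only sees its mail delayed.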

    There's more -- it's possible to get quite crafty about this. But note that NONE of these measures pay any attention to content. There's a reason for that: spammers can defeat content-based measures at will. They won't have it so easy with these.

    Deployed in production in various setups ranging from a dozen to eight million users, these steps yield a FP rate of about 10e-6 to 10e-7 and a FN rate around 10e-5 to 10e-6. Tuning helps, of course: initial rates can be higher but log analysis (which all sensible postmasters do) readily brings them down. If you have the luxury of running your own mail server just for yourself, then you can REALLY tune this setup: you should be able to get the FN rate down to 10e-7 after a few months.
  • Re:spamassassin (Score:5, Informative)

    by wvmarle ( 1070040 ) on Friday August 30, 2013 @10:37PM (#44721823)

    Add greylisting to the mix. For me it stops approx. 90% of junk at the gate. That alone saves >90% of your server's spam workload (90% of the spam checker; a bit extra due to the mail server not having to process the mail at all).

    Of course I don't know about legitimate mail but if someone is trying to send legitimate mail through a spam-type minimised mail server that doesn't retry, that's their problem...

  • Re:spamassassin (Score:5, Informative)

    by bill_mcgonigle ( 4333 ) * on Friday August 30, 2013 @11:48PM (#44722053) Homepage Journal

    The rule sets are updated pretty frequently - that's where the front lines of the battle are. As others have said, the engine is pretty mature.

    The question, I guess, is what do you want spamassassin to do that can't be expressed with the current rules language?
