Ask Slashdot: Speeding Up Personal Anti-Spam Filters?


Posted by timothy on Friday August 30, 2013 @08:52PM from the backup-your-email-in-case-of-rogue-filter dept.

New submitter hmilz writes "I've been using procmail for years to filter my incoming mail, and over time a long list of spam patterns was created. The good thing about the patterns is, there are practically no false positives, and practically no false negatives, i.e. I see each new spam exactly once, and lose no legit mail. This works by using an external spam-patterns file, containing one pattern per line, and running an 'egrep -F' against it. As simple as this is, with a long pattern list this becomes rather slow and CPU consuming. An average mail currently needs about 15 seconds to be grepped. In other words, this has become quite clumsy over time, and I would like to replace it by a more (CPU, hence energy) efficient method. I was thinking about a small indexed database or something. What would you recommend and use if you were me? Is sqlite something to look at?"

This discussion has been archived. No new comments can be posted.


spamassassin (Score:5, Insightful)

by mdaitc ( 619734 ) writes: on Friday August 30, 2013 @08:56PM (#44721401)

have you tried spamassassin?

- Re: (Score:3)
  
  by Scutter ( 18425 ) writes:
  
  Latest News: 2011-06-16: SpamAssassin 3.3.2 has been released, a minor new release primarily to support perl-5.12 and later. Visit the downloads page to pick it up, and for more info.
  Last update was more than two years ago. I know you can refresh your rule sets periodically, but is the software even still maintained?
  - Re:spamassassin (Score:5, Informative)
    
    by dbIII ( 701233 ) writes: on Friday August 30, 2013 @10:20PM (#44721757)
    
    There is still stuff going on in the dev version with an svn commit listed on August 30 2013.
    http://spamassassin.markmail.org/search/?q=#query:%20list%3Aorg.apache.spamassassin.commits+page:1+state:facets
    
  - Re:spamassassin (Score:5, Insightful)
    
    by wvmarle ( 1070040 ) writes: on Friday August 30, 2013 @10:39PM (#44721829)
    
    Maybe the software is pretty much finished? In that case there's not much more to do - no new features to add, and sooner or later you'll run out of bugs to fix.
    
    - Re: (Score:3)
      
      by Scutter ( 18425 ) writes:
      
      I don't think there's any such thing as "pretty much finished", especially with a piece of software involved in the arms race that is spam vs. filtering. There's only so much you can do with rules before you need to revisit your engine. Also, it's not just the software that's been stagnant for two years. The website itself hasn't been updated in as long. Not a single news item since 2011. The other respondent mentioned that dev is still active, but dev is not production. Dev is dev. Ever since Spamas
      - Re:spamassassin (Score:5, Informative)
        
        by bill_mcgonigle ( 4333 ) * writes: on Friday August 30, 2013 @11:48PM (#44722053) Homepage Journal
        
        The rules sets are updated pretty frequently - that's where the front lines of the battle are. As others have said, the engine is pretty mature.
        The question, I guess, is what do you want spamassassin to do that can't be expressed with the current rules language?
        
      - Re: (Score:2)
        
        by jgrahn ( 181062 ) writes:
        
        I don't think there's any such thing as "pretty much finished",
        There is; software designed according to "do one thing and do it well" ... for example the Unix cat(1) command is probably pretty stable by now. Same with fgrep(1).
        especially with a piece of software involved in the arms race that is spam vs. filtering.
        ... but yeah, well, I don't know Spamassassin but I suspect it has broader and more loosely-defined goals.
    - Re: (Score:2)
      
      by DNS-and-BIND ( 461968 ) writes:
      
      No, today's technology user has been brainwashed by mobile applications that update frequently. I have seen complaints on perfectly good software: "Has not been updated in a year whats wrong this software sux 1 star". Developers also use software updates as a sort of beta test: push it out, and if it crashes a lot of systems then update it again. Iterate as necessary. I've seen three releases in a day and five in a week using this "plan". The users don't help by considering mature (i.e. un-updated, ess
- Re:spamassassin (Score:5, Informative)
  
  by wvmarle ( 1070040 ) writes: on Friday August 30, 2013 @10:37PM (#44721823)
  
  Add greylisting to the mix. For me it stops approx. 90% of junk at the gate. That alone saves >90% of your server's spam workload (90% of the spam checker; a bit extra due to the mail server not having to process the mail at all).
  Of course I don't know about legitimate mail, but if someone is trying to send legitimate mail through a spam-type minimised mail server that doesn't retry, that's their problem...
  
  - Re: (Score:3)
    
    by Architect_sasyr ( 938685 ) writes:
    
    One client in 4 years of greylisting has had that problem, for something like 40,000 unique senders per month. I like those numbers.
  - Re: (Score:2)
    
    by fredklein ( 532096 ) writes:
    
    Just switch over to Email Certification.
    Long story short, everyone who wants to send Certified mail has to be 'certified' by their ISP. (UN-certified mail would still be possible, if
    you wish.) Getting certified is nothing more than providing enough information to positively identify you, and costs a nominal fee.
    In return, you create a public/private key pair, and give the public one to the certifier. The private key goes into your email server, which
    adds some headers to each outgoing email. One of these is
    - Re: (Score:2)
      
      by fisted ( 2295862 ) writes:
      
      What a shitty idea.
- Re: (Score:3)
  
  by FridayBob ( 619244 ) writes:
  
  On the mail servers I maintain, I employ SpamAssassin only as a last resort because it is resource-intensive. Submitter hmilz's approach is not only resource-intensive, but also labor-intensive, so I would never recommend it.
  I've used Exim for my MTA since 2001 and my main defense against spam has always been to filter it out before SpamAssassin comes into play based on analysis of header information and checking against DNS black lists. Actually, the first thing I do is look for obvious fakes from a limite
- Re: (Score:2)
  
  by whoever57 ( 658626 ) writes:
  
  have you tried spamassassin?
  
  Indeed. I just looked at logs on a server that acts as an incoming mail filter for a small company. The range of times for spamassassin (spamd) to filter the incoming emails was about 1 to 7 seconds, with most being in the range of 2-4 seconds. This is without bypassing spamd for large emails (spam can be relied upon to be small).
- O Hai. Has this been posted? (Score:2)
  
  by symbolset ( 646467 ) * writes:
  
  The canonical spam solution checklist. [craphound.com]
  I'm going with "Specifically, your plan fails to account for: (x) Users of email will not put up with it."
  - Re: (Score:2)
    
    by Antique Geekmeister ( 740220 ) writes:
    
    Thank you for posting that checklist, that's a vital document for any spam planning.
    SpamAssassin, executed through procmail on the mail client's email, is indeed resource intensive and does not scale well for an organization. Other people have mentioned other upstream filtering techniques, such as grey listing and DNS blacklists, but those are limited because of the large numbers of zombied Windows clients around the world, which have their resources rented as botnets to send spam from legitimate environmen
- - Re: (Score:2)
    
    by xrayspx ( 13127 ) writes:
    
    There's a lot to do to SA to make it "good". I shared your opinion a year ago. I run a relatively low volume personal mail server for a few domains and a few users. I had SA, but it didn't do much, and I had bigger fish to fry dealing with much larger mail sites than my stupid personal nonsense. I typically get about 300-500 spams a day, and very few legit mails. I was getting false positives, so I'd just never see the mail, and tons of false negatives. About 20% of the daily spam was hitting my inbox
You could speed up your current solution (Score:5, Interesting)

by russotto ( 537200 ) writes: on Friday August 30, 2013 @09:00PM (#44721419) Journal

Write something that uses a regular expression library (RE2 would be ideal, if your expressions are actually regular), and keeps the compiled patterns resident. Most of your time is likely spent parsing the patterns.
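
A minimal sketch of the "compile once, keep resident" idea, in Python. Note the assumptions: Python's `re` module is a backtracking engine, not RE2, so this only illustrates the structure (one combined pattern, compiled at startup and reused for every message). The three sample patterns and the function names are made up for illustration.

```python
import re

def load_patterns(lines):
    """Compile every non-blank pattern line into one combined regex."""
    patterns = [ln.strip() for ln in lines if ln.strip()]
    if not patterns:
        return re.compile(r"(?!)")  # an empty list must match nothing
    # Wrap each pattern in a non-capturing group so a '|' inside one
    # pattern cannot leak into its neighbours.
    combined = "|".join(f"(?:{p})" for p in patterns)
    return re.compile(combined, re.IGNORECASE)

def is_spam(compiled, message):
    """One pass of the precompiled regex over the whole message."""
    return compiled.search(message) is not None

# Hypothetical patterns; a real setup would read the spam-patterns file once.
SPAM_RE = load_patterns(["cheap meds", r"v[i1]agra", r"lottery\s+winner"])

for msg in ["Lunch tomorrow?", "You are a LOTTERY   WINNER"]:
    print(msg, "->", is_spam(SPAM_RE, msg))
```

The compile step happens once per process, so the benefit only shows up if the process stays alive across many messages.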

- Re:You could speed up your current solution (Score:5, Informative)
  
  by PetiePooo ( 606423 ) writes: on Friday August 30, 2013 @09:30PM (#44721541)
  
  ...Most of your time is likely spent parsing the patterns.
  I second that. And as your rules have built up, there are likely some that have never been used beyond when they were first put in. I'd instrument your next solution to identify outliers and cull them over time so your parser doesn't have to work so hard.
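
One way to instrument the pattern list for culling, sketched in Python under the assumption that a sample of past messages is available to replay. The function names and sample data are hypothetical; the idea is simply to count hits per pattern and list those that never fire.

```python
import re
from collections import Counter

def count_hits(patterns, messages):
    """Count, for each pattern, how many messages it matches."""
    compiled = [(p, re.compile(p)) for p in patterns]
    hits = Counter({p: 0 for p in patterns})  # keep zero-hit patterns visible
    for msg in messages:
        for pattern, rx in compiled:
            if rx.search(msg):
                hits[pattern] += 1
    return hits

def unused(hits):
    """Patterns that never matched: candidates for removal."""
    return sorted(p for p, n in hits.items() if n == 0)

sample = ["foo and bar together", "only foo here"]
stats = count_hits(["foo", "bar", "zzz"], sample)
print(stats.most_common())
print("never matched:", unused(stats))
```

A pattern with zero hits over a long enough sample is not necessarily dead, so it is probably safer to archive culled patterns than to delete them.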
  
- Re: (Score:2)
  
  by grcumb ( 781340 ) writes:
  
  Write something that uses a regular expression library (RE2 would be ideal, if your expressions are actually regular), and keeps the compiled patterns resident. Most of your time is likely spent parsing the patterns.
  I'm probably going to get shat on by kids who don't know any better, but....
  Use Perl. If a complex set of regular expressions is taking 15 seconds per email, then there's clearly something wrong with the implementation. I suspect you're doing too much backtracking [regular-expressions.info]. I've been guilty of the same in the past. In one case, simply anchoring my regular expressions to the start and end of the string reduced running time literally by two orders of magnitude. Just glom the whole message into a string and go nuts.
  And
- Re: (Score:2)
  
  by niftymitch ( 1625721 ) writes:
  
  Write something that uses a regular expression library (RE2 would be ideal, if your expressions are actually regular), and keeps the compiled patterns resident. Most of your time is likely spent parsing the patterns.
  Yes but a more resource friendly set of tools might begin with the OP's procmail to move the mail
  onto a local machine quickly. Filters inside of procmail are hobbled. Do this as one message
  per file (http://unix.stackexchange.com/questions/62563/savings-emails-as-individual-files-using-procmail).
  Procmail locks and gates are OS dependent but still slow.
  Next test each message with one or more simple "grep expressions" that then pass it or gate it
  to more complex expressions. On a multi core machine with a
- Current solution is awful (Score:2)
  
  by goombah99 ( 560566 ) writes:
  
  Here are several things you can do to make this faster.
  1) First, don't keep invoking egrep. Each invocation has to reload the egrep binary and parse its command line all over again. Instead, do the matching from within an already-loaded program. Perl is a very good choice for this.
  2) The Perl program can pre-compile the regular expressions, so you can leave it running as a process and simply feed it new data to analyse.
  3) given you are searching for words, you probably want to split the incoming stream on white
Database? (Score:3, Insightful)

by K. S. Kyosuke ( 729550 ) writes: on Friday August 30, 2013 @09:01PM (#44721421)

What would the database achieve? I'm not sure what is the exact nature of the patterns (an example would really help here), but perhaps writing a compiler from the patterns into some decision procedure in something reasonably efficient yet featuring quick start, such as SBCL or Gambit, could help.

- Re: (Score:2)
  
  by jgrahn ( 181062 ) writes:
  
  What would the database achieve?
  Some people believe the solution to any problem involving data is a database ...
- Re: (Score:2)
  
  by swillden ( 191260 ) writes:
  
  If he's using egrep -F, the patterns are fixed strings.
bogofilter (Score:5, Informative)

by jon787 ( 512497 ) writes: on Friday August 30, 2013 @09:08PM (#44721441) Homepage Journal

http://bogofilter.sourceforge.net/ [sourceforge.net]
I haven't timed it to see how well it's been doing in the 6 years I've had it though.

- Re: (Score:2)
  
  by SigmundFloyd ( 994648 ) writes:
  
  http://bogofilter.sourceforge.net/
  Seconded. Procmail + bogofilter + spam.mbox = no problem.
  I keep - and periodically review - a "spam" mbox for the rare false positive.
  I haven't timed it to see how well it's been doing in the 6 years I've had it though.
  It's written in C, so it's very likely much faster and leaner than Spamassassin.
What a forking awful solution... (Score:2, Informative)

by Anonymous Coward writes:

Sorry, couldn't resist the pun.
Your problem (besides not using existing Bayesian tools...) is that every single egrep is a fork. As others have pointed out, you should rewrite your script in something like Python and use the native regex libraries. Even if you have to read and 'compile' the regex list every time, you're saving a *massive* amount of OS-level overhead.
Distribute (Score:2)

by manu0601 ( 2221348 ) writes:

It seems you could easily distribute the load on multiple machines, each doing a subset of the regex.
- - Re: (Score:2)
    
    by Cryacin ( 657549 ) writes:
    
    Yeah, just use "the cloud" to "drag and drop" your email into, so that you can "sanitise your synergy" and "reclaim your potential".
    
    Phew, enough markitechture for one day. I have to go shower now.
  - Re: (Score:2)
    
    by jgrahn ( 181062 ) writes:
    
    It seems you could easily distribute the load on multiple machines, each doing a subset of the regex.
    There's something hilarious about having to distribute email filtering across several machines.
    No; it's tragic that people reach for distributed, or "multi-core", whenever something runs too slowly. If filtering a mail according to a set of REs takes 15 seconds of CPU time like the OP writes, he's clearly doing something wrong, or hitting some limitation of procmail's design (being unable to amortize work, such as reading and parsing the REs, between mails).
    Distributing wasteful work is not the right solution, in particular not if energy efficiency is a major concern like the OP suggests. (And I fin
ragel (Score:2, Interesting)

by Anonymous Coward writes:

Try compiling your patterns using Ragel: http://www.complang.org/ragel/
Union them all together and you'll see orders of magnitude improvement in performance (e.g. 10x - 100x) over other regular expression engines, although GNU grep is using Aho–Corasick with the -F switch, so you're likely to see less of an improvement.
Many people use re2c, but it has nowhere near the performance or capabilities of Ragel. Ragel has a steep learning curve, but it's well worth the effort to master. It's well maintained,
DIY (Score:2)

by Jmc23 ( 2353706 ) writes:

http://www.gigamonkeys.com/book/practical-a-spam-filter.html [gigamonkeys.com] has the nuts and bolts. CL-PPCRE does perl regex matching faster than perl.
perl or python or whatever (Score:2)

by retchdog ( 1319261 ) writes:

I've heard, but never timed it myself, that perl is faster for regexp-type stuff than even the specialized tools, just from the massive amount of optimization it has accrued over the years; here [perlmonks.org] is a completely unbiased source. Use a perl or python script, and consider using Storable (perl) or pickle (python) to serialize the data structure, I guess, but just having the whole list in memory will help.
According to this [uidaho.edu], perl regexps are (unsurprisingly) a superset of egrep's.
I don't see how introducing SQL c
- Re: (Score:2)
  
  by Jmc23 ( 2353706 ) writes:
  
  regexps in cl-ppcre are faster than perl.
  - Re: (Score:3)
    
    by retchdog ( 1319261 ) writes:
    
    doing anything but repeated egreps is probably fast enough. he should do whatever is easiest, which probably isn't lisp.
    - Re: (Score:2)
      
      by Jmc23 ( 2353706 ) writes:
      
      Don't bring your prejudices into this!
      It doesn't get much easier than someone not only handing you the code but also holding your hand and walking through every single function. Unless you want to use a magic black box and where's the fun in that?
Matching multiple simultaneous regular expressions (Score:3)

by careysb ( 566113 ) writes: on Friday August 30, 2013 @09:36PM (#44721575)

Many years ago I worked with a Unix development tool called LEX that could handle matching multiple patterns simultaneously. Perhaps there is an updated tool that would do the same thing. Java has a 3rd party library called ANTLR that might do the trick. It would involve recompiling every time a new pattern is added, but it should be extremely fast.
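
For a rough feel of the lex-style approach without a code generator, here is a sketch in Python: every rule becomes a named group in one combined regex, and the scanner reports which rule fired. This is not LEX or ANTLR, and Python's `re` does not build a DFA, so it only illustrates the "many rules, one scanner" shape. The rule names and patterns are invented, and the rules are assumed not to contain capturing groups of their own.

```python
import re

# Hypothetical rules: (name, regex) pairs.
RULES = [
    ("PILLS", r"cheap\s+pills"),
    ("LOTTERY", r"lottery"),
    ("PRINCE", r"nigerian\s+prince"),
]

# One alternation of named groups, compiled once.
SCANNER = re.compile(
    "|".join(f"(?P<{name}>{rx})" for name, rx in RULES),
    re.IGNORECASE,
)

def matched_rules(text):
    """Return the names of the rules that fire, in order of appearance."""
    return [m.lastgroup for m in SCANNER.finditer(text)]

print(matched_rules("Lottery news: cheap  pills"))
```

Adding a rule means rebuilding `SCANNER`, which mirrors the recompile step mentioned above.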

Sqlite will be awesome (Score:3)

by swillden ( 191260 ) writes: <shawn-ds@willden.org> on Friday August 30, 2013 @09:36PM (#44721577) Journal

Sqlite, or anything that uses an index, will be screaming fast.
Your statement of your current solution makes me wonder, though.. are you using "egrep -F -f pattern_file e_mail_message"? Or are you running egrep many times, once per line of the pattern file, or once per line of the message? I would think that given a pattern file egrep would be smart enough to do something better than repeatedly scanning the input, but based on the time it's taking, it sounds like that's happening.

- Re: (Score:2)
  
  by neonsignal ( 890658 ) writes:
  
  I doubt that he is using "grep -F -f ...", because fgrep can search for a hundred thousand patterns in a megabyte of data in under a second even on a modest machine (and most of the time is building up the regex state machine). I suspect he is using "egrep -f", and lots of patterns with wildcards. Worse, he will be running it once on each email, which means rebuilding the regex state machine each time.
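
To show why fixed-string matching against a huge list can be so fast, here is a compact Aho-Corasick sketch in Python. It is a textbook version written for illustration, not what any particular grep does internally; the function names are assumptions. All patterns go into one automaton, and the message is scanned exactly once.

```python
from collections import deque

def build_automaton(patterns):
    """Build Aho-Corasick tables: transitions, failure links, outputs."""
    goto = [{}]      # goto[state][char] -> next state
    fail = [0]       # failure link for each state
    out = [set()]    # patterns recognised on reaching each state
    for pat in patterns:
        if not pat:
            continue  # an empty pattern would match everywhere
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(pat)
    # Breadth-first pass: compute failure links level by level.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out

def find_all(text, automaton):
    """Return the set of patterns that occur anywhere in text."""
    goto, fail, out = automaton
    state, found = 0, set()
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        found |= out[state]
    return found

ac = build_automaton(["he", "she", "his", "hers"])
print(sorted(find_all("ushers", ac)))
```

Building the automaton is the expensive part, which is why rebuilding it for every single mail throws away most of the advantage.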
- Re: (Score:2)
  
  by jon3k ( 691256 ) writes:
  
  Can someone explain what the big O notation would be for this? I'm still trying to wrap my brain around big O notation.
  
  I'd think it would be O(n^2) but that can't be right because it's two different sets of data (not N raised to itself). So is there even a: O(N^X)?
  
  I'm assuming that for each mail message (outer loop) each RegEx is processed, which might be an incorrect assumption.
  - Re: (Score:2)
    
    by swillden ( 191260 ) writes:
    
    O(NM).
    If the pattern file has N lines and the e-mail has M lines, and if we count comparing one line of the pattern file against one line of the e-mail as one operation, then for each of the N lines of the file, we have to do M comparisons.
    There are better algorithms than this obvious one, though, and it would surprise me if egrep didn't use one of them when given the whole list of patterns at once.
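
A rough way to put symbols on this, using notation introduced here for illustration and counting character operations rather than line comparisons: let the N fixed-string patterns have lengths \ell_1, ..., \ell_N, let the message have m characters, and let z be the number of matches reported. Trying every pattern at every position is the same shape as the O(NM) above, while an Aho-Corasick style automaton pays for the patterns once and then scans the message once.

```latex
T_{\text{naive}} = O\!\left(m \sum_{i=1}^{N} \ell_i\right)
\qquad \text{versus} \qquad
T_{\text{AC}} = O\!\left(\sum_{i=1}^{N} \ell_i + m + z\right)
```

The first term of the right-hand bound is the one-time build cost, so it only amortizes if the automaton survives from one message to the next.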
- - Re: (Score:2)
    
    by nabsltd ( 1313397 ) writes:
    
    It's actually a single "egrep -i -o -f " per mail. For each mail, egrep is forked exactly once, which means writing my own tool will not reduce the OS overhead. I might give perl a try though, but I doubt that forking perl will be much faster than forking grep.
    So, stop forking. There are lots of spam filters that use perl as the engine and run as daemons with a socket to write to. This keeps the compiled perl regular expressions in memory (assuming you're not swapping because of low memory).
    Spamassassin can use pretty much the same file you have right now as a source for patterns, and runs quicker than what you are seeing for your setup. I don't use any custom rules for SA, and only see about 5 spam e-mails per week in my mail client, and all end up in the "ma
The simpler solution is ... (Score:2)

by Skapare ( 16644 ) writes:

... I just gave up on email. Even w/o spam it's more hassle than I like.
Easy way to handle spam... (Score:2)

by NotQuiteReal ( 608241 ) writes:

Just route everything from Facebook, LinkedIn, my dad, Apple and "i*" to the spam folder, and most of it is covered.
Problem spotted. (Score:5, Insightful)

by girlintraining ( 1395911 ) writes: on Friday August 30, 2013 @10:15PM (#44721735)

The problem is that you're using egrep in the first place. Here's the thing -- the overwhelming majority of your cycles are getting sucked up loading, initializing, executing, then unloading that process. It's not that using regular expressions is processor-intensive... it's that repeatedly launching the same executable is.
Use something that can load once, read in the patterns, check all the e-mails that are queued, sort them, then exit. Your execution time will go from 15 seconds to 150 milliseconds.

- Re:Problem spotted. (Score:4, Interesting)
  
  by complete loony ( 663508 ) writes: <Jeremy.Lakeman@g ... .com minus punct> on Friday August 30, 2013 @10:31PM (#44721793)
  
  If you have sufficient programming experience, I'd recommend basing this solution on redgrep [google.com]. It's an llvm based expression compiler that should be able to combine multiple expressions into a single machine code state machine, assuming it doesn't run out of memory in the process. With a bit of effort you could output all of your compiled expressions into a single executable so you'll only need to wait for the compilation time when you add more filters.
  
- Re: (Score:2)
  
  by arth1 ( 260657 ) writes:
  
  You mean like doing an egrep -F instead of multiple egreps? I sure hope he already does.
Procmail is a fine tool -- but the wrong tool (Score:5, Informative)

by Arrogant-Bastard ( 141720 ) writes: on Friday August 30, 2013 @10:33PM (#44721801)

If spam has made it far enough that it's actually reached your personal instance of procmail, then there's been a problem earlier in the chain. Procmail rulesets should be a last resort, and they should only be asked to deal with minor issues that aren't dealt with via earlier rulesets.

The first line of defense is your perimeter routers. They should implement BCP 38, they should block bogons, and they should bidirectionally deny all traffic to/from the Spamhaus DROP list. In addition, they should block inbound port 25 traffic from everywhere on the planet that you don't need email from. In other words: the fact that someone in country X wants to email you is unimportant unless you actually wish to receive mail from them. Yes, this is a reversal of default-permit, for a simple reason: default-permit for SMTP stopped being reasonable around 2000. Use http://www.ipdeny.com/ [ipdeny.com] to pick up the ranges per-country and only permit what you need. (Obviously a major research university can't do this. But Joe's Furniture, which does not have customers in Peru or Pakistan or Greece, can.)
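
The default-deny idea would normally live in router or firewall rules, but the membership test itself is easy to sketch in Python with the standard `ipaddress` module. The two ranges below are documentation prefixes standing in for real per-country lists, and `smtp_allowed` is a hypothetical name.

```python
import ipaddress

# Placeholder allow-list: RFC 5737 documentation ranges, standing in for
# the per-country CIDR blocks one would actually download.
ALLOWED = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24")
]

def smtp_allowed(ip):
    """Default-deny: accept port 25 traffic only from listed networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in ALLOWED)

for ip in ("192.0.2.55", "203.0.113.9"):
    print(ip, "->", "permit" if smtp_allowed(ip) else "deny")
```

A linear scan is fine for a handful of ranges; a list with thousands of prefixes would want a radix tree or the firewall's own set type.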

Then use blacklists, the best defense against spam we've ever developed. (Source: 30+ years of email experience) Spamhaus's Zen blacklist is a good one with a low FP rate and a tolerable FN rate. Augment these with local blacklists based on domains and network allocations. Augment those with as much blocking of generic hostnames and dynamic IP space as possible: real mail servers have real hostnames and are on static addresses.
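
For readers who have not seen how a DNS blacklist is queried: the client reverses the octets of the sending IPv4 address, appends the list's zone, and does an ordinary A-record lookup. The Python sketch below assumes the Zen zone name `zen.spamhaus.org` and treats any answer as a listing, which is a simplification (real lists encode meaning, including error conditions, in the returned address). The `resolve` parameter exists only so the lookup can be swapped out.

```python
import ipaddress
import socket

def dnsbl_name(ip, zone="zen.spamhaus.org"):
    """Build the query name: reversed IPv4 octets followed by the zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone

def is_listed(ip, zone="zen.spamhaus.org", resolve=socket.gethostbyname):
    """True if the list returns an A record for this address."""
    try:
        resolve(dnsbl_name(ip, zone))
        return True
    except socket.gaierror:
        return False  # no such name: the address is not listed

print(dnsbl_name("127.0.0.2"))
```

In practice the MTA does this lookup itself at connection time, so a script like this is mainly useful for checking why a given host was rejected.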

Then enforce RFC requirements: sending host must have rDNS, that PTR must resolve, what it resolves to should be the sending host's IP. Sending host must HELO as FQDN or bracketed dotted-quad; if FQDN, must resolve. Sending host must not send traffic pre-greeting. And so on. Enforcing these DOES mean occasionally you block mail sent by non-spamming entities: but since they are incompetent non-spamming entities, why would you want mail from them?
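
The rDNS part of this check (PTR exists, resolves, and points back at the sender) is often called forward-confirmed reverse DNS. A hedged Python sketch using only standard `socket` calls follows; the function name is an assumption, and the two resolver parameters are there so the check can be exercised without a network.

```python
import socket

def forward_confirmed_rdns(ip,
                           reverse=socket.gethostbyaddr,
                           forward=socket.gethostbyname_ex):
    """True if ip has a PTR whose name resolves back to the same ip."""
    try:
        hostname = reverse(ip)[0]          # PTR lookup
        addresses = forward(hostname)[2]   # addresses for that name
    except OSError:                        # herror and gaierror both derive from OSError
        return False
    return ip in addresses
```

An MTA would typically answer a failed check with a rejection at connect or HELO time, before any message data is transferred.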

Add greylisting. It'll handle a lot of annoying hosts that haven't learned to retry yet.
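
The core of greylisting fits in a few lines. This Python sketch assumes the common triplet scheme (sending IP, envelope sender, envelope recipient) and a five-minute minimum delay; both are illustrative choices, and the class name is made up. The first attempt from an unknown triplet gets a temporary failure, and a retry after the delay is accepted.

```python
import time

class Greylist:
    """Defer the first delivery attempt from each new triplet."""

    def __init__(self, min_delay=300, clock=time.time):
        self.min_delay = min_delay   # seconds a new triplet must wait (assumed value)
        self.clock = clock
        self.first_seen = {}         # triplet -> time of first attempt

    def check(self, ip, sender, recipient):
        """Return an SMTP-style code: 451 to defer, 250 to accept."""
        triplet = (ip, sender.lower(), recipient.lower())
        now = self.clock()
        first = self.first_seen.setdefault(triplet, now)
        if now - first < self.min_delay:
            return 451               # temporary failure: a real MTA retries later
        return 250
```

A production version would also expire old entries and whitelist known-good senders, which is where most of the real-world tuning goes.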

Rate-limit based on normative values for your site. For example: if analysis of a year's worth of mail logs shows that during that time you never received more than 10 messages a day from ANY host, then rate-limit at 30 or 40. You'll never hit that limit in normal practice; but if you get hammered by a fast-sending host, you'll blunt the attack. Note that these don't have to be perfect to work: provided you send deferrals (SMTP response codes 4xx) instead of refusals (5xx) the worst that happens is that you will mistakenly impose a delay.
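
A sliding-window limiter along these lines could be sketched as follows in Python. The limit of 40 per day comes from the example figures above; the class name and the choice of reply codes 250/451 are assumptions for illustration. The key point is that an over-limit host gets a deferral, never a hard refusal.

```python
import time
from collections import defaultdict, deque

class RateLimiter:
    """Per-host sliding-window limit that defers instead of refusing."""

    def __init__(self, limit=40, window=86400, clock=time.time):
        self.limit = limit               # messages allowed per window
        self.window = window             # window length in seconds
        self.clock = clock
        self.seen = defaultdict(deque)   # host -> timestamps of accepted mail

    def check(self, host):
        """Return 250 to accept, or 451 to defer an over-limit host."""
        now = self.clock()
        stamps = self.seen[host]
        while stamps and now - stamps[0] >= self.window:
            stamps.popleft()             # forget mail older than the window
        if len(stamps) >= self.limit:
            return 451                   # deferral (4xx), not refusal (5xx)
        stamps.append(now)
        return 250
```

Because the sender is only deferred, a legitimate host that trips the limit by accident simply delivers later.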

There's more -- it's possible to get quite crafty about this. But note that NONE of these measures pay any attention to content. There's a reason for that: spammers can defeat content-based measures at will. They won't have it so easy with these.

Deployed in production in various setups ranging from a dozen to eight million users, these steps yield a FP rate of about 10e-6 to 10e-7 and a FN rate around 10e-5 to 10e-6. Tuning helps, of course: initial rates can be higher but log analysis (which all sensible postmasters do) readily brings them down. If you have the luxury of running your own mail server just for yourself, then you can REALLY tune this setup: you should be able to get the FN rate down to 10e-7 after a few months.

- Re: (Score:2)
  
  by thegarbz ( 1787294 ) writes:
  
  That's a very informative post, but the first part is making a big assumption that someone has that level of control over the network. Much of what you say is exactly the type of filtering that is applied by spamassassin and other various tools at the end of the chain. Many of us don't have the option to work higher up the chain.
regular expression optimiser (Score:3)

by lkcl ( 517947 ) writes: <lkcl@lkcl.net> on Friday August 30, 2013 @11:09PM (#44721937) Homepage

i'd be interested to see what happens if you run those regex's through this:
http://bisqwit.iki.fi/source/regexopt.html [bisqwit.iki.fi]
btw can we please get a copy of the patterns you're using? i think they might prove useful for other people. also i'd like to test them myself against regexopt.
oh - to the other person who suggested spamassassin? i tried that, i set it up to run at MTA-time. it often took THIRTY SECONDS to process a message. in fact it was so bad that i was forced to set a limit of 100k on incoming messages, as a lot of virus-ridden word documents (etc) were typically over 100k. that cut down the amount of CPU cycles but it was still far far too much memory and far too CPU intensive.
the one thing that did work well is greylisting, however the problem with greylisting i find is that if you happen not to be at the computer or have direct access to the server and people on the phone say "i'm sending you a message now, have you got it?" you *know* it's going to be at least an hour before it'll arrive. so, unless you can whitelist them in advance (which you can't always do) greylisting does actually interfere with legitimate business.
anyway: in the end i gave up and went to gmail, but with gmail fucking up how they're doing things i have to revisit this and set up a mail server again. thus we come full circle...

- Re: (Score:2)
  
  by thegarbz ( 1787294 ) writes:
  
  oh - to the other person who suggested spamassassin? i tried that, i set it up to run at MTA-time. it often took THIRTY SECONDS to process a message. in fact it was so bad that i was forced to set a limit of 100k on incoming messages, as a lot of virus-ridden word documents (etc) were typically over 100k.
  I'm sorry to say it but you must be doing something wrong. I have a very default installation of spamassassin and sendmail also running at MTA time and on my really crappy old spare parts server it never takes more than a second or two to process a mail item. This also does not appear to vary depending on email size, a 10MB email seems to take just as long as a plain text one. I don't think out of the box spam-assassin checks attachments or any type of external content.
  - Re: (Score:2)
    
    by lkcl ( 517947 ) writes:
    
    thanks thegarbz - i didn't mention that i added in pyzor and razor, and i think clamav as well. also as my domain's been up for a while it does receive a considerable amount of spam. the load just got to be too much. i'll investigate alternatives and also bear in mind that spamassassin worked well for you.
Use perl (Score:3)

by Forever Wondering ( 2506940 ) writes: on Saturday August 31, 2013 @01:13AM (#44722301)

A long time ago I benchmarked perl's regex engine against about 5 others. At the time, it was 10x faster than the nearest competitor for the same regex/data.
Also, you can use perl's "study". Or, split the regexes across threads.
Also, with perl you can do some hierarchical savings. For example:
/Ffoo/ ...
/Fbar/ ...
/Fbaz/ ...
Could be redone as:
if (/F/) {
    if (/Ffoo/) { ... }
    if (/Fbar/) { ... }
    if (/Fbaz/) { ... }
}
The above is a trivial example, but you get the idea.
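
The same gating idea in Python, as a sketch: a cheap substring test decides whether the detailed patterns are worth running at all. The patterns mirror the toy example above and the function name is made up. Whether this helps in practice depends on the regex engine, since many engines already do some literal prefiltering on their own.

```python
import re

# The detailed patterns all share the literal "F", as in the example above.
DETAILED = [re.compile(p) for p in (r"Ffoo", r"Fbar", r"Fbaz")]

def matches(text):
    """Return the detailed patterns that match, skipping them when the gate fails."""
    if "F" not in text:      # cheap gate: plain substring test, no regex engine
        return []
    return [rx.pattern for rx in DETAILED if rx.search(text)]

print(matches("xxFbarxx"))
print(matches("no capital letter of interest"))
```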
Also, how much time is spent compiling (vs. executing) the regexes in egrep? I imagine a lot and you have to do this for each incoming message.
Note that spamassassin (and hence perl) can be set up as a daemon where the regexes are compiled once. The messages are passed through a socket to the daemon. This means that the only CPU time spent is on executing the regexes--a considerable savings.
Additionally, perl regexes have [considerably] more functionality/utility than egrep ones. You might be able to recode/consolidate yours and get the same [or better] bang for less buck.
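
The daemon arrangement described in this comment (patterns compiled once, messages handed over through a socket) might look roughly like this in Python. This is not how spamd is implemented; it is a small sketch using the standard `socketserver` module, with a hypothetical socket path, two invented patterns, and a one-message-per-connection protocol chosen for simplicity. It assumes a Unix-like system.

```python
import re
import socketserver

# Compiled once, when the daemon starts (hypothetical patterns).
SPAM_RE = re.compile(r"cheap\s+pills|lottery\s+winner", re.IGNORECASE)

def verdict(message):
    """Classify one message using the resident compiled regex."""
    return "SPAM" if SPAM_RE.search(message) else "HAM"

class FilterHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # The client sends one message, then shuts down its write side.
        message = self.rfile.read().decode("utf-8", errors="replace")
        self.wfile.write(verdict(message).encode() + b"\n")

def serve(path="/tmp/spamfilter.sock"):
    """Answer one message per connection until interrupted."""
    with socketserver.UnixStreamServer(path, FilterHandler) as server:
        server.serve_forever()
```

Calling `serve()` starts the daemon; a stale socket file from a previous run has to be removed first. The per-mail cost then shrinks to a small client that connects, writes the message, and reads back one line.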

Buy a domain (Score:2)

by postglock ( 917809 ) writes:

I don't even use spam blockers. Instead I've purchased a domain, which is quite affordable nowadays. I have a catch-all redirect, so I receive any mail addressed to *@mydomain.com.
Then, I give a unique username to each organisation. e.g. slashdot@mydomain.com. If I receive spam at this address, I inform them, then kill the username. I can also just create slashdot2@mydomain.com if I want to keep dealing with their company.
Now, I receive only a few spam emails each year, so I need to do zero automated filtering. I a
- Re: (Score:2)
  
  by jon3k ( 691256 ) writes:
  
  You can also do this via gmail. Gmail will accept email sent to +@gmail.com and deliver it to you. Try it out.
  
  So anytime you sign up for something, just use: postglock+slashdot@gmail.com. Then if you get spam, just look at the "To:" address, you can even write a filter based on the + sign in the "To:" field, if you wanted.
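
Pulling the tag out of a plus-address is straightforward; here is a small Python sketch using the standard `email.utils.parseaddr`. The function name is hypothetical and the addresses are the ones from the example above.

```python
from email.utils import parseaddr

def plus_tag(to_header):
    """Return the tag after '+' in the local part, or None if absent."""
    _name, addr = parseaddr(to_header)
    local, _at, _domain = addr.partition("@")
    _base, plus, tag = local.partition("+")
    return tag if plus else None

print(plus_tag("postglock+slashdot@gmail.com"))
print(plus_tag("postglock@gmail.com"))
```

A filter could then keep a small set of tags that have started attracting spam and file anything addressed to them accordingly.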
  - Re: (Score:2)
    
    by jon3k ( 691256 ) writes:
    
    sorry, replying to myself. First line should have been: any email delivered to name+any_string@gmail.com
    - Re: (Score:2)
      
      by postglock ( 917809 ) writes:
      
      I did hear about this, but I hadn't thought about writing a filter after receiving spam. That's a cool idea.
      The only part that makes me slightly wary is that since so many use gmail, you'd think that spammers would automatically remove the +slashdot part pretty soon.
      - Re: (Score:2)
        
        by jon3k ( 691256 ) writes:
        
        Entirely possible - but here's something cool. I have Google for Your Domain set up for a personal domain. I just tested it, and I was able to send an email to: jon+test@[mydomain].com. Now there's no easy way for a spammer to know if Google is handling my mail, so they'd have to assume that the + was a legitimate character. I mean, in theory, they could look up the MX records and if they point to google, strip the +[characters up to]@ off, but I seriously doubt many, if any at all, would do this.
Pre-compiled regex. (Score:2)

by viperidaenz ( 2515578 ) writes:

A project I worked on many years ago re-wrote a monitoring system in Java.
It was Perl, running a rather large list of regex's over syslog files.
The process of converting it to Java resulted in a 100x speed up - despite Perl possibly having a faster regex implementation. The regular expressions are compiled once on start-up. Regular expressions can be very fast - they're just slow to parse and compile.
- Re: (Score:2)
  
  by zippthorne ( 748122 ) writes:
  
  If you're going to leave the process running, perl can compile the regexes ahead of time, too...
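The compile-once point in this sub-thread applies in any language with a regex library. A rough sketch in Python, assuming one pattern per line as in the submitter's spam-patterns file (the function names are illustrative only):

```python
import re

def load_patterns(lines):
    """Compile each non-empty pattern once, at start-up."""
    return [re.compile(line.strip()) for line in lines if line.strip()]

def first_match(compiled, text):
    """Return the source of the first pattern that matches, else None."""
    for rx in compiled:
        if rx.search(text):
            return rx.pattern
    return None
```

This only pays off if the process stays alive between messages. A filter that is started fresh for every mail compiles the whole list again each time.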
Comparison of blacklists (Score:2)

by CBravo ( 35450 ) writes:

There is a comparison of blacklists: http://dnsbl.inps.de/analyse.cgi?type=monthly&lang=en [dnsbl.inps.de]
You get that much mail? (Score:2)

by Mysticalfruit ( 533341 ) writes:

You get so much mail so furiously that you can't suffer a 15 second delay? I presume you're talking about a personal mail server... if you're hosting mail for a 1000 people then yeah that's a problem.
- Re: (Score:2)
  
  by jon3k ( 691256 ) writes:
  
  If it's running for 15 seconds maybe it's just putting an annoyingly high load on the server. Also consider that for every legitimate mail, you could be getting a lot of spam. I know I would be annoyed if my CPU load shot up randomly every 5 or 10 minutes when a piece of spam came in.
- Re: (Score:2)
  
  by jonbryce ( 703250 ) writes:
  
  I get around 500 emails per day to my mail server of which maybe one or two are legitimate. A 15 second delay means a maximum theoretical capacity of 5760 emails per day before emails arrive at the server faster than the spam filter can process them. Even lower overall numbers will cause substantial bottlenecks at busy times of the day.
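The 5760 figure is just the number of seconds in a day divided by 15. A small sketch of that arithmetic, assuming the filter handles one message at a time (that assumption is not stated in the thread):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400

def max_messages_per_day(seconds_per_message):
    """Upper bound on throughput when messages are filtered one at a time."""
    return SECONDS_PER_DAY // seconds_per_message

def utilisation(messages_per_day, seconds_per_message):
    """Fraction of the day the filter is busy, averaged over 24 hours."""
    return messages_per_day * seconds_per_message / SECONDS_PER_DAY

print(max_messages_per_day(15))          # 5760
print(round(utilisation(500, 15), 3))    # 0.087
```

At 500 messages a day the filter is busy under a tenth of the time on average, which is why the bottleneck shows up in bursts rather than across the whole day.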
Gmail forwarder (Score:2)

by flyingfsck ( 986395 ) writes:

Just forward your mail through gmail. That way all the spam disappears and the NSA can get their data without trouble.
The collective, barking up the wrong tree together (Score:2)

by damn_registrars ( 1103043 ) writes:

We see people complaining about this problem a lot, and yet for some reason they are afraid to actually put energy into a real solution. Repeat after me : filters can never end spam. That's right, never. All your filters (same can be said for every filter, everywhere) do is encourage the spammers to make their spam more obfuscated to improve their odds of passing future filters. It is a huge waste of time and resources and it's an arms race that the spammers will win.

If you want to actually end spam,
- Re:Or... (Score:4, Insightful)
  
  by asmkm22 ( 1902712 ) writes: on Friday August 30, 2013 @09:33PM (#44721559)
  
  Which pretty much defeats the whole point of hosting your own email...
  
- Re: (Score:2)
  
  by FridayBob ( 619244 ) writes:
  
  And allow Google and the US government to scan all of my mail? No thanks. The same goes for Hotmail, Yahoo and any other commercial email service provider (and certainly those based in the US). These days it makes more sense than ever to maintain your own MTA.
  - Re: (Score:2)
    
    by vux984 ( 928602 ) writes:
    
    And allow Google and the US government to scan all of my mail?
    Well, routing it outside of gmail keeps Gmail's hands off it, well, the half that doesn't originate with them to start with. But I think you'll need to do quite a bit more to keep the US govt out of it.
    - Re: (Score:2)
      
      by FridayBob ( 619244 ) writes:
      
      Well, routing it outside of gmail keeps Gmail's hands off it, well, the half that doesn't originate with them to start with. But I think you'll need to do quite a bit more to keep the US govt out of it.
      Of course you're right about that, especially if I correspond with someone with e.g. a Gmail account. But why make it any easier for the NSA? It gets harder for them to listen in when I correspond with people who also maintain their own email servers, and a whole lot harder when those servers can also do automatic encryption. If what they're doing is unconstitutional and a violation of your privacy, why make life easy for them and knowingly play into their hands?
  - Re: (Score:2)
    
    by bmo ( 77928 ) writes:
    
    And allow Google and the US government to scan all of my mail?
    Where have you been for the past (looks at calendar) 25 years?
    The government archives every bit of email (and generic net traffic, for that matter) it can get its hands on. Remember Echelon? Remember Total Information Awareness? Snowden's "revelation" was something that everyone, who's paid attention for a few decades, already knew or assumed. The cost of a gigabyte is 3.5 cents retail these days. That's a lot of mail archiving for not a lot
- Re: (Score:2)
  
  by macraig ( 621737 ) writes:
  
  Gmail's spam detection is spectacular.
  Are you new here? Its false positives are equally spectacular. Some days I could swear there's some mean bored Google sysadmin who happens to be a less-than-friendly coworker from my past who's just sitting there randomly applying the Spam label to messages in my Inbox.
- - Re:Or... (Score:5, Informative)
    
    by bmo ( 77928 ) writes: on Friday August 30, 2013 @09:35PM (#44721573)
    
    Well, the OP wasn't exactly clear if it was just his personal account or whether it's a corporate server. My "least amount of work" thing is to forward every email address I have to gmail, and pull mail from there via imap. I get a few hundred spams a day just on one mail account, and I haven't lost any real mail due to Gmail's filtering.
    >postini
    You do realize that is being EOLed, yes?
    http://postini-transition.googleapps.com/ [googleapps.com]
    >why gmail filters better than postini
    Probably separate spam databases. Stuff like that happens. Gmail probably gets orders of magnitude more spam to "teach" the system.
    YMMV.
    Using grep and procmail is the stone-knives-and-bearskins approach to filtering. There are a lot of other filtering systems that will be much more efficient on Unix systems. He can begin by using greylisting to filter out the non-compliant "fire and forget" spambots and then filter the winnowed pile o' crap. At least greylisting's not server intensive (it throws the load back to the sender) since 5xx and 4xx errors are cheap.
    Also blocking mail from dynamic IPs is a good idea.
    At this point he can then run the mail through a series of weighted RBLs. Reach a certain score and it's tossed. That's the processor intensive bit, but it's at the end and non-intensive filtering has already happened.
    --
    BMO
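The weighted-RBL step described in this comment can be sketched roughly as follows. The list names, weights, and threshold are invented for illustration, and the actual DNS lookups are left out: the sketch only shows how a blocklist query name is formed (reversed IPv4 octets in front of the list's zone) and how hits are summed into a score.

```python
def dnsbl_query_name(ipv4, zone):
    """Build the DNS name to look up: reversed octets followed by the zone."""
    return ".".join(reversed(ipv4.split("."))) + "." + zone

def rbl_score(hits, weights):
    """Sum the weights of every list that reported the sending IP."""
    return sum(weights.get(name, 0.0) for name in hits)

def should_reject(hits, weights, threshold):
    """Toss the mail once the combined score reaches the threshold."""
    return rbl_score(hits, weights) >= threshold

# Hypothetical weights: a list you trust scores high, a sloppy one scores low.
WEIGHTS = {"trusted.rbl.example": 3.0, "so-so.rbl.example": 1.0}
```

With a threshold of 3.0, a hit on the so-so list alone does not reject the message, while a hit on the trusted list does.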
    
    - Re:Or... (Score:5, Interesting)
      
      by dbIII ( 701233 ) writes: on Friday August 30, 2013 @10:37PM (#44721821)
      
      and I haven't lost any real mail due to Gmail's filtering.
      Email from people at one site I look after used to vanish into a black hole at gmail until I convinced them to replace the GIF of their corporate logo attached to all their emails with a PNG version. That's some real mail lost due to gmail's filtering.
      
      IMHO it's better to do the filtering somewhere where you have access to the stuff that is discarded. False positives may be rare now but they still happen. That's why I like stuff such as MailScanner (open source wrapper for spamassassin+your choice of commercial antivirus and/or clamav+other open source stuff+distributed updating rulesets) run on site. There's plenty of others that give you this function including some of the commercial "appliances" and outsourced email filtering.
      Also blocking mail from dynamic IPs is a good idea.
      It used to be the case that one IP address I have a mail server on would get blocked for a couple of days every year because some idiot at a blacklist would load in an obsolete list of dynamic IP addresses from what is now a decade ago. As IPv4 addresses diminish expect the lists of dynamic addresses to become outdated very quickly.
      
      - Re: (Score:2)
        
        by dbIII ( 701233 ) writes:
        
        OK then - maybe dumped them into the recipient's spam folder, all I know is the recipients never became aware of the emails until the GIF files were changed to PNG. Maybe gmail policy has changed since or maybe they only said they looked in their spam folders but never actually did. Either way the emails were not getting read if they had a GIF on them.
        How do you think those IPs ended up on the list in the first place
        All your insulting bets are wrong, all that happened is some IP addresses got reassigned q
    - Re: (Score:3)
      
      by pongo000 ( 97357 ) writes:
      
      At this point he can then run the mail through a series of weighted RBLs.
      Fuck you and your RBLs. RBLs are a draconian solution that do immeasurable damage to those of us who (1) aren't spammers, and (2) choose to run our own mailservers on business-class IPs. I can't tell you how many times various IPs I use for outbound mail (I run several mailing lists) end up on an RBL for absolutely no fucking reason.
      Oh, because someone in the same /24 block sent spam? Really? That's a good reason to block an entir
      - Re: (Score:2)
        
        by jgrahn ( 181062 ) writes:
        
        Fuck you and your RBLs. RBLs are a draconian solution that do immeasurable damage to those of us who (1) aren't spammers, and (2) choose to run our own mailservers on business-class IPs. [...] Oh, because someone in the same /24 block sent spam? Really? That's a good reason to block an entire /24 subnet?
        That sucks. And more generally, I believe the anti-spam fundamentalists have caused as much damage to the use of reliable mail, as the actual spammers. They are much too willing to accept collateral damage. Sad, when you consider how much effort went into designing (a) guaranteed delivery or (b) guaranteed notification of non-delivery.
        For this particular situation: I accepted early on that many misconfigured mail servers "work" like you describe, so I make sure that my ISP lets me relay via them.
      - Re: (Score:2)
        
        by CBravo ( 35450 ) writes:
        
        I send many millions of emails a week for mailing lists. I never get blacklistings 'for nothing'. What is annoying about blacklists is that they see so many spammers that they fail to see that normal people make mistakes (a lot). Since it is 'one strike and you are out', it is difficult to fix things for a legitimate sender.
        
        Example: A medium size company mails its 2000 customers, this time in Finland, and we end up with a blacklisting. This customer comes up with a new list: The opt-outs and boun
      - Re: (Score:2)
        
        by jonbryce ( 703250 ) writes:
        
        RBLs block around 95% of my incoming mail with very minimal false positives. Sorry if you don't like them, but people use RBLs because they work.
      - Re: (Score:2)
        
        by bmo ( 77928 ) writes:
        
        Fuck you and your RBLs.
        Not every RBL is created equal. Anyone with half a brain knows which ones are good and which ones are so-so. The trick is to /weight/ them. Give a smaller score to the RBL you think makes mistakes.
        Go take your blind hate elsewhere.
        --
        BMO
        
        Re: (Score:2)
        
        by pongo000 ( 97357 ) writes:
        
        I apologize for the personal attack. Not sure why I did that. Guess I let my emotions get the better of me.
        Still, fuck RBLs. Sadly, many who should know better do not weight RBLs, and instead outright reject any mail that scores a hit. These operators are slowly destroying the email infrastructure by not only fragmenting and marginalizing the smaller email providers (including individuals who choose to responsibly run their own SMTP service), but by implicitly forcing individuals to seek mail services t
- - Re: (Score:2)
    
    by YukariHirai ( 2674609 ) writes:
    
    Is it? I've never had a false positive in all the years I've been using GMail.
    - Re:Or... (Score:4, Interesting)
      
      by Count Fenring ( 669457 ) writes: on Friday August 30, 2013 @10:08PM (#44721711) Homepage Journal
      
      Is it? I've never had a false positive in all the years I've been using GMail.
      That you noticed. There's a fairly high bias inherent there; it just has to not have hit something that was both noticeable and that you knew was incoming.
      
      - Re: (Score:2)
        
        by Nemyst ( 1383049 ) writes:
        
        In years of using the service I've never once been told that I'd missed an email by anyone I know. I'll take that as a sufficient confirmation that I haven't missed anything important. Things that I have missed which people haven't followed up on or notified me of are just about as good as junk mail anyway.
        
        Re: (Score:3)
        
        by CaptQuark ( 2706165 ) writes:
        
        Last night I sent my Gmail account an email from my ISP email system, then waited for it to show up. Nothing. So I resent it. Second time nothing.
        
        The email contained two screen captures I needed at the office. The subject line was "Steve on telework". Nothing obvious that would trip Postini's spam filter. It is now 24 hours later and neither has shown up. I wonder how many other emails I don't get.
        
        ~~
    - Re: (Score:2)
      
      by jonbryce ( 703250 ) writes:
      
      I got a lot of them. Mostly emails from things like newspapers that I specifically asked them to send me.
  - Re: (Score:2)
    
    by jon3k ( 691256 ) writes:
    
    Bullshit.
- Re: (Score:2)
  
  by CanadianMacFan ( 1900244 ) writes:
  
  You might also want to look at how patterns are added to the file too. If they are added to the end then the latest spam of the day message will need to parse all of the patterns until it hits the latest pattern. Of course ideally you might want to set something up that looks at the hits each pattern gets so that you could parse the most likely patterns first followed by the latest patterns.
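A rough sketch of the hit-count ordering suggested here, assuming fixed-string patterns that are tried one after another and stop at the first hit. The class and method names are made up for illustration.

```python
from collections import Counter

class OrderedPatterns:
    """Fixed-string patterns, tried in order, with a hit counter per pattern."""

    def __init__(self, patterns):
        self.patterns = list(patterns)
        self.hits = Counter()

    def match(self, text):
        """Return the first pattern found in text and record the hit."""
        for pattern in self.patterns:
            if pattern in text:
                self.hits[pattern] += 1
                return pattern
        return None

    def reorder(self):
        """Move the most frequently hit patterns to the front.

        list.sort is stable, so patterns with equal counts keep their order.
        """
        self.patterns.sort(key=lambda p: -self.hits[p])
```

The early exit only helps for spam. A legitimate message matches nothing, so it is still checked against every pattern whatever the order.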
- Re: (Score:2)
  
  by sumdumass ( 711423 ) writes:
  
  I second greylisting and using Spamhaus filter lists.
  It is well worth it. I had a 30 user environment receiving about 100 to 300 spam emails a day per user go to approximately 10 a piece making it through. Then when I activated the spam filtering it went to about 10 a week for about 1/3 of the users. The biggest problem is third party user machines being compromised and the spam being sent through their internet's email servers (grey listing doesn't stop legitimate servers and most big ISP's don't make it
  - Re: (Score:2)
    
    by dbIII ( 701233 ) writes:
    
    Yes, used greylisting for a couple of days and now am well aware of an inherent flaw that could cost the people who use it their jobs. Consider how it works and then consider that people at the top of organisations like to think of email as a nearly instant communications system and really don't like it when that last minute tender they've been working on all night gets delayed for half an hour (as is typical, or several hours with insane greylisting settings I've seen used) just because the person they se
    - Re: (Score:2)
      
      by sumdumass ( 711423 ) writes:
      
      Yes, used greylisting for a couple of days and now am well aware of an inherent flaw that could cost the people who use it their jobs. Consider how it works and then consider that people at the top of organisations like to think of email as a nearly instant communications system and really don't like it when that last minute tender they've been working on all night gets delayed for half an hour (as is typical, or several hours with insane greylisting settings I've seen used) just because the person they sen
      - Re: (Score:2)
        
        by dbIII ( 701233 ) writes:
        
        Please stop roleplaying someone stupid with your current game of presenting the incorrect suggestion that the greylisting time set on the receiving server doesn't matter. You're misleading the newbies who haven't worked out from your handle and posting history that you like to pretend to be dumb as shit and offer utterly stupid suggestions as if they are viable. People playing games like yours make it difficult to have an honest and factual discussion in this place.
        
        Re: (Score:2)
        
        by sumdumass ( 711423 ) writes:
        
        Please stop roleplaying someone stupid with your current game of presenting the incorrect suggestion that the greylisting time set on the receiving server doesn't matter.
        Lol.. Even if the particular grey listing you are using allows you to set the interval between reconnection attempts, you do not need to do so for grey listing to work. 90% or better spam email sent will never reconnect after being dropped once. The legitimate SMTPs will reconnect in a couple minutes if it is legit and the email will be rec
        
        Re: (Score:2)
        
        by nabsltd ( 1313397 ) writes:
        
        Please stop roleplaying someone stupid with your current game of presenting the incorrect suggestion that the greylisting time set on the receiving server doesn't matter.
        Unless you are insanely stupid and use a greylist wait of a couple of hours, it really doesn't matter. Typical values for the wait are in the sub-5 minute range. Since it's resource intensive to retry at intervals smaller than about 5 minutes, good client configurations aren't going to even get to the first retry until after the greylist wait has expired. But, based on my own logs, you could set the wait to as little as 10 seconds and still have an extremely effective front line defense against spam.
        Once
        
        Re: (Score:2)
        
        by dbIII ( 701233 ) writes:
        
        Now that's an interesting reading comprehension failure and blaming victims instead of perpetrators. The problems with greylisting impact the sender more than those that have implemented greylisting. An effective setup requires someone who at least knows as much that could be picked up from a quick read of the wikipedia article instead of the misleading rubbish a few posts above. Quick retries don't save you from idiots that tune their settings to reduce spam but forget the entire point of email, it just
        
        What is it with blaming the observer? (Score:2)
        
        by dbIII ( 701233 ) writes:
        
        So, this might be flamebait, but what it really comes down to is that your issues with greylisting were likely because you didn't do your homework
        No - I did my homework to find out exactly how somebody managed to fuck up communication and greatly delay messages from one end, and found that the answer was greylisting implemented very poorly at the remote end. My comment above is because I "did my homework" and observed the downsides. Those downsides are now listed in the wikipedia article.
        For the record of
      - Go to the definition instead of tricksters (Score:2)
        
        by dbIII ( 701233 ) writes:
        
        http://en.wikipedia.org/wiki/Greylisting [wikipedia.org]
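Since much of this sub-thread argues about how long the greylisting wait should be, here is a minimal sketch of the mechanism. It assumes the usual (client IP, sender, recipient) triplet as the key and keeps state in memory. The class name and the 300-second default are illustrative and not taken from any particular greylisting package.

```python
class Greylist:
    """Temp-fail the first attempt for a triplet; accept a later retry."""

    def __init__(self, min_delay=300):
        self.min_delay = min_delay   # seconds a sender must wait before retrying
        self.first_seen = {}         # triplet -> time of first attempt

    def check(self, client_ip, sender, recipient, now):
        """Return 'accept' or 'tempfail' (an SMTP 4xx reply) for this attempt."""
        key = (client_ip, sender, recipient)
        first = self.first_seen.setdefault(key, now)
        if now - first >= self.min_delay:
            return "accept"
        return "tempfail"
```

With `min_delay=10`, as suggested above, a compliant server that retries after a few minutes gets through on its second attempt, while a fire-and-forget spambot never comes back. A real deployment would also expire old entries and whitelist triplets that have already passed.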
- Re: (Score:2)
  
  by gringer ( 252588 ) writes:
  
  there are million dollar companies that can detect it faster and even better than your OSS bullshit half assed script for free
  The NSA, for example. Use a US server as your email service provider, and you get filtering for free!
- Re: (Score:2)
  
  by mishehu ( 712452 ) writes:
  
  If I had mod points I'd have given you a +1 as well. I've been using dspam for my own systems as well as clients' systems for years now, with MySQL as the backend (InnoDB tables though, not MyISAM). The only downside is that it can end up eating a fair amount of disk space, but it's extremely fast and highly accurate. Combine that with other methods like RBL, SPF checks, DK, etc., and I get but one false positive every 3 months or more, and one false negative every 6-12 months.
  SH
