Computationally Cheap Spam Filtering?
"Ideally, I'd prefer something that does reject the message if it's spam (SMTP result code 550 or something like that), unlike current SpamAssassin or spamprobe setups that accept the message and only later decide whether it's spam. Solutions like MAPS RBL, ORBS are acceptable, although commentary on their accuracy would be welcome. Other possibilities I've thought of include checksumming (Vipul's Razor or DCC) and simple header checks that could be implemented, for instance, in a sendmail milter.
Are several quick checks (DCC + RBL) accurate enough and still cheaper than one slow check (SpamAssassin, Bayesian filtering)? Does stacking of similar techniques improve accuracy significantly (DCC + Razor, RBL + ORBS)? How can the good but expensive techniques be made cheaper (SpamAssassin's spamproxyd, hashed wordlists for Bayesian filters, and so on)? Discussion on all these aspects would yield some interesting conclusions on quick and efficient spam filtering."
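For reference, an RBL-style check of the kind the question mentions costs one DNS lookup per connecting IP: reverse the IPv4 octets, append the list's zone, and see whether the name resolves. A minimal Python sketch, assuming a hypothetical list zone dnsbl.example.org (substitute whichever list you actually trust); the function names are made up for illustration:

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to look up: reversed IPv4 octets + list zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on bad input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the list publishes an A record for this IP."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        # Name did not resolve: not listed (or the lookup itself failed).
        return False


# Hypothetical zone name, for illustration only.
print(dnsbl_query_name("217.40.92.155", "dnsbl.example.org"))
```

Because the lookup only needs the client's IP, it can run before DATA, so a listed client can be refused with a 5xx without the body ever being read.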
FP? (Score:2, Insightful)
Solution (Score:4, Interesting)
Re:Solution (Score:2)
Your DNS is probably hosed. (Score:2)
I recommend you examine some log files (what a concept!) and do some tests of name resolution. The timeouts you describe are typical of a mailserver with a completely b0rked DNS.
You should always run a local name resolver on a mailserver anyway, with query access limited to 127.0.0.1 (loopback) so other hosts cannot use the machine as a nameserver. That way, you can set up dummy zones fo
Hardware Virus Checker (Score:5, Informative)
Return-Path:
Received: from vt.edu (gkar.cc.vt.edu [198.82.161.196]) by xxxx.xxxx.vt.edu (8.12.8/linuxconf) with ESMTP id h47JISRm004277 for ; Wed, 7 May 2003 15:18:28 -0400
Received: from steiner.cc.vt.edu ([10.1.1.14]) by gkar.cc.vt.edu (Sun Internet Mail Server sims.3.5.2001.05.04.11.50.p10) with ESMTP id for noone@xxxx.xxxx.vt.edu; Wed, 7 May 2003 15:18:31 -0400 (EDT)
Received: from aol.com (host217-40-92-155.in-addr.btopenworld.com [217.40.92.155]) by steiner.cc.vt.edu (Mirapoint Messaging Server MOS 3.3.2-CR) with SMTP id BIE36579; Wed, 07 May 2003 15:18:17 -0400 (EDT)
Date: Thu, 08 May 2003 03:13:26 -0800
From: Kate Welsh
Subject: [SPAM] Remember me?
To: spam@vt.edu
Message-id:
MIME-version: 1.0
X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2910.0)
Importance: Normal
X-Junkmail: UCE(58)
X-Priority: 3
X-Spam-Status: Yes, hits=13.0 required=5.0 tests=ALL_NATURAL,BASE64_ENC_TEXT,BIG_FONT,CLICK_
X-Spam-Flag: YES
X-Spam-Level: *************
X-Spam-Checker-Version: SpamAssassin 2.44 (1.115.2.24-2003-01-30-exp)
X-Spam-Prev-Content-
X-Evolution-Source: imap://jackie@localhost/
Re:Hardware Virus Checker (Score:2)
Re:Hardware Virus Checker (Score:1)
"X-Spam-Checker-Version: SpamAssassin 2.44 (1.115.2.24-2003-01-30-exp)"
Then....
Re:Hardware Virus Checker (Score:2)
Re:Hardware Virus Checker (Score:2)
Think I could get an "official" statement from MiraPoint that it uses SpamAssassin? We'd rather not wait for McAfee to get their act together...
Any contacts there I could call/email? (If you don't want to reply publicly I can be reached vi
Re:Hardware Virus Checker (Score:1)
Re:Hardware Virus Checker (Score:2)
Client side? (Score:2, Informative)
For Outlook users, I recommend Spammunition [upserve.com]. I just use Mozilla's spam filtering myself, which works great.
Client side: Eudora (Score:3, Informative)
Eudora users can use Spamnix [www.spamnix]. Works like a charm.
Re:Client side? (Score:1)
Do the authors get a special deal from M$ for not making their software work with express?
Re:Client side? (Score:2)
Very different programs in all but name.
Thanks! (Score:1)
Some Ideas (Score:2, Interesting)
What are your requirements? Do you have very limited hardware to work with? Do you need a particularly low latency for delivery? How many messages do you need to process per minute? (or per second)
If it's possible, having a separate spam filtering box might be a good idea. If that gets loaded down you could even make a cluster of them. I'm not sure that high-level spam filtering really takes as much CPU time as you'
Another cheap way.... (Score:3, Informative)
Re:Another cheap way.... (Score:1, Funny)
A few comments... (Score:5, Informative)
Re:A few comments... (Score:3, Interesting)
Um, don't even bother. Either filter and drop the spam, or just let everything through. Having someone go through all the marked spam messages is just as wasteful as going through the unmarked ones. If you're that afraid of dropping something, conside
Re:A few comments... (Score:5, Insightful)
Make sure that people know that they can (and probably will) lose legitimate email. Make sure there's a way to bypass the filters. For example, hold the email until you can confirm the sender (reply to the sender, and if your message bounces or isn't replied to in n days, delete). Let users set up their own configuration (scores, whitelists, etc.), but be able to override some things (e.g. don't let them blacklist internal mail).
Re:A few comments... (Score:3, Interesting)
I use spambayes [slashdot.org] for my spam filtering.
I get about 50 items classified as spam per day. These have a spam probability according to my spam and ham corpuses of > 90% (usually 100%).
However, I also get 2-6 things classified as *possibly* spam each day. These are things which have a spam probability of > 15%.
These get mixed up in among 200-400 other messages each day.
Once I got things set up, I have *never* had anything classified definitely spam which wasn't.
Most of t
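The three-way split described above (spam, possibly spam, everything else) comes down to two cutoffs on the filter's probability. A minimal Python sketch using the 90% and 15% figures from this comment; it is not spambayes code, and the names are made up for illustration:

```python
SPAM_CUTOFF = 0.90    # from the comment: "> 90%" is classified as spam
UNSURE_CUTOFF = 0.15  # from the comment: "> 15%" is classified as possibly spam


def classify(spam_probability: float) -> str:
    """Map a spam probability in [0, 1] to one of three folders."""
    if spam_probability > SPAM_CUTOFF:
        return "spam"
    if spam_probability > UNSURE_CUTOFF:
        return "unsure"
    return "ham"


for p in (1.0, 0.5, 0.02):
    print(p, classify(p))
```

Only the "unsure" folder needs a human look, which is the 2-6 messages a day mentioned above rather than the whole spam pile.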
Re:A few comments... (Score:2, Informative)
I was wondering this myself when I set up spamassassin on our mail server here (We're a web hosting provider using Communigate Pro). It filters mail for everyone by just changing the subject line to let them know that it's been marked as spam. From
Spam Assassin and/or Popfile (Score:5, Informative)
Anyway, for our 1500 users we use SpamAssassin with RBL and blacklists, and the load on our meager server (PIII 1.26GHz with 512MB RAM) doesn't even reach 0.20. The heuristics are turned down because of the processor usage, but it still filters about 90% of the spam with very little load.
I, personally, use Popfile (search Sourceforge) as my personal filter. Its database right now is not that big, just some 8MB, with over 200,000 emails since training (from my huge spam database) and normal usage over the past year for me and a dozen other users. Very easy to set up and use; you just need to train it with a good database. Its stats state that it has a 99.85% correctness rate. The machine has reached
Re:Spam Assassin and/or Popfile (Score:2)
I set up POPFile my last week at my old job. 20 users, 1000 emails a day, 80%+ of which was spam.
PIII 1GHz, 512M of RAM running Win2K server. Load on that box, despite running Exchange to boot, was maybe 5-6% CPU when idling, and their accuracy after only 2 months is 99.98%, and we've never even reset the statistics!
Another vote for Popfile... (Score:2)
Now that I'm looking to deal with all my mail from a server, I'm trying to find a way to use the Popfile filters I've so carefully trained over the last few months!
BTW, my Popfile's accuracy is also just under 99%.
Loadbalancing (Score:2)
Round-robin DNS, a load balancing machine, a firewall that can do the like (bigIP, yuck, I hate them).
Your question is geared towards SMTP, but it's generally a network service question and how to handle X amount of traffic with Y resources.
Cloudmark Authority - a consensus based blocker (Score:3, Informative)
I receive about 50-70 spam mails per day, and the client has been blocking 98 percent of them every day. I have been very impressed by it.
See if their server product is appropriate for you. It simply uses a consensus derived list from client users to block messages at the server. Kind of a blacklist thing.
Cuchullain
Re:Cloudmark Authority - a consensus based blocker (Score:1)
The problem is hard (Score:5, Informative)
1) Filtering spam is not trivial. A program that filters spam X% better than another program will be X^2% more complicated or worse.
2) You can't write a program that will filter perfectly. At best, all you can do is develop a set of heuristics that you hope aren't too complicated. The less complicated the heuristic, the fewer resources it will require.
3) There's a limit to how simple your heuristics can be.
4) The system of spam is not just the message: it's the spammer, plus the message, plus the recipient. This is because a certain message considered spam by some will not be considered spam by others. That means that the heuristics that account for the person reading the spam will be better than those that don't. The source of a spam is also important: a message consisting of a spam report to a spam newsgroup is not a spam, though it may contain a complete spam message.
5) The best spam filters will eventually be AIs that understand human language. That means that the ultimate spam filter will require enough processing power to model human cognitive abilities. In short, you're going to see an endless increase in the number of processor cycles consumed by spam filters, asymptotically approaching the requirements of a full-up human brain simulation.
On the other hand, this will sell a hell of a lot of computers.
Re:The problem is hard (Score:2)
What an amazing claim. Could you elaborate on this? For example, how do you apply Cantor's diagonalization argument to email?
Re:The problem is hard (Score:1)
At this point, Cantor's diagonalization is trivial.
Re:The problem is hard (Score:2)
You have an interesting definition of "trivial".
Ok, suppose I have a function which I claim distinguishes with perfect accuracy between spam and non-spam. How do you propose to construct a message which it mis-identifies?
Re:The problem is hard (Score:1)
Suppose you constructed a message that said in part: 'You have an interesting definition of "trivial".' You ran this through your perfect classification function and it said "NOT SPAM".
I received the message and said to myself "well, look at that spam." Clearly the spam classifier mis-identified the spam as being ham, because I would consider it spam.
I'm not trying to duck your question. OK, well maybe I am. I spent a bit of time thinking about it, and it seems that dia
Re:The problem is hard (Score:2)
Re:The problem is hard (Score:2)
That may not be true, and it goes to the same issue of definition of spam that makes this not an example of the halting problem. Since the final definition of spam is in the eye of the beholder, only a perfect model of a particular person's cognitive processes will be able to definitively distinguish spam from non-spam
Most effective spam filtering I know of (Score:2)
time, and they catch about 95 percent of it.
Enjoy:
If From/Sender or Subject contain:
post-line, yahoo, mail, (your account name),
postforme, photos, hot, degree, earthlink, aol,
opt, cum, young, hollywood, notme, naked, penis,
bigger, usa, model, women, girl, slut, prize, won,
msn, horny, dirty, gang, where, winner, price,
teen, printer
Move to folder spam.
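
A rough Python sketch of the rule above, assuming plain case-insensitive substring matching on the From, Sender and Subject headers (the comment does not say how its mail client matches). The function name and the account_name parameter are made up for illustration:

```python
from email import message_from_string

# Word list copied from the rule above; "(your account name)" is passed in.
KEYWORDS = [
    "post-line", "yahoo", "mail", "postforme", "photos", "hot", "degree",
    "earthlink", "aol", "opt", "cum", "young", "hollywood", "notme",
    "naked", "penis", "bigger", "usa", "model", "women", "girl", "slut",
    "prize", "won", "msn", "horny", "dirty", "gang", "where", "winner",
    "price", "teen", "printer",
]


def goes_to_spam_folder(raw_message: str, account_name: str = "") -> bool:
    """True if From/Sender or Subject contains any listed word."""
    msg = message_from_string(raw_message)
    words = KEYWORDS + ([account_name.lower()] if account_name else [])
    fields = " ".join(
        str(msg.get(header, "")) for header in ("From", "Sender", "Subject")
    ).lower()
    return any(word in fields for word in words)


sample = "From: Kate Welsh <kate@aol.com>\nSubject: Remember me?\n\nhello"
print(goes_to_spam_folder(sample))
```

Substring matching is cheap but blunt: "mail" also hits any subject mentioning a mailing list, and "where" hits ordinary questions, so false positives are built in.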
You'll get some false positives once in a great
while, but it's nice to have all your spam in
one folder and we
Easiest Solution: (Score:1)
I've got a better solution (Score:3, Informative)
I've put together the beginnings of an alternate proposal [bitshift.org], which draws on some of the good aspects of the above approaches, without the need to rewrite SMTP. It's a community-based, peer-based approach that leaves the power in the hands of the operator. Plus, there's no profit motive (except that it's in an operator's best interest, and thus the corporate owner's best interest, to maintain his/her server's level of trust).
Here's the best: (Score:2)
Re:Here's the best: (Score:2)
One very fast check is extremely effective: (Score:3, Insightful)
(I wish I had thought of this, but Russell Nelson did.)
Bogofilter rocks! (Score:3, Informative)
Was pretty happy with spamassassin, but our mailserver was crumbling under the load.
Switched to bogofilter [sourceforge.net] and, after a training period, we're now getting better accuracy (97.6%) with spam recognition than we did with SpamAssassin, with MUCH reduced server load.
Re:Bogofilter rocks! (Score:1)
I've used (Score:2, Informative)
$300 is cheap? Pass the caviar! (Score:1)
"When I was a boy, you could get a Baby Ruth bar for a nickel, and it was as big around as your leg."
Re:$300 is cheap? Pass the caviar! (Score:1)
The "real world" sometimes requires you to actually spend money. Sometimes paying for things is cheaper than trying to piece together a bunch of other unrelated packages.
Multi level approach (Score:4, Insightful)
EHLO your.machine.ip.address
or
EHLO your.machine.name then it IS a spammer. Reject now. There are some patches and configurations for Postfix so you can declare that RCPT from certain domains like yahoo and hotmail be verified to have a hotmail EHLO that properly resolves. This is more expensive since a DNS lookup is required, but the result will probably be cached locally pretty quickly.
You can also unceremoniously drop any connection that starts pipelining before you say it is OK to pipeline and any EHLO that has an illegal hostname.
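The EHLO tests above need no DNS and no message body, so they are about the cheapest filter available. A minimal Python sketch, assuming you know your own server's names and IPs; it returns an SMTP-style reply line instead of dropping the connection. The function name, reply texts and example hosts are made up for illustration, and the pipelining check is not covered:

```python
import re

# One dot-separated label: letters, digits, hyphens; no leading/trailing hyphen.
_LABEL = re.compile(r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)$")


def helo_verdict(helo: str, my_names: set[str], my_ips: set[str]) -> str:
    """Return an SMTP-style reply line for a HELO/EHLO argument."""
    name = helo.strip().rstrip(".").lower()
    if name.strip("[]") in my_ips:
        return "550 HELO claims to be this server's address"
    if name in {n.lower() for n in my_names}:
        return "550 HELO claims to be this server's name"
    if name.startswith("[") and name.endswith("]"):
        return "250 ok"  # address literal such as [198.51.100.7]
    if not all(_LABEL.match(label) for label in name.split(".")):
        return "550 HELO hostname is not syntactically valid"
    return "250 ok"


# Hypothetical server identity, for illustration only.
MY_NAMES = {"mx.example.com"}
MY_IPS = {"192.0.2.1"}
print(helo_verdict("mx.example.com", MY_NAMES, MY_IPS))
print(helo_verdict("mail.sender.example", MY_NAMES, MY_IPS))
```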
This, at least, reduces the work your scanning engines will have to do. Still, even if you catch nearly all the spam with the easy checks you will only reduce your mail volume by ~40% (current estimated overall spam volume) so that leaves you with 60% to scan.
I suppose your main MX could do the easy checks then send the remainder off to as many round-robin scanners as necessary which in turn could pass the mail on for delivery.
One starts to realize why some places just roll over and pay tens of thousands of dollars to someone else to do it for them.
Sendmail patches / config? (Score:2)
Re:Sendmail patches / config? (Score:2, Informative)
Re:Multi level approach (Score:2)
Dropping connections like this is not a good thing, since the other party (depending on the implementation, of course) will assume that the connection failed due to network problems and will reconnect after some time-out.
This may effectively drain more resources than you were trying to save. Always send a 5xx return code (permanent error) to the server,
Excellent point (Score:2)
I would start with a static domain-based blocking scheme. It requires a bit of maintenance (I need to add 10 or so domains/week), but I reject a LOT of mail with no false positives.
Then use a more computationally intensive filter to catch what gets past the domain-based blocker. Potentially tie them together. (Have the computationally intensive checker make a list of domains. Then you can checkmark ones you want to block. I get legit mail from Yahoo users, so I can'
Run spamd/spamc version of SpamAssassin (Score:5, Interesting)
People report processing times in the range of 0.2 to 0.5 seconds per message with basic tests (no pyzor 2). Get a fast machine with dual processors, plenty of RAM, a caching DNS server, set spamd/spamc to have an appropriate number of child processes, and you should be good to go.
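A rough way to size the number of spamd children from the per-message times quoted above: the average number of busy children is the arrival rate multiplied by the scan time. A small Python sketch; the message rate and the 1.5x headroom factor are assumptions for illustration, not measurements:

```python
import math


def children_needed(msgs_per_second: float, seconds_per_msg: float,
                    headroom: float = 1.5) -> int:
    """Estimate scanner processes needed to keep up with the arrival rate."""
    busy_on_average = msgs_per_second * seconds_per_msg
    return max(1, math.ceil(busy_on_average * headroom))


# Hypothetical peak of 10 messages per second, at the two quoted scan times.
for scan_time in (0.2, 0.5):
    print(scan_time, children_needed(10, scan_time))
```

This ignores bursts and slow network tests, so treat the result as a starting point for the child-process setting rather than a guarantee.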
It's certainly going to be cheaper than the sexual harassment lawsuit that one of those 50,000 users is going to file for being forced to look at pornographic material (we require employees to read their e-mail, don't you?).
Re:Run spamd/spamc version of SpamAssassin (Score:1, Informative)
Re:Run spamd/spamc version of SpamAssassin (Score:2)
You mean in length? I have SpamAssassin processing many many multimegabyte mails without timing out...
Why not just... (Score:1)
Tarproxy (Score:2, Informative)
I can only think that commercial spam filtering companies are terrified of it, thus are somehow keeping it out of the public eye.
Answers (Score:1, Interesting)
Multiple Filters? (Score:1)
is this too complicated to implement?
Re:Multiple Filters? (Score:1)
we use it, no problem (Score:1)
amavisd-new+spamassassin+clamav (Score:2, Informative)
The frontend mail servers are running amavisd-new which is configured to use spamassassin and clamav. You can use DNS RR or just have multiple MX recs to load balance as many of these filtering servers as you need. Our filtering servers are cheap XP2100+s (w/1GB of ram) in a rack mount case that cost us ~$650 each. Amavis is just
Re:amavisd-new+spamassassin+clamav (Score:2, Informative)
Re:amavisd-new+spamassassin+clamav (Score:1)
Re:amavisd-new+spamassassin+clamav (Score:1)
Filter via proxy, not LDA (Score:4, Insightful)
At the SMTP server
At the SMTP Filter Proxy Server or LDA
Just remember to shortcut the process along the way. If email can be dropped or tagged for any reason, do so immediately and quit processing it.
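A minimal Python sketch of that shortcut: order the checks from cheapest to most expensive and stop at the first one that reaches a verdict. The two checks, the blocklist and the message fields are hypothetical stand-ins, not any real filter's API:

```python
from typing import Callable, Optional

# A check returns a verdict string, or None when it has no opinion.
Check = Callable[[dict], Optional[str]]

BLOCKED_DOMAINS = {"spam.example"}  # hypothetical static blocklist


def domain_check(message: dict) -> Optional[str]:
    """Cheap: one set lookup on the sender's domain."""
    domain = message["sender"].rpartition("@")[2].lower()
    return "reject" if domain in BLOCKED_DOMAINS else None


def content_check(message: dict) -> Optional[str]:
    """Stand-in for an expensive scanner such as a Bayesian filter."""
    return "tag" if "remember me?" in message["subject"].lower() else None


def run_checks(message: dict, checks: list[Check]) -> str:
    """Run checks in order and stop at the first one that gives a verdict."""
    for check in checks:
        verdict = check(message)
        if verdict is not None:
            return verdict
    return "accept"


msg = {"sender": "kate@spam.example", "subject": "Remember me?"}
print(run_checks(msg, [domain_check, content_check]))
```

Mail rejected by the first check never reaches the expensive one, so the slow scanner only pays for what the quick tests could not decide.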
Re:Filter via proxy, not LDA (Score:1)
Re:Filter via proxy, not LDA (Score:2)
Why are you passing by value instead of passing by reference?
That is to say, why are you sending an attachment instead of a URL or some other pointer to the file?
Chewie did say "... if possible". That hardly sounds insane to me.
Re:Filter via proxy, not LDA (Score:1)
user education (Score:4, Interesting)
You could take some steps on the user education side of things. Before being given an account, they should learn a few things about how to keep their address safe, like:
Also, if you're working for an organization which may want to expose user addresses to the internet via a web site, you may want to work with the web master and legal to create a click-through agreement that would stop spam harvesting robots while only requiring a couple extra clicks for the legitimate public. Or work with the web master to create a standard human-only readable way to post email addresses, e.g. "email lauren at our domain of example.com".
You may wish to register an additional domain or two to provide disposable email address services to your users.
Consider a piece of software that blocks IPs attempting to brute-force email addresses. Some filter monitoring the logs for excessive bounces from an IP and passing it to the firewall would work. I don't know of any examples of this software, but if you're doing a large email service you may get these kinds of attacks.
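A sketch of what such a monitor might look like in Python: count rejected recipients per IP inside a sliding time window and flag the IP once it passes a limit. How the events are pulled out of the mail log and how the block reaches the firewall are left out; the class name, limit and window are assumptions for illustration:

```python
from collections import defaultdict, deque


class BounceMonitor:
    """Flag IPs that cause too many rejected recipients in a time window."""

    def __init__(self, max_bounces: int = 20, window_seconds: float = 600.0):
        self.max_bounces = max_bounces
        self.window = window_seconds
        self._events = defaultdict(deque)  # ip -> timestamps of recent bounces

    def record(self, ip: str, timestamp: float) -> bool:
        """Record one bounce; return True if the IP should now be blocked."""
        events = self._events[ip]
        events.append(timestamp)
        # Forget bounces that have fallen out of the window.
        while events and events[0] <= timestamp - self.window:
            events.popleft()
        return len(events) > self.max_bounces


monitor = BounceMonitor(max_bounces=3, window_seconds=60)
for second in range(5):
    print(second, monitor.record("198.51.100.9", float(second)))
```

A real deployment would also want an expiry on the firewall rule, so an address that stops misbehaving (or a shared relay) is not blocked forever.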
followup:user education (Score:2)
To my fourth paragraph: Spamgourmet is apparently a sourceforge project [sourceforge.net]
Tips for Sendmail configuration? (Score:2)
Re:Tips for Sendmail configuration? (Score:1)
INPUT_MAIL_FILTER(`spamassassin', `S=local:/var/run/spamassassin.sock, F=, T=C:15m;S:4m;R:4m;E:10m')
INPUT_MAIL_FILTER(`mimedefang', `S=unix:/var/spool/MIMEDefang/mimedefang.sock, F=T, T=S:60s;R:60s;E:5m')
define(`confPRIVACY_FLAGS', `authwarnings,novrfy,noexpn,restrictqrun')dnl
Redundancy checking (Score:1)
SMTP rejecting of spam considered harmful (Score:3, Interesting)
False positives can be more harmful than messages getting through the spam filter.
BAD (Score:2)
If it's not directly aimed at you, you DO NOT delete or reject it. PERIOD. You tag it. Maybe you even quarantine it. But you DO NOT reject it out of hand.
use decision trees (Score:2)
If you want computationally efficient methods for detecting spam, look into decision trees (search on Google for decision trees and spam filtering). If you set them up properly, they result in a sequence of simple tests like "Is this addressed to me?", "Does the subject line contain the word 'penis' or 'breast'?", etc. Like
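A hand-written Python sketch of such a tree, using the two example tests from this comment. A real tree would be learned from a labelled corpus; the branch order and leaf labels here are assumptions picked for illustration:

```python
SUSPECT_WORDS = ("penis", "breast")  # the two words quoted in the comment


def tree_classify(to_addrs: list[str], subject: str, my_addr: str) -> str:
    """Walk a two-level tree; each node is one cheap test."""
    if my_addr.lower() in (addr.lower() for addr in to_addrs):
        # Branch 1: addressed to me directly.
        if any(word in subject.lower() for word in SUSPECT_WORDS):
            return "spam"
        return "ham"
    # Branch 2: not addressed to me (Bcc, list mail or bulk mail).
    if any(word in subject.lower() for word in SUSPECT_WORDS):
        return "spam"
    return "unsure"


print(tree_classify(["spam@vt.edu"], "Remember me?", "jackie@example.org"))
```

Each message costs at most a couple of string tests, which is why the approach is attractive when CPU is the constraint.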