Bayesian Filter Testing?
pu33y asks: "Since the publication of Paul Graham's A Plan For Spam, several programs that perform Bayesian filtering have become available, including CRM114 and Bogofilter. But missing is any serious testing to see how they perform in relation to themselves and to other, non-Bayesian filters. Searching Google has turned up nothing, and when I asked Paul Graham, he was unaware of any such testing as well. Can anyone point to any such testing or provide the results of their own personal experiences with Bayesian filters?"
DSpam (Score:4, Interesting)
Some impressive stats were posted to the mailing list.
Its main feature is that it's completely maintenance-free, and that even dumb people can use it (I know, I am).
My personal stats are currently 2 false positives (one from PayPal, one from a company I work with), 280 spams learnt (I told it they were spam), 2877 spams caught and 4354 innocent.
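To compare these counts with the percentages other posters quote, they can be turned into rates. This is a rough Python sketch, not anything DSpam itself reports. It assumes that the 280 "learnt" spams were ones the filter missed and that the 4354 "innocent" messages do not include the 2 false positives; the post confirms neither.

```python
# Counts quoted in the post above; how they are interpreted is an assumption.
false_positives = 2   # legitimate mail wrongly flagged as spam
missed_spam = 280     # spam the user marked by hand (assumed to be misses)
caught_spam = 2877    # spam flagged automatically
innocent = 4354       # legitimate mail delivered normally (assumed to exclude the 2 FPs)


def rate(part, whole):
    """Return part/whole as a fraction, or 0.0 when the denominator is empty."""
    return part / whole if whole else 0.0


spam_catch_rate = rate(caught_spam, caught_spam + missed_spam)
false_positive_rate = rate(false_positives, innocent + false_positives)

print(f"spam caught: {spam_catch_rate:.1%}")
print(f"legitimate mail misfiled: {false_positive_rate:.3%}")
```

Under a different reading of the counts (for example, if the 280 were an initial training set rather than misses), the catch rate would come out higher.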
Re:DSpam (Score:2)
My experience is that Bayesian filtering is extremely effective, far better than any other spam filtering I had tried before and far better than SpamAssassin before Bayesian filtering was added.
I was using SpamAssassin before Bayesian filtering was available, and I found that while it had been mostly effective, it was becoming less and less so even while I kept up with software upgrades. It was not uncommon for 5-10 spam mail to get through pe
Serious testing?? (Score:1)
Re:Serious testing?? (Score:2)
Re:Serious testing?? (Score:2)
Absofrigginlutely.
Mozilla Mail & News is watching over my mail, including upwards of a dozen mailing lists and works almost flawlessly. Especially good is the fact that I access my mail via IMAP from as many as six different Mozilla clients in various locations, and at this point they're all trained in my e-mail habits.
It took longer for me to train it, due to the fact that I'd previously kept my address(es) close to my chest, so my SPAM intake was perhaps 2-3 mess
Re:Serious testing?? (Score:1)
Online repository needed (Score:5, Interesting)
Something like, say, the UCI Machine Learning Repository [uci.edu]. In fact, look at the UCI spambase [uci.edu].
A couple of problems with the UCI spambase: it is too old / out of date, and too small.
It looks like there is a more recent community effort going on over at SpamArchive [spamarchive.org]
Looks like you should have googled [google.com].
Re:Online repository needed (Score:1)
Re:Online repository needed (Score:3, Insightful)
The SpamAssassin people have talked about this in the past. They have a corpus of sp
Re:Online repository needed (Score:2)
Please tell me you meant urologist. I don't wanna see the proctologist who gets those kinds of emails.
Re:Online repository needed (Score:1)
(Repeat after me, don't post while at work...)
Re:Online repository needed (Score:1)
The original question was regarding testing to see how they perform in relation to themselves and to other, non-Bayesian filters. So while it is of course best for you to test all of the different spam filters with your spam, it is not as practical as having each developer test their own spam filter against a common, known spam
SA Public Corpus (Score:1)
There is one, for exactly this reason -- the SpamAssassin public corpus [spamassassin.org]. I made it available for developers of spam tools to compare effectiveness using a good, recent corpus from 1 person's mail feed (as much as that was possible).
Here's the pertinent part of the README [spamassassin.org]:
Ella: OpenField Software (Score:2, Interesting)
I have had it for about 2 weeks. In the last 3 days I have had 2 false +'s (messages in Spam that shouldn't be there) and 4 that went to the newsletter folder that shouldn't have.
The good think about these tools (Score:3, Informative)
1) Gives the user the idea that he can improve the situation by taking some concrete action. Controlling future spam does not depend on some guru releasing a better filter or on him hacking up some better rules.
2) By definition, works better and better the more spam you get (and mark as spam). Even poor tools will eventually detect spam, since it's obvious to anyone reading spam that those messages tend to repeat and to be similar.
3) It's automagically customized to your own spam. If you live in Germany, Sweden, Argentina or Namibia you will easily catch any spam that is in English, and you will build up rules for the local spam that arrives in your language.
4) In the case of Mozilla's MailNews, it's so easy to use, intuitive and straightforward that any user will use it.
5) Makes you feel spams are useful for something: detecting future spams.
I think those advantages are far more important than the rate of effectiveness.
Re:The good think about these tools (Score:1)
Man, I never thought I'd agree that spam is good for anything, but I do wholeheartedly agree. I actually enjoy watching it go through its paces, moving and marking mail as spam. Makes me feel as if I'm accomplishing something.
I also understand I possibly need to get out of the house more.
Ja rulez (Score:1)
Personally, my white list and non-Bayesian rules eliminate 99.9% of the crap and abuse. However, sooner or later, ja rulez try to sort out a known recipient, which is where the white list shines.
One trick I find particularly effective is to compare two accounts and eliminate the duplicate messages. The othe
Re:Ja rulez (Score:3, Informative)
With Mozilla, you get the best of both worlds. You've got Bayesian filtering with an optional whitelist component. You can select any of your address books as the source of your whitelist (default is "Personal Addresses"), so any of your friends can send you all the SPAM the
Spambayes!!!! (Score:4, Informative)
I get about 150 spams a day, and about 5 hams. Spambayes might classify 1 spam as "unsure" and the rest as spam. The ham is always classified as ham.
My corpus is about 5000 spams, about 1000 hams. Get spambayes -- it's open source and it really works great.
Re:Spambayes!!!! (Score:2)
I'm sold...but wait, it's free!
Re:Spambayes!!!! (Score:1)
I haven't tested this against other filter programs but I'm not planning to at this point. I told my boss I'd test it for a month but after 1 week I'm already recommending it.
Thomas Bayes is my new favorite dead guy. I put a poster of Thomas Bayes up in my office and added the phrase "Spam Killer"
Hey everyone... (Score:4, Informative)
But missing is any serious testing to see how they perform in relation to themselves and to other, non-Bayesian filters.
Despite the call for your experiences, if you just want to post "X rocks!", I think the poster was looking more for "X rocks more than Y!", where both X and Y are Bayes-type filter programs. I don't think he was asking for just announcements that Bayes rocks; I think he or she already knows that.
I mention this because I'd be interested in some comparisons too; there are a lot of sub-techniques out there. Are there any real differences, or are they all effectively the same? The latter would strongly indicate that there may not be any real progress to be made, if the entire space of Bayes-type solutions has flat effectiveness, for instance. It's an interesting question.
POPFile rocks more than spambayes (Score:1)
I'm a very happy POPFile user that keeps checking out spambayes because the math sounds interesting.
spambayes has become quite good, but POPFile is phenomenal. Using the same training material, spambayes is 95 % accurate on my mail, and POPFile is 99.5 % accurate. Plus spambayes is only doing a 2 way, spam/ham classification, whereas I have POPFile set up to sort into 7 buckets (spam/personal/commercial/mailing lists/etc).
Though irrelev
Spambayes UI (Score:2)
While this is theoretically good design, especially in the open source community, it does often result in Some Shmoe creating the UI who should stick to coding sysadmin scripts.
Re:POPFile rocks more than spambayes (Score:1)
Mozilla's Junk-mail Filters (Score:3, Informative)
I think that one of the best things about Mozilla's system is that it's in the client, on my machine and under my control. While server-side solutions, distributed corpus tools, etc. might be more accurate, not ever having to install or update any 3rd-party apps is really nice.
--Asa
Ling Spam Corpus (Score:3, Informative)
Not Just for SPAM (Score:4, Insightful)
I figure, if the mail can be classified into many different categories, why not use Bayesian filtering for managing all your filtering needs?
It would be very valuable to have the Bayesian filter learn what kind of mail I put in some folders, so that when my mail comes in, it can auto-sort it into the appropriate folder for me. Trouble is, all the current implementations of Bayesian email filtering are a single test: SPAM/NOTSPAM. It would be nice to see an implementation that could take multiple corpora and use them to decide what the mail is. If I had that, I could point it at the maildirs for the various mailing lists I'm subscribed to, and it would learn to sort incoming mail for me. *sigh*
Re:Not Just for SPAM (Score:4, Informative)
Re:Not Just for SPAM (Score:1)
It is free, it is open source, it is a general classifier that can sort your inbound e-mail into any number of user-specified categories, or "buckets".
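For anyone wondering what a many-bucket classifier looks like in principle, here is a minimal Python sketch of multinomial naive Bayes with add-one smoothing. It is not taken from any of the tools named in this thread; the class name `BucketClassifier`, the tokenizer, the bucket names and the training messages are all made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict


class BucketClassifier:
    """Toy multinomial naive Bayes over any number of user-named buckets."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # bucket -> token -> count
        self.doc_counts = Counter()              # bucket -> messages trained
        self.vocab = set()                       # every token seen in training

    @staticmethod
    def tokenize(text):
        # Hypothetical tokenizer: lowercase runs of letters, digits, $, ' and -.
        return re.findall(r"[a-z0-9$'-]+", text.lower())

    def train(self, bucket, text):
        tokens = self.tokenize(text)
        self.word_counts[bucket].update(tokens)
        self.doc_counts[bucket] += 1
        self.vocab.update(tokens)

    def classify(self, text):
        """Return the bucket with the highest log-probability score."""
        total_docs = sum(self.doc_counts.values())
        if total_docs == 0:
            raise ValueError("classifier has not been trained")
        tokens = self.tokenize(text)
        best_bucket, best_score = None, -math.inf
        for bucket, n_docs in self.doc_counts.items():
            counts = self.word_counts[bucket]
            # Add-one smoothing; the extra +1 reserves a slot for unseen tokens.
            denom = sum(counts.values()) + len(self.vocab) + 1
            score = math.log(n_docs / total_docs)  # prior from bucket sizes
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_bucket, best_score = bucket, score
        return best_bucket


# Made-up training data: one call per message dragged into a folder.
clf = BucketClassifier()
clf.train("spam", "cheap pills buy now limited offer")
clf.train("spam", "buy cheap watches now")
clf.train("lists", "patch review for the kernel mailing list")
clf.train("personal", "dinner on friday with mom")

print(clf.classify("buy cheap pills now"))
print(clf.classify("kernel patch review"))
```

Pointing something like this at a set of maildirs would amount to calling `train` once per stored message with the folder name as the bucket. Real tools differ in tokenization, in how they treat headers, and in how they combine probabilities.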
Re:Not Just for SPAM (Score:2)
On a side note, perhaps the reason most filtering products use a spam/notspam model is because genuine mail is so easy to
BogoFilter (Score:4, Informative)
Some of the developers have done extensive testing: Greg Louis' Page [www.bgl.nu] has lots of information comparing different Bayesian approaches, different header processing, etc.
You could also read the mailing-list archives, or perhaps post some questions there [sourceforge.net].
PC mag test results (Score:1)
The latest PC Magazine [pcmagazine.com] has an article [pcmag.com] on alternative e-mail. Their Editors' Choice, Oddpost [oddpost.com] ($10/yr, free trial), uses Bayesian filters, and blocked 22 of 29 spam messages, and only legitimate e-mail ended up in their spam folder. Also worth noting is these are the results with minimal training, so, in theory Bayesian filters could quite possibly block virtually all e-mail with time.
Re:PC mag test results (Score:2, Funny)
Sounds like an ideal mail filter to me!
Re:PC mag test results (Score:1)
Re:PC mag test results (Score:1)
Try here (Score:3, Informative)
Graphs, methodology, links to more stats.
my simple filter (Score:3, Interesting)
These days, I'm on too many lists that don't filter spam, so I've had to resort to more sophisticated techniques, but someone who isn't on those sorts of lists might still find my oh-so-simple approach fairly effective. Not to disparage Bayesian filtering, but if you want something to compare against...
The 20 Newsgroups dataset (Score:1)
One good dataset is the 20 Newsgroups [mit.edu] dataset that is used by a Naive Bayes classifier called Rainbow (google for 'libbow'). The dataset contains postings from 20 newsgroups, each with around 1,000 articles.
Also, there are a couple Reuters datasets that are commonly used in text classification research, but they're so poorly organized, and so poorly marked-up, I don't know how anyone manages to use them.
the comments are missing the point... (Score:1, Informative)
It would be interesting if there were a generic test system that could be 'plugged in' to the various projects.
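One way such a pluggable test system might look, as a sketch only: each filter is wrapped as a function that takes a message and returns True for spam, and a shared harness runs it over a labelled corpus. Everything here (`Result`, `evaluate`, the two toy keyword filters and the four-message corpus) is hypothetical; a real comparison would need a large shared corpus and adapters for each project.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass
class Result:
    name: str
    false_positives: int = 0  # ham classified as spam
    false_negatives: int = 0  # spam classified as ham
    ham_total: int = 0
    spam_total: int = 0

    @property
    def false_positive_rate(self) -> float:
        return self.false_positives / self.ham_total if self.ham_total else 0.0

    @property
    def false_negative_rate(self) -> float:
        return self.false_negatives / self.spam_total if self.spam_total else 0.0


def evaluate(name: str,
             is_spam: Callable[[str], bool],
             corpus: Iterable[Tuple[str, bool]]) -> Result:
    """Run one filter over (message, truly_spam) pairs and tally its errors."""
    result = Result(name)
    for text, truly_spam in corpus:
        predicted = is_spam(text)
        if truly_spam:
            result.spam_total += 1
            if not predicted:
                result.false_negatives += 1
        else:
            result.ham_total += 1
            if predicted:
                result.false_positives += 1
    return result


# Made-up corpus and filters, only to show how the harness is plugged into.
corpus = [
    ("buy cheap pills now", True),
    ("FREE money click here", True),
    ("meeting moved to 3pm", False),
    ("free lunch in the kitchen today", False),
]
filters = {
    "keyword-free": lambda text: "free" in text.lower(),
    "keyword-buy": lambda text: "buy" in text.lower(),
}

results = {name: evaluate(name, f, corpus) for name, f in filters.items()}
for r in results.values():
    print(f"{r.name}: FP rate {r.false_positive_rate:.0%}, "
          f"FN rate {r.false_negative_rate:.0%}")
```

Keeping false positives and false negatives separate matters because most posters here care far more about losing real mail than about a stray spam getting through.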
Mozilla's Bayesian filtering works great (Score:1)
Between my two mailboxes, I receive about 100-150 spams a day. Over 90% of them are detected and are shunted into the Junk folder. Maybe 2-3 messages a month are false-positives. When it is wrong, I just teach it - click the trash button to toggle a message's junk status and Mozilla updates its filters in order to not make tha
20.000 mailboxes using, on 2% false positives (Score:1)
We also track the number of messages marked as "spam" and "good" by the users (those with accounts more than 3 months old).
The number we get is the one mentioned in the subject. That is, only 2% of the messages considered spam are later marked as "good" by users with accounts older than 3 months.
Re:20.000 mailboxes using, on 2% false positives (Score:1)
I wonder what it is?
Re:20.000 mailboxes using, on 2% false positives (Score:1)
it's bogofilter
Collaborative Filtering (Score:1)
Mail app in Mac OS X... (Score:1)
Re:Mail app in Mac OS X... (Score:1)
I use Mail.app in conjunction with hotwayd to read my hotmail account. Before doing this, my hotmail account was virtually unusable, requiring me to manually delete up to 50 SPAM messages every few days. Mail.app has reduced that to maybe 5 or 6 over the same timeframe, so for me it's around 90% with very few false positives (around 1% historically, which I expect to tend towards 0%).
Based on the random looking stuff in SPAM messages, spammers