Smart Spam Filtering For Forums and Blogs?
phorm writes "While filtering for spam in email and related media seems to be fairly productive, there is a growing issue with spam on forums, message boards, blogs, and other such sites. In many cases, sites use prevention methods such as captchas or question-and-answer challenges to try to restrict input to human visitors. However, even with such safeguards (and especially with most forms of captcha being cracked fairly often these days), spammers are becoming an increasing nuisance in this regard. While searching for plugins or extensions to SpamAssassin and the like, I have had little luck finding anything not tied into the email framework. Google searches for PHP-based spam filtering tend to come up with mostly commercial and/or email-related filters. Does anyone know of a good system for filtering spam in general messages? Preferably such a system would be FOSS, with a daemon component (accessible by port or socket) to offer quick response times."
mollom not so free (Score:3, Interesting)
Mollom [mollom.com]
I discovered this one through Drupal. I thought it was completely free, but apparently it isn't for high-traffic sites.
I think all your user-generated content is sent to them and checked for spamminess against the other submissions they are receiving, and they give you back a rating.
DIY or it will be broken (Score:5, Interesting)
Any method you use can be broken. Your only chance is to reduce the likelihood that your site is worth the effort.
Basically, if you use a common solution, whether FOSS or commercial, then a thousand other sites will be using it too. This attracts attackers because they know that once they break it, they can reuse the attack everywhere.
However, if you handcode something, no matter how primitive, it likely lasts a lot longer because nobody bothers hacking into your site...
Of course that doesn't work if you have a large site like MySpace; there, a single site is worth the effort by itself.
Anyway, two things usually work: a really fast-moving animated GIF, and silly challenges where you ask people to identify items.
I help out with a site that randomly picks five pictures of cats and dogs and asks you to identify which image contains the most kittens... We barely ever get spam through, and with almost 20K attempted submissions by non-humans a day, that makes us pretty happy.
Peter.
4 Tests Stopped 30,000 Comments For Me (Score:5, Interesting)
Test one: does the last name equal the first name? For some reason almost all spammers do this.
Second, do they use a keyword from a list of about 15 words?
Third, do they fill out a hidden input box? This is sort of the reverse captcha.
Finally, do they use more than 4 "http" in a post? Almost all comment spam is an SEO effort to increase their PageRank.
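For anyone who wants to try the four tests above, here is a minimal sketch in Python. The function name, the field names and the three sample keywords are all made up for illustration; the poster's actual list of about 15 words is not given.

```python
import re

# Hypothetical keyword list; the poster's real ~15 words are not given.
SPAM_KEYWORDS = {"viagra", "casino", "payday"}
MAX_HTTP = 4  # reject when a post contains more than this many "http"


def looks_like_spam(first_name, last_name, body, hidden_field):
    """Return True if any of the four tests fires."""
    first = first_name.strip().lower()
    last = last_name.strip().lower()
    # Test 1: the last name equals the (non-empty) first name.
    if first and first == last:
        return True
    # Test 2: the body uses a word from the keyword list.
    words = set(re.findall(r"[a-z0-9']+", body.lower()))
    if words & SPAM_KEYWORDS:
        return True
    # Test 3: the hidden input box, which humans never see, was filled in.
    if hidden_field.strip():
        return True
    # Test 4: more than four occurrences of "http" in the post.
    if body.lower().count("http") > MAX_HTTP:
        return True
    return False
```

The tests run cheapest-first and stop at the first hit, so a legitimate post costs four quick checks.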
HTTPBL (Score:2, Interesting)
Project Honeypot's HTTPBL has been good to me:
See: www.projecthoneypot.org/httpbl.php
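As far as I understand the published http:BL interface, a lookup is an ordinary DNS query for your access key, the visitor's IPv4 octets in reverse order, and the zone dnsbl.httpbl.org; a listed address answers with 127.<days>.<threat>.<type>. The sketch below only builds the query name and decodes an answer, with no network access. The function names are mine, and the details should be checked against the Project Honeypot documentation before relying on them.

```python
import ipaddress

HTTPBL_ZONE = "dnsbl.httpbl.org"


def httpbl_query_name(access_key, ip):
    """Build the DNS name to look up for an IPv4 visitor address."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join([access_key] + octets[::-1] + [HTTPBL_ZONE])


def parse_httpbl_answer(answer):
    """Decode an answer of the form 127.<days>.<threat>.<type>."""
    first, days, threat, visitor_type = (int(p) for p in answer.split("."))
    if first != 127:
        raise ValueError("not an http:BL answer")
    # Note: for search engines (type 0) the middle octets carry other
    # meanings, so only trust days/threat when type is non-zero.
    return {
        "days_since_last_activity": days,
        "threat_score": threat,
        "search_engine": visitor_type == 0,
        "suspicious": bool(visitor_type & 1),
        "harvester": bool(visitor_type & 2),
        "comment_spammer": bool(visitor_type & 4),
    }
```

In a real deployment the name would be resolved with something like socket.gethostbyname(); a lookup failure means the address is not listed.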
Re:Second that! (Score:5, Interesting)
Fast way is to slow it down (Score:4, Interesting)
The fastest way is probably to just slow down user registration. Permit anonymous posting, but make it moderated/screened by default (i.e. not visible to other users until the forum owner flags it as OK). When a user goes to register (so their posts become visible immediately), do not send them the confirmation e-mail immediately. Batch your confirmations up and send them out twice a day at odd times (i.e. not midnight and noon, but something like 3:47 am and 3:47 pm). You could do it 4 times a day, but not much faster than that, since the idea is to introduce a delay in the registration process. Make sure to tell the user on the registration screen what sort of time frame they can expect their confirmation to arrive in. Ordinary users who plan on using the forum long-term won't be inconvenienced much by this. Spammers... won't tolerate the delay; they want to get their message in fast and get out. With their automated scripts they might not even notice things are failing. Also, don't include a direct confirmation link in the e-mail. Include a URL to a form and make the user copy-and-paste the confirmation number from the e-mail. That'll be trivial for humans, but not easy for an automated script to handle without human assistance.
None of that will stop a determined spammer, but most of them are more interested in volume than anything else and they won't bother spending time/effort on just one forum when they could hit 10 others instead.
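A rough Python sketch of the batching idea, for illustration only: it works out when the next confirmation batch goes out (so the registration screen can tell the user what to expect) and generates a number to copy and paste. The two send times come from the example above; the function names and the 8-digit code length are my own assumptions.

```python
import secrets
from datetime import datetime, time, timedelta

# Odd send times, as in the example above: 3:47 am and 3:47 pm.
BATCH_TIMES = [time(3, 47), time(15, 47)]


def next_batch_time(now):
    """Return the datetime of the next scheduled confirmation batch."""
    for t in sorted(BATCH_TIMES):
        candidate = datetime.combine(now.date(), t)
        if candidate > now:
            return candidate
    # Nothing left today, so use the first batch tomorrow.
    tomorrow = now.date() + timedelta(days=1)
    return datetime.combine(tomorrow, min(BATCH_TIMES))


def new_confirmation_code():
    """Return an 8-digit number for the user to paste into the form."""
    return f"{secrets.randbelow(10**8):08d}"
```

A cron job (or similar) firing at the same two times would then mail out every pending code in one go.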
Better than Akismet? (Score:4, Interesting)
Arguably, it is Mollom. Especially if you are using Drupal.
Akismet is 'rotting on the vine' in many ways, including development updates. Mollom is a commercial web service, with a free version for non-profit and small-volume sites/users.
The Drupal module is explained here:
http://drupal.org/project/mollom [drupal.org]
The Mollom site:
http://mollom.com/ [mollom.com]
Re:DIY or it will be broken (Score:5, Interesting)
However, if you handcode something, no matter how primitive, it likely lasts a lot longer because nobody bothers hacking into your site...
Simply renaming the .php files worked 100% for me.
My 3 tests also work (Score:5, Interesting)
My rules are:
1) The text boxes for things like name and subject are actually given junk names.
2) There are hidden text boxes called name and subject (one hidden by JavaScript and one by CSS); if they are populated, the post is ignored.
3) A third hidden field holds the result of a simple JavaScript math equation, which is checked on the server side. If the value is wrong, the post is thrown out.
As others have said, if your site is small these types of things are good enough to prevent spam because the spammers won't bother to figure it out. These concepts would never work for any of the larger sites or 3rd party forum software.
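The server-side half of rules 2 and 3 could look roughly like this in Python. It is only a sketch: the hidden math field is called "check" here, which is a name I made up, and the form is assumed to arrive as a plain dict of strings.

```python
# Decoy fields with the obvious names; real inputs use junk names instead.
DECOY_FIELDS = ("name", "subject")


def accept_post(form, expected_answer):
    """Return True if the submitted form passes the decoy and math checks.

    form            -- dict of submitted field names to string values
    expected_answer -- the value the page's JavaScript should have computed
    """
    # Rule 2: a human never sees the decoys, so they must stay empty.
    for decoy in DECOY_FIELDS:
        if form.get(decoy, "").strip():
            return False
    # Rule 3: the hidden field filled in by JavaScript must be correct.
    # "check" is a hypothetical field name.
    if form.get("check", "").strip() != str(expected_answer):
        return False
    return True
```

A bot that fills every field it recognises trips rule 2, and one that does not run JavaScript trips rule 3.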
Pivot open source blogging (Score:3, Interesting)
The comment- and trackback-spam blocking techniques in Pivot blogging software are, from my limited personal experience, 100% effective. There's even an extension that uses the enormous Project Honeypot database (http:BL) to weed out IP addresses of identified harvesters and comment spammers. That's just for entertainment, though, since the basic techniques are completely effective.
How does Slashdot do it? (Score:2, Interesting)
I rarely see spam here...or is it just quickly modded down to oblivion?
Re:4 Tests Stopped 30,000 Comments For Me (Score:5, Interesting)
Background: One of my sites is a custom job which kills a spam comment every 3 seconds or so, and has done so consistently for the past four years.
OP's suggestions are very good, especially limiting the number of 'http's. We've given up on the keyword lists since they are costly to maintain and aren't as effective as some other methods.
Currently, the most effective kill rules for us are:
1) We write the client's IP address, the ID of the thing being commented on, and random stuff to a cookie from the legitimate page from which the client clicked the "post reply" link. If the IP address doesn't match, or if the ID is missing, or if the random junk isn't in the cookie, then fail. This rule traps non-browser scripts and limits spam throughput, but does not affect humans.
2) The client's IP address is a hidden form variable. If that IP address does not match the IP from which the POST originates, fail. This rule traps the browser-based scripts, and operators who proxy through botnets for testing.
These two rules catch all but about two spam-like messages a month (from spam operators not using proxies to test their scripts), and have mislabeled two legitimate messages (from a local ISP's poorly-configured proxy) in the last three years.
There are other things at play, such as salted hashes of the above, and some other heuristics on hidden and unused fields which sort and categorise the spam for our own research (including point of origin, topic, etc.). One finding is that IP/geographic blacklists are ineffective. I'll post new findings and methods in another two years.
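One way to combine the IP, the item ID, the random junk and a keyed hash into a single cookie or hidden-field value is an HMAC-signed token. The Python sketch below is my own guess at the general shape, not the poster's code; the secret, the "|" separator and the function names are all assumptions.

```python
import hashlib
import hmac
import secrets

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical server-side secret


def _sign(payload):
    return hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()


def issue_token(client_ip, item_id):
    """Create the value written to the cookie or hidden field on page load."""
    nonce = secrets.token_hex(8)  # the "random stuff"
    payload = f"{client_ip}|{item_id}|{nonce}"
    return f"{payload}|{_sign(payload)}"


def verify_token(token, post_ip, item_id):
    """Check the token that came back with the POST."""
    parts = token.split("|")
    if len(parts) != 4:
        return False
    ip, token_item, nonce, signature = parts
    if not nonce:
        return False
    expected = _sign(f"{ip}|{token_item}|{nonce}")
    # Constant-time comparison; reject anything that was tampered with.
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return False
    # The POST must come from the same IP and target the same item.
    return ip == post_ip and token_item == str(item_id)
```

Because the signature covers the IP and the item ID, a script cannot reuse one token across proxies or across threads.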
I'm also evil in that the apparent failure modes are non-deterministic, and include such things as random HTTP response codes, random modes of connection failure, and spam messages that apparently go through, but are only visible for the IP that posted them, or for one minute after they are posted.
Your move, "RosarioRush".
Re:DIY or it will be broken (Score:2, Interesting)
I'll second this.
My friend runs a smaller site and was having a problem with forum spam. He edited the registration page to include a checkbox that said something along the lines of "check this box if you are not a bot". His problems went away instantly. Obviously this does not scale well, but for smaller sites being targeted randomly by automatic spam crawlers, it appears to be very effective.
Re:the solution is here .. (Score:3, Interesting)
That's a non-issue.
Want to block a ton of spam? Reject any inbound SMTP connections that have no reverse DNS record, then use regular expressions on those that do to refuse connections from dynamic/home/dsl/dial_up/etc. hosts. (I tried to post the regexes, but Slashdot whined about "Lameness filter encountered. Post aborted!")
Stop talking to dynamic IPs and about 90% of the world's spam will immediately vanish.
Just disallow links........ (Score:1, Interesting)
I do something rather simple in my forums (about 30 of them) that seems to work very well: I disallow any user from posting a message with a URL in it until they've made a certain number of posts (usually about 20 or so). Until they have that many posts under their belt any post with a URL in it just returns them to a preview screen. Since their goal in life is to drop a link, this really frustrates the *&$%! out of them. :)
None of them so far has ever bothered to make 20 real posts in order to get by this limitation. On the rare occasion (and I do mean rare) that they have started to post stuff in order to work towards reaching the limit, they have their posts removed by mods or admins, which resets them back to zero.
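In Python the rule boils down to something like the sketch below. The threshold of 20 comes from the description above; the URL pattern and the function name are my own and would need tuning for a real forum.

```python
import re

MIN_POSTS_FOR_LINKS = 20  # "usually about 20 or so"
# Deliberately loose pattern: anything starting with http(s):// or www.
URL_PATTERN = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)


def may_publish(body, surviving_post_count):
    """Return False if the post should bounce back to the preview screen.

    surviving_post_count is the user's number of posts that mods have not
    removed, so deleting their posts resets their progress.
    """
    if surviving_post_count >= MIN_POSTS_FOR_LINKS:
        return True
    return URL_PATTERN.search(body) is None
```

Counting only posts that survive moderation is what makes the reset to zero automatic.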
Like I said, this really frustrates the *&$%! out of them. lol
At the risk of being labeled a spammer, you can see it in action here: www.grouptopic.com
Same with my contact forms- no links allowed. It just stops 'em dead, and if they REALLY need to send a link, they can contact me first and say so. Works like a charm.
Mike
Re:D.I.Y. (Score:2, Interesting)
This suggests a solution... Instead of using the web for comment submission: use SMTP.
A user who wants to submit a comment answers a captcha, and clicks a "submit" button.
An e-mail address is displayed for them to send their comment to.
They e-mail their comment, which goes to somemailboxname+blahblah@gmail.google.com
If Google doesn't consider it spam, then the message gets forwarded to a secret mailbox on the blog server.
A script running on the blog server parses the message, determines what the comment is, and which article to append to.
The script appends the comment.
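The parsing step on the blog server might be sketched like this in Python, using only the standard email package. One big assumption: the part after the "+" in the recipient address is taken to identify the article, which the description above does not actually say. Only plain single-part messages are handled.

```python
from email import message_from_string
from email.utils import parseaddr


def extract_comment(raw_message):
    """Return (article_id, comment_text), or None if the mail is unusable."""
    msg = message_from_string(raw_message)
    _, to_addr = parseaddr(msg.get("To", ""))
    local_part = to_addr.split("@", 1)[0]
    # Assumption: mailbox+<article id>@host tells us where to append.
    if "+" not in local_part:
        return None
    article_id = local_part.split("+", 1)[1]
    # This sketch skips multipart mail (HTML, attachments) entirely.
    if msg.is_multipart():
        return None
    body = msg.get_payload().strip()
    if not article_id or not body:
        return None
    return article_id, body
```

The returned text is still untrusted input and would need the usual escaping before it is appended to a page.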
Comment removed (Score:3, Interesting)