GlennLog

Turning technology from mumbo-jumbo into rich tasty gumbo

October 15, 2003

How to Beat Bayes

Spam is an adaptive virus: we only see the successes, as more and more filters wipe out the less adaptive versions. Lately, I've been seeing an increasing amount of spam that has passed through three layers of filtering, two of them involving Bayesian notions of word frequency. This new spam contains a bunch of randomly generated, word-length strings of text. The subject lines have punctuation inserted in strange places so that the words are still legible to a person, but they don't "read" as words to a filter. (Of course, an easy parsing solution is to normalize words and then run filters against them.)
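
The normalization idea in that parenthetical can be sketched in a few lines of Python. This is a minimal illustration, not what any particular filter does: the names `normalize` and `tokens` and the regular expression are assumptions made up for this example, and the rule is deliberately crude (it would also merge legitimately hyphenated words and mangle domain names).

```python
import re

# Hypothetical rule: any run of punctuation wedged directly between two
# word characters is treated as spammer noise and removed.
_INTRA_WORD_PUNCT = re.compile(r"(?<=\w)[^\w\s]+(?=\w)")


def normalize(text):
    """Remove punctuation embedded inside words, then lowercase."""
    return _INTRA_WORD_PUNCT.sub("", text).lower()


def tokens(text):
    """Split normalized text into simple alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", normalize(text))


print(tokens("Lo.w mort-gage r*a*t*e*s"))
# prints ['low', 'mortgage', 'rates']
```

The tokens that come out of this step, rather than the raw subject line, would then be handed to whatever word-frequency filter is in use.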

Obviously, this is the latest end run around the latest anti-spam innovation. It shows that Bayesian filtering, while a wonderful idea, has its limits because of spammers' cleverness and adaptability.

Ultimately, these exercises show that no matter what algorithm we use, spam will still filter through. (I'm still seeing Nigerian variants, which amazes me.) The next approach is going to be digital certificate-based: you can't forge those, and you prevent non-trusted sources from connecting. If you put certificates on the mail servers -- and make sure that VeriSign isn't the only company controlling the issuing of these certificates, but that non-profits and other organizations can be root certificate authorities -- then only mail servers configured with them will be able to exchange email with other servers.

It'll be tricky, but I believe the next change in the net will come that way. Technology and legislation aren't stopping spam. Digital certificates could dramatically reduce it because of the ability to revoke certificates, eliminating an entire mail server from a system without requiring a blacklist. (Yeah, and then who decides to revoke certificates? And on and on.)

Posted by Glennf at October 15, 2003 8:00 AM

Trackback Pings

Listed below are links to weblogs that reference How to Beat Bayes:

The spammers are gaining again from Compendium
Bayesian filtering has become the hot new thing in fighting spam. But as Glenn Fleishman writes, the spammers are adapting.

Tracked on October 16, 2003 6:45 AM

A rose by any other name from forebrain
GlennLog: "The next approach is going to be digital certificate-based: you can't forge those, and you prevent non-trusted sources from connecting." Ok, fine. Digital certificates are good. But how do you decide who's "trusted"? And why is that process any...

Tracked on October 16, 2003 7:50 AM

Comments

The strategy of filling up messages with random words is not really new, or effective. Every "Bayesian" classifier of my acquaintance looks at a subset of statistically interesting words in the message. Made-up words (being new to the corpus) are not interesting and thus don't figure into the calculation.

In HTML email, it's possible to break up real words with HTML comments. Any decent filter will strip out HTML comments to glue bisected words back together for exactly this reason.
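
As a minimal sketch of that comment-stripping step (the function name `strip_html_comments` is hypothetical, and a regular expression is a shortcut here; a production filter would more likely run the message through a real HTML parser):

```python
import re

# Non-greedy match so each comment is removed separately; DOTALL lets a
# comment span multiple lines.
_HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_html_comments(html):
    """Delete HTML comments so words split by them are rejoined."""
    return _HTML_COMMENT.sub("", html)


print(strip_html_comments("Buy V<!-- xyzzy -->iag<!---->ra now"))
# prints Buy Viagra now
```

Stripping has to happen before tokenizing, otherwise the classifier sees the harmless fragments "V", "iag", and "ra" instead of the one word it has learned to distrust.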

Mispunctuated or otherwise disfigured words are also not a viable spammer survival strategy. As Paul Graham put it [1], "'c0ck' is far more damning evidence than 'cock', and Bayesian filters know precisely how much more."

[1] http://www.paulgraham.com/spam.html

You've been pushing the digital-signature solution for quite some time [2], but I still don't see it in the cards. SpamSieve [3] is sufficiently effective that I don't worry about spam any more -- a few get through in a given week, while thousands don't.

[2]
http://blog.glennf.com/2001/03/25.html
http://blog.glennf.com/mtarchives/000557.html

[3] http://c-command.com/spamsieve/index.shtml

Even if I accept that current strategies are so flawed as to suggest a spam deluge in the offing, it's hard to get motivated about a solution that requires every email user, every sysadmin, and every trivial mail-sending Perl script on the entire planet to upgrade to software that doesn't yet exist. The scope of your proposal is ambitious beyond precedent.

A permanent spam solution has to find its own tipping point, or it's dead on arrival. I'm responsible for two mail servers (one of them on the Xserve that sits directly above isbn.nu, oddly enough) and to endorse a top-down solution like a PKI is to dictate the software choices of the correspondents of my mail servers' users. I'd feel obliged to fire myself.

Posted by: Nat Irons at October 16, 2003 10:53 PM
