A Plan for Spam
August 2002

*(This article describes the spam-filtering techniques used in the spamproof web-based mail reader we built to exercise Arc. An improved algorithm is described in Better Bayesian Filtering.)*

I think it's possible to stop spam, and that content-based filters are the way to do it. The Achilles heel of the spammers is their message. They can circumvent any other barrier you set up. They have so far, at least. But they have to deliver their message, whatever it is. If we can write software that recognizes their messages, there is no way they can get around that.

_ _ _

To the recipient, spam is easily recognizable. If you hired someone to read your mail and discard the spam, they would have little trouble doing it. How much do we have to do, short of AI, to automate this process?

I think we will be able to solve the problem with fairly simple algorithms. In fact, I've found that you can filter present-day spam acceptably well using nothing more than a Bayesian combination of the spam probabilities of individual words. Using a slightly tweaked (as described below) Bayesian filter, we now miss less than 5 per 1000 spams, with 0 false positives.

The statistical approach is not usually the first one people try when they write spam filters. Most hackers' first instinct is to try to write software that recognizes individual properties of spam. You look at spams and you think, the gall of these guys to try sending me mail that begins "Dear Friend" or has a subject line that's all uppercase and ends in eight exclamation points. I can filter out that stuff with about one line of code.

And so you do, and in the beginning it works. A few simple rules will take a big bite out of your incoming spam. Merely looking for the word "click" will catch 79.7% of the emails in my spam corpus, with only 1.2% false positives.

I spent about six months writing software that looked for individual spam features before I tried the statistical approach. What I found was that recognizing that last few percent of spams got very hard, and that as I made the filters stricter I got more false positives.

False positives are innocent emails that get mistakenly identified as spams. For most users, missing legitimate email is an order of magnitude worse than receiving spam, so a filter that yields false positives is like an acne cure that carries a risk of death to the patient.

The more spam a user gets, the less likely he'll be to notice one innocent mail sitting in his spam folder. And strangely enough, the better your spam filters get, the more dangerous false positives become, because when the filters are really good, users will be more likely to ignore everything they catch.

I don't know why I avoided trying the statistical approach for so long. I think it was because I got addicted to trying to identify spam features myself, as if I were playing some kind of competitive game with the spammers. (Nonhackers don't often realize this, but most hackers are very competitive.)

When I did try statistical analysis, I found immediately that it was much cleverer than I had been. It discovered, of course, that terms like "virtumundo" and "teens" were good indicators of spam. But it also discovered that "per" and "FL" and "ff0000" are good indicators of spam. In fact, "ff0000" (html for bright red) turns out to be as good an indicator of spam as any pornographic term.

_ _ _

Here's a sketch of how I do statistical filtering. I start with one corpus of spam and one of nonspam mail. At the moment each one has about 4000 messages in it. I scan the entire text, including headers and embedded html and javascript, of each message in each corpus. I currently consider alphanumeric characters, dashes, apostrophes, and dollar signs to be part of tokens, and everything else to be a token separator. (There is probably room for improvement here.) I ignore tokens that are all digits, and I also ignore html comments, not even considering them as token separators.

I count the number of times each token (ignoring case, currently) occurs in each corpus. At this stage I end up with two large hash tables, one for each corpus, mapping tokens to number of occurrences.
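As a rough illustration, the scanning and counting steps might look like this in Python. It is a sketch under stated assumptions, not the code behind the filter described here: the names `tokenize` and `count_tokens` are invented for the example, "alphanumeric" is taken to mean ASCII letters and digits, and HTML comments are matched with a simple non-greedy pattern.

```python
import re
from collections import Counter

# Token characters per the text: alphanumerics, dashes, apostrophes, dollar signs.
# Assumption: "alphanumeric" means ASCII letters and digits.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'-]+")
# HTML comments are removed outright, so they do not act as token separators.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def tokenize(message):
    """Split one raw message (headers included) into lowercase tokens."""
    text = COMMENT_RE.sub("", message)
    tokens = (t.lower() for t in TOKEN_RE.findall(text))
    # Tokens that are all digits are ignored.
    return [t for t in tokens if not t.isdigit()]


def count_tokens(corpus):
    """Count token occurrences across a corpus, treated as one long stream."""
    counts = Counter()
    for message in corpus:
        counts.update(tokenize(message))
    return counts
```

Calling `count_tokens` once on the spam corpus and once on the nonspam corpus would give the two tables. Because the comment is deleted before tokenizing, a word split by a comment, such as `he<!-- x -->re`, comes out as the single token `here`.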

Next I create a third hash table, this time mapping each token to the probability that an email containing it is a spam, which I calculate as follows [1]:

    (let ((g (* 2 (or (gethash word good) 0)))
          (b (or (gethash word bad) 0)))
       (unless (< (+ g b) 5)
         (max .01
              (min .99 (float (/ (min 1 (/ b nbad))
                                 (+ (min 1 (/ g ngood))
                                    (min 1 (/ b nbad)))))))))

where word is the token whose probability we're calculating, good and bad are the hash tables I created in the first step, and ngood and nbad are the number of nonspam and spam messages respectively.

I explained this as code to show a couple of important details. I want to bias the probabilities slightly to avoid false positives, and by trial and error I've found that a good way to do it is to double all the numbers in good. This helps to distinguish between words that occasionally do occur in legitimate email and words that almost never do. I only consider words that occur at least five times in total (actually, because of the doubling, occurring three times in nonspam mail would be enough).

And then there is the question of what probability to assign to words that occur in one corpus but not the other. Again by trial and error I chose .01 and .99. There may be room for tuning here, but as the corpus grows such tuning will happen automatically anyway.

The especially observant will notice that while I consider each corpus to be a single long stream of text for purposes of counting occurrences, I use the number of emails in each, rather than their combined length, as the divisor in calculating spam probabilities. This adds another slight bias to protect against false positives.
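For readers who don't read Lisp, here is one possible Python rendering of the per-token calculation. It is meant only as a sketch: `token_probability` and `build_probabilities` are names made up for the example, `good` and `bad` are assumed to be dict-like tables of counts, and `ngood` and `nbad` are assumed to be nonzero message counts.

```python
def token_probability(word, good, bad, ngood, nbad):
    """Return the spam probability for a token, or None if it is too rare."""
    g = 2 * good.get(word, 0)  # nonspam counts are doubled to bias against false positives
    b = bad.get(word, 0)
    if g + b < 5:              # same cutoff as the Lisp test (< (+ g b) 5)
        return None
    # The divisors are numbers of messages, not total corpus length.
    bad_freq = min(1.0, b / nbad)
    good_freq = min(1.0, g / ngood)
    # Clamp so a word seen in only one corpus gets .01 or .99 rather than 0 or 1.
    return max(0.01, min(0.99, bad_freq / (good_freq + bad_freq)))


def build_probabilities(good, bad, ngood, nbad):
    """Build the third table: token -> spam probability."""
    table = {}
    for word in set(good) | set(bad):
        p = token_probability(word, good, bad, ngood, nbad)
        if p is not None:
            table[word] = p
    return table
```

The clamp in the last line of `token_probability` is what produces .01 and .99 for words that occur in only one corpus.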

When new mail arrives, it is scanned into tokens, and the most interesting fifteen tokens, where interesting is measured by how far their spam probability is from a neutral .5, are used to calculate the probability that the mail is spam. If probs is a list of the fifteen individual probabilities, you calculate the combined probability thus:

    (let ((prod (apply #'* probs)))
      (/ prod (+ prod (apply #'* (mapcar #'(lambda (x) (- 1 x))
                                         probs)))))

One question that arises in practice is what probability to assign to a word you've never seen, i.e. one that doesn't occur in the hash table of word probabilities. I've found, again by trial and error, that .4 is a good number to use. If you've never seen a word before, it is probably fairly innocent; spam words tend to be all too familiar.

There are examples of this algorithm being applied to actual emails in an appendix at the end.

I treat mail as spam if the algorithm above gives it a probability of more than .9 of being spam. But in practice it would not matter much where I put this threshold, because few probabilities end up in the middle of the range.
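Put together, the scoring step could be sketched in Python roughly as follows. The function names are invented for the example, and two details are assumptions rather than anything stated above: each distinct token is counted once per message, and ties in "interestingness" are broken by the order in which tokens were first seen.

```python
import math

UNKNOWN_PROB = 0.4    # probability assigned to never-seen tokens
SPAM_THRESHOLD = 0.9  # mail scoring above this is treated as spam
N_INTERESTING = 15    # how many tokens take part in the final calculation


def combine(probs):
    """Combine individual probabilities as in the Lisp expression above."""
    prod = math.prod(probs)
    inverse = math.prod(1.0 - p for p in probs)
    return prod / (prod + inverse)


def spam_probability(tokens, table):
    """Score a tokenized message against a token -> probability table."""
    unique = list(dict.fromkeys(tokens))  # assumption: one vote per distinct token
    probs = [table.get(t, UNKNOWN_PROB) for t in unique]
    # Most interesting first: farthest from a neutral .5 (stable sort keeps ties in seen order).
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    return combine(probs[:N_INTERESTING])


def is_spam(tokens, table):
    return spam_probability(tokens, table) > SPAM_THRESHOLD
```

With an empty token list both products are 1, so the score falls back to a neutral .5.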

_ _ _

One great advantage of the statistical approach is that you don't have to read so many spams. Over the past six months, I've read literally thousands of spams, and it is really kind of demoralizing. Norbert Wiener said if you compete with slaves you become a slave, and there is something similarly degrading about competing with spammers. To recognize individual spam features you have to try to get into the mind of the spammer, and frankly I want to spend as little time inside the minds of spammers as possible.

But the real advantage of the Bayesian approach, of course, is that you know what you're measuring. Feature-recognizing filters like SpamAssassin assign a spam "score" to email. The Bayesian approach assigns an actual probability. The problem with a "score" is that no one knows what it means. The user doesn't know what it means, but worse still, neither does the developer of the filter. How many points should an email get for having the word "sex" in it? A probability can of course be mistaken, but there is little ambiguity about what it means, or how evidence should be combined to calculate it. Based on my corpus, "sex" indicates a .97 probability of the containing email being a spam, whereas "sexy" indicates .99 probability. And Bayes' Rule, equally unambiguous, says that an email containing both words would, in the (unlikely) absence of any other evidence, have a 99.97% chance of being a spam.
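The 99.97% figure can be checked by hand: (.97 × .99) / (.97 × .99 + .03 × .01) = .9603 / .9606. A few lines of Python, offered as a quick sanity check rather than as part of any filter, do the same arithmetic; the helper name `combine` is chosen for the example.

```python
def combine(probs):
    """Combine independent per-word spam probabilities into one."""
    prod, inverse = 1.0, 1.0
    for p in probs:
        prod *= p           # product of the probabilities
        inverse *= 1.0 - p  # product of their complements
    return prod / (prod + inverse)


both = combine([0.97, 0.99])  # "sex" and "sexy" together, no other evidence
print(f"{both:.2%}")
```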

Because it is measuring probabilities, the Bayesian approach considers all the evidence in the email, both good and bad. Words that occur disproportionately rarely in spam (like "though" or "tonight" or "apparently") contribute as much to decreasing the probability as bad words like "unsubscribe" and "opt-in" do to increasing it. So an otherwise innocent email that happens to include the word "sex" is not going to get tagged as spam.

Ideally, of course, the probabilities should be calculated individually for each user. I get a lot of email containing the word "Lisp", and (so far) no spam that does. So a word like that is effectively a kind of password for sending mail to me. In my earlier spam-filtering software, the user could set up a list of such words and mail containing them would automatically get past the filters. On my list I put words like "Lisp" and also my zipcode, so that (otherwise rather spammy-sounding) receipts from online orders would get through. I thought I was being very clever, but I found that the Bayesian filter did the same thing for me, and moreover discovered a lot of words I hadn't thought of.

When I said at the start that our filters let through less than 5 spams per 1000 with 0 false positives, I'm talking about filtering my mail based on a corpus of my mail. But these numbers are not misleading, because that is the approach I'm advocating: filter each user's mail based on the spam and nonspam mail he receives.

Essentially, each user should have two delete buttons, ordinary delete and delete-as-spam. Anything deleted as spam goes into the spam corpus, and everything else goes into the nonspam corpus.
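One way to picture the two-button idea is a small per-user store that each button feeds. The class and method names below are hypothetical, and the sketch assumes messages arrive already tokenized:

```python
from collections import Counter


class UserCorpora:
    """Per-user token counts fed by the user's own actions (hypothetical API)."""

    def __init__(self):
        self.good = Counter()  # tokens from mail not deleted as spam
        self.bad = Counter()   # tokens from mail deleted as spam
        self.ngood = 0         # number of nonspam messages seen
        self.nbad = 0          # number of spam messages seen

    def delete_as_spam(self, tokens):
        """Called when the user presses delete-as-spam."""
        self.bad.update(tokens)
        self.nbad += 1

    def mark_nonspam(self, tokens):
        """Called for an ordinary delete, or any other mail the user did not flag."""
        self.good.update(tokens)
        self.ngood += 1
```

The four fields are exactly the inputs the per-word probability calculation needs, so the table can be rebuilt from them as the user's corpora grow.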

You could start users with a seed filter, but ultimately each user should have his own per-word probabilities based on the actual mail he receives. This (a) makes the filters more effective, (b) lets each user decide their own precise definition of spam, and (c) perhaps best of all makes it hard for spammers to tune mails to get through the filters. If a lot of the brain of the filter is in the individual databases, then merely tuning spams to get through the seed filters won't guarantee anything about how well they'll get through individual users' varying and much more trained filters.

Content-based spam filtering is often combined with a whitelist, a list of senders whose mail can be accepted with no filtering. One easy way to build such a whitelist is to keep a list of every address the user has ever sent mail to. If a mail reader has a delete-as-spam button then you could also add the from address of every email the user has deleted as ordinary trash.
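A whitelist built this way needs very little machinery. The following is a minimal sketch with invented names; lowercasing addresses before comparing them is a simplifying assumption, not something the approach requires.

```python
class Whitelist:
    """Addresses whose mail is accepted without filtering (illustrative only)."""

    def __init__(self):
        self._addresses = set()

    @staticmethod
    def _normalize(address):
        # Assumption: compare addresses case-insensitively, ignoring stray whitespace.
        return address.strip().lower()

    def record_sent(self, to_address):
        """Remember an address the user has sent mail to."""
        self._addresses.add(self._normalize(to_address))

    def record_ordinary_delete(self, from_address):
        """Remember a sender whose mail was deleted as ordinary trash, not as spam."""
        self._addresses.add(self._normalize(from_address))

    def __contains__(self, address):
        return self._normalize(address) in self._addresses
```

A mail reader would check `sender in whitelist` first and run the content filter only when the check fails.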

I'm an advocate of whitelists, but more as a way to save computation than as a way to improve filtering. I used to think that whitelists would make filtering easier, because you'd only have to filter email from people you'd never heard from, and someone sending you mail for the first time is constrained by convention in what they can say to you. Someone you already know might send you an email talking about sex, but someone sending you mail for the first time would not be likely to. The problem is, people can have more than one email address, so a new from-address doesn't guarantee that the sender is writing to you for the first time. It is not unusual for an old friend (especially if he is a hacker) to suddenly send you an email with a new from-address, so you can't risk false positives by filtering mail from unknown addresses especially stringently.

In a sense, though, my filters do themselves embody a kind of whitelist (and blacklist) because they are based on entire messages, including the headers. So to that extent they "know" the email addresses of trusted senders and even the routes by which mail gets from them to me. And they know the same about spam, including the server names, mailer versions, and protocols.

_ _ _

If I thought that I could keep up current rates of spam filtering, I would consider this problem solved. But it doesn't mean much to be able to filter out most present-day spam, because spam evolves. Indeed, most antispam techniques so far have been like pesticides that do nothing more than create a new, resistant strain of bugs.

I'm more hopeful about Bayesian filters, because they evolve with the spam. So as spammers start using "c0ck" instead of "cock" to evade simple-minded spam filters based on individual words, Bayesian filters automatically notice. Indeed, "c0ck" is far more damning evidence than "cock", and Bayesian filters know precisely how much more.

Still, anyone who proposes a plan for spam filtering has to be able to answer the question: if the spammers knew exactly what you were doing, how well could they get past you? For example, I think that if checksum-based spam filtering becomes a serious obstacle, the spammers will just switch to mad-lib techniques for generating message bodies.

To beat Bayesian filters, it would not be enough for spammers to make their emails unique or to stop using individual naughty words. They'd have to make their mails indistinguishable from your ordinary mail. And this I think would severely constrain them. Spam is mostly sales pitches, so unless your regular mail is all sales pitches, spams will inevitably have a different character.

And the spammers would also, of course, have to change (and keep changing) their whole infrastructure, because otherwise the headers would look as bad to the Bayesian filters as ever, no matter what they did to the message body. I don't know enough about the infrastructure that spammers use to know how hard it would be to make the headers look innocent, but my guess is that it would be even harder than making the message look innocent.

Assuming they could solve the problem of the headers, the spam of the future will probably look something like this:

    Hey there. Thought you should check out the following:
    http://www.27meg.com/foo

because that is about as much sales pitch as content-based filtering will leave the spammer room to make. (Indeed, it will be hard even to get this past filters, because if everything else in the email is neutral, the spam probability will hinge on the url, and it will take some effort to make that look neutral.)

Spammers range from businesses running so-called opt-in lists who don't even try to conceal their identities, to guys who hijack mail servers to send out spams promoting porn sites. If we use filtering to whittle their options down to mails like the one above, that should pretty much put the spammers on the "legitimate" end of the spectrum out of business; they feel obliged by various state laws to include boilerplate about why their spam is not spam, and how to cancel your "subscription," and that kind of text is easy to recognize.

(I used to think it was naive to believe that stricter laws would decrease spam. Now I think that while stricter laws may not decrease the amount of spam that spammers send, they can certainly help filters to decrease the amount of spam that recipients actually see.)

All along the spectrum, if you restrict the sales pitches spammers can make, you will inevitably tend to put them out of business. That word business is an important one to remember. The spammers are businessmen. They send spam because it works. It works because although the response rate is abominably low (at best 15 per million, vs 3000 per million for a catalog mailing), the cost, to them, is practically nothing. The cost is enormous for the recipients, about 5 man-weeks for each million recipients who spend a second to delete the spam, but the spammer doesn't have to pay that.

Sending spam does cost the spammer something, though. [2] So the lower we can get the response rate, whether by filtering or by using filters to force spammers to dilute their pitches, the fewer businesses will find it worth their while to send spam.

The reason the spammers use the kinds of sales pitches that they do is to increase response rates. This is possibly even more disgusting than getting inside the mind of a spammer, but let's take a quick look inside the mind of someone who responds to a spam. This person is either astonishingly credulous or deeply in denial about their sexual interests. In either case, repulsive or idiotic as the spam seems to us, it is exciting to them. The spammers wouldn't say these things if they didn't sound exciting. And "thought you should check out the following" is just not going to have nearly the pull with the spam recipient as the kinds of things that spammers say now.

Result: if it can't contain exciting sales pitches, spam becomes less effective as a marketing vehicle, and fewer businesses want to use it. That is the big win in the end.

I started writing spam filtering software because I didn't want to have to look at the stuff anymore. But if we get good enough at filtering out spam, it will stop working, and the spammers will actually stop sending it.

_ _ _

Of all the approaches to fighting spam, from software to laws, I believe Bayesian filtering will be the single most effective. But I also think that the more different kinds of antispam efforts we undertake, the better, because any measure that constrains spammers will tend to make filtering easier. And even within the world of content-based filtering, I think it will be a good thing if there are many different kinds of software being used simultaneously. The more different filters there are, the harder it will be for spammers to tune spams to get through them.

Appendix: Examples of Filtering

Here is an example of a spam that arrived while I was writing this article. The fifteen most interesting words in this spam are:

    qvp0045 indira mx-05 intimail $7500 freeyankeedom cdo
    bluefoxmedia jpg unsecured platinum 3d0 qves 7c5 7c266675

The words are a mix of stuff from the headers and from the message body, which is typical of spam. Also typical of spam is that every one of these words has a spam probability, in my database, of .99. In fact there are more than fifteen words with probabilities of .99, and these are just the first fifteen seen.

Unfortunately that makes this email a boring example of the use of Bayes' Rule. To see an interesting variety of probabilities we have to look at this actually quite atypical spam. The fifteen most interesting words in this spam, with their probabilities, are:

    madam            0.99
    promotion        0.99
    republic         0.99
    shortest         0.047225013
    mandatory        0.047225013
    standardization  0.07347802
    sorry            0.08221981
    supported        0.09019077
    people's         0.09019077
    enter            0.9075001
    quality          0.8921298
    organization     0.12454646
    investment       0.8568143
    very             0.14758544
    valuable         0.82347786

This time the evidence is a mix of good and bad. A word like "shortest" is almost as much evidence for innocence as a word like "madam" or "promotion" is for guilt. But still the case for guilt is stronger. If you combine these numbers according to Bayes' Rule, the resulting probability is .9027.
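The combined figure can be reproduced from the fifteen probabilities listed above. The snippet below is only a check of the arithmetic, using the same product-over-products formula given earlier in Lisp; the variable names are chosen for the example.

```python
import math

# The fifteen probabilities from the list above, in the order given.
probs = [0.99, 0.99, 0.99, 0.047225013, 0.047225013, 0.07347802,
         0.08221981, 0.09019077, 0.09019077, 0.9075001, 0.8921298,
         0.12454646, 0.8568143, 0.14758544, 0.82347786]

prod = math.prod(probs)                      # evidence for spam
inverse = math.prod(1.0 - p for p in probs)  # evidence for innocence
combined = prod / (prod + inverse)
print(round(combined, 4))
```

The result should land within rounding of .9027, just over the .9 threshold.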

"Madam" is obviously from spams beginning "Dear Sir or Madam." They're not very common, but the word "madam" never occurs in my legitimate email, and it's all about the ratio.

"Republic" scores high because it often shows up in Nigerian scam emails, and also occurs once or twice in spams referring to Korea and South Africa. You might say that it's an accident that it thus helps identify this spam. But I've found when examining spam probabilities that there are a lot of these accidents, and they have an uncanny tendency to push things in the right direction rather than the wrong one. In this case, it is not entirely a coincidence that the word "Republic" occurs in Nigerian scam emails and this spam. There is a whole class of dubious business propositions involving less developed countries, and these in turn are more likely to have names that specify explicitly (because they aren't) that they are republics.[3]

On the other hand, "enter" is a genuine miss. It occurs mostly in unsubscribe instructions, but here is used in a completely innocent way. Fortunately the statistical approach is fairly robust, and can tolerate quite a lot of misses before the results start to be thrown off.

For comparison, here is an example of that rare bird, a spam that gets through the filters. Why? Because by sheer chance it happens to be loaded with words that occur in my actual email:

    perl       0.01
    python     0.01
    tcl        0.01
    scripting  0.01
    morris     0.01
    graham     0.01491078
    guarantee  0.9762507
    cgi        0.9734398
    paul       0.027040077
    quite      0.030676773
    pop3       0.042199217
    various    0.06080265
    prices     0.9359873
    managed    0.06451222
    difficult  0.071706355

There are a couple pieces of good news here. First, this mail probably wouldn't get through the filters of someone who didn't happen to specialize in programming languages and have a good friend called Morris. For the average user, all the top five words here would be neutral and would not contribute to the spam probability.

Second, I think filtering based on word pairs (see below) might well catch this one: "cost effective", "setup fee", "money back", all pretty incriminating stuff. And of course if they continued to spam me (or a network I was part of), "Hostex" itself would be recognized as a spam term.

177

Finally, here is an innocent email.

178

Its fifteen most interesting words are as follows: continuation 0.01 describe 0.01 continuations 0.01 example 0.033600237 programming 0.05214485 i'm 0.055427782 examples 0.07972858 color 0.9189189 localhost 0.09883721 hi 0.116539136 california 0.84421706 same 0.15981844 spot 0.1654587 us-ascii 0.16804294 what 0.19212411 Most of the words here indicate the mail is an innocent one.

179

There are two bad-smelling words, "color" (spammers love colored fonts) and "California" (which occurs in testimonials and also in menus in forms), but they are not enough to outweigh obviously innocent words like "continuation" and "example".
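To see how lopsided the evidence is, the fifteen probabilities listed for the innocent email, and for the spam that slipped through, can be combined by hand. What follows is a minimal sketch in Python, assuming the usual naive Bayes product rule for combining independent word probabilities; the function and variable names are mine, chosen for illustration.

```python
from math import prod

def combine(probs):
    """Combine per-word spam probabilities, treating the words as independent."""
    spam = prod(probs)                     # evidence that the mail is spam
    innocent = prod(1.0 - p for p in probs)  # evidence that it is not
    return spam / (spam + innocent)

# The fifteen most interesting words of the innocent email.
innocent_mail = [0.01, 0.01, 0.01, 0.033600237, 0.05214485, 0.055427782,
                 0.07972858, 0.9189189, 0.09883721, 0.116539136,
                 0.84421706, 0.15981844, 0.1654587, 0.16804294, 0.19212411]

# The fifteen most interesting words of the spam that got through.
slipped_spam = [0.01, 0.01, 0.01, 0.01, 0.01, 0.01491078, 0.9762507,
                0.9734398, 0.027040077, 0.030676773, 0.042199217,
                0.06080265, 0.9359873, 0.06451222, 0.071706355]

print(combine(innocent_mail))  # vanishingly small
print(combine(slipped_spam))   # also vanishingly small, which is why it got through
```

Under this rule, two or three incriminating words cannot outvote a dozen innocent ones: both results come out many orders of magnitude below any plausible spam threshold.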

180

It's interesting that "describe" rates as so thoroughly innocent.

181

It hasn't occurred in a single one of my 4000 spams. The data turns out to be full of such surprises.

182

One of the things you learn when you analyze spam texts is how narrow a subset of the language spammers operate in.

183

It's that fact, together with the equally characteristic vocabulary of any individual user's mail, that makes Bayesian filtering a good bet.

151–155

Here is a spam that arrived while I was writing this. Its fifteen most interesting words — qvp0045, indira, $7500, freeyankeedom, platinum — mix header and body material, as spam does. Every one has a probability of .99, which makes it a boring example of Bayes' Rule.

156–160

An atypical spam is more interesting, mixing good and bad: madam .99, promotion .99, but shortest .047, sorry .082. "Shortest" is almost as much evidence for innocence as "madam" is for guilt — but the case for guilt is stronger, and Bayes' Rule combines them to .9027.

164–166

Examining spam probabilities, I find a lot of these accidents, and they have an uncanny tendency to push things in the right direction. It's not pure coincidence: there's a whole class of dubious propositions involving less developed countries, which are likelier to have names insisting explicitly — because they aren't — that they are republics.

170–174

For comparison, here is that rare bird, a spam that gets through — because by sheer chance it's loaded with words from my actual email: perl, python, tcl, morris, all .01. It wouldn't get through the filters of someone who didn't specialize in programming languages and have a friend called Morris; for the average user those words would be neutral.

182–183

One thing you learn analyzing spam is how narrow a subset of the language spammers operate in. That fact, plus the equally characteristic vocabulary of each user's mail, is what makes Bayesian filtering a good bet.

150–183

Three worked examples: a boring all-.99 spam, an atypical one combining good and bad words to .9027, and a rare spam that slips through because it's loaded with a programmer's vocabulary — proof of how narrow a slice of language spam occupies.

185

Appendix: More Ideas

186

One idea that I haven't tried yet is to filter based on word pairs, or even triples, rather than individual words.

187

This should yield a much sharper estimate of the probability.

188

For example, in my current database, the word "offers" has a probability of .96.

189

If you based the probabilities on word pairs, you'd end up with "special offers" and "valuable offers" having probabilities of .99 and, say, "approach offers" (as in "this approach offers") having a probability of .1 or less.

190

The reason I haven't done this is that filtering based on individual words already works so well.

191

But it does mean that there is room to tighten the filters if spam gets harder to detect. (Curiously, a filter based on word pairs would be in effect a Markov-chaining text generator running in reverse.)
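The change to the tokenizer would be small. Here is a rough sketch in Python, assuming tokens are runs of alphanumerics, dashes, apostrophes and dollar signs; the probability table is invented for illustration (its numbers echo the "offers" example), and .4 is assumed as the value for a pair never seen before.

```python
import re

def tokens(text):
    """Split text into lowercase tokens (assumed token characters)."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())

def word_pairs(text):
    """Return each adjacent pair of tokens as a single two-word token."""
    words = tokens(text)
    return [f"{a} {b}" for a, b in zip(words, words[1:])]

# Hypothetical pair probabilities, for illustration only.
PAIR_PROBS = {"special offers": 0.99, "valuable offers": 0.99,
              "approach offers": 0.1}
UNSEEN = 0.4  # assumed probability for a pair with no history

def pair_probability(pair):
    return PAIR_PROBS.get(pair, UNSEEN)

for pair in word_pairs("This approach offers special offers"):
    print(pair, pair_probability(pair))
```

The cost is a much larger table, since there are far more distinct pairs than distinct words, so each pair is backed by less data.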

192

Specific spam features (e.g. not seeing the recipient's address in the to: field) do of course have value in recognizing spam.

193

They can be considered in this algorithm by treating them as virtual words.

194

I'll probably do this in future versions, at least for a handful of the most egregious spam indicators.
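As a sketch of what a virtual word might look like, the following Python emits an extra token when the recipient's address is missing from the to: field. The token name is made up; any string that cannot collide with a real word would do. The token would simply be appended to the message's ordinary tokens and would acquire a probability from the corpus like any other.

```python
from email import message_from_string
from email.utils import getaddresses

# Hypothetical virtual word; the asterisks keep it from colliding with real tokens.
VIRTUAL_NOT_IN_TO = "*recipient-not-in-to*"

def virtual_tokens(raw_message, my_address):
    """Return virtual words describing features of the message as a whole."""
    msg = message_from_string(raw_message)
    to_addrs = {addr.lower()
                for _, addr in getaddresses(msg.get_all("To", []))}
    found = []
    if my_address.lower() not in to_addrs:
        found.append(VIRTUAL_NOT_IN_TO)
    return found

bulk = "To: someone@example.com\nSubject: hi\n\nbody text"
print(virtual_tokens(bulk, "me@example.com"))
```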

195

Feature-recognizing spam filters are right in many details; what they lack is an overall discipline for combining evidence.

196

Recognizing nonspam features may be more important than recognizing spam features.

197

False positives are such a worry that they demand extraordinary measures.

198

I will probably in future versions add a second level of testing designed specifically to avoid false positives.

199

If a mail triggers this second level of filters it will be accepted even if its spam probability is above the threshold.

200

I don't expect this second level of filtering to be Bayesian.

201

It will inevitably be not only ad hoc, but based on guesses, because the number of false positives will not tend to be large enough to notice patterns. (It is just as well, anyway, if a backup system doesn't rely on the same technology as the primary system.)
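The control flow I have in mind is roughly the following. This is only a sketch: the .9 threshold is an assumed cutoff, and the single rescue rule (mail from a known correspondent) and the example address are invented to show the shape of the thing, not rules I have actually chosen.

```python
SPAM_THRESHOLD = 0.9  # assumed cutoff for the Bayesian first level

# Hypothetical data for a hypothetical rule.
KNOWN_CORRESPONDENTS = {"friend@example.com"}

def from_known_correspondent(message):
    """Ad hoc rescue rule: the sender is someone I already write to."""
    return message.get("from", "").lower() in KNOWN_CORRESPONDENTS

RESCUE_RULES = [from_known_correspondent]

def classify(spam_probability, message, rescue_rules=RESCUE_RULES):
    """Accept mail below the threshold, or above it if any rescue rule fires."""
    if spam_probability <= SPAM_THRESHOLD:
        return "accept"
    if any(rule(message) for rule in rescue_rules):
        return "accept"  # second level overrides the Bayesian verdict
    return "spam"

print(classify(0.99, {"from": "friend@example.com"}))
print(classify(0.99, {"from": "stranger@example.net"}))
```

Note that the second level can only rescue mail, never condemn it, so adding a bad rule costs a missed spam rather than a lost message.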

202

Another thing I may try in the future is to focus extra attention on specific parts of the email.

203

For example, about 95% of current spam includes the url of a site they want you to visit. (The remaining 5% want you to call a phone number, reply by email or to a US mail address, or in a few cases to buy a certain stock.)

204

The url is in such cases practically enough by itself to determine whether the email is spam.

205

Domain names differ from the rest of the text in a (non-German) email in that they often consist of several words stuck together.

206

Though computationally expensive in the general case, it might be worth trying to decompose them.

207

If a filter has never seen the token "xxxporn" before, the token will have an individual spam probability of .4, whereas "xxx" and "porn" individually have probabilities (in my corpus) of .9889 and .99 respectively, and a combined probability of .9998.

208

I expect decomposing domain names to become more important as spammers are gradually forced to stop using incriminating words in the text of their messages. (A url with an ip address is of course an extremely incriminating sign, except in the mail of a few sysadmins.)
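Here is one way the decomposition might go, sketched in Python. It tries to cover an unseen token exactly with words already in the table, and falls back to .4 if it can't. The splitting routine is a plain word-break search of my own choosing, and the table holds only the two probabilities quoted above.

```python
from math import prod

UNKNOWN = 0.4                          # probability for a never-seen token
TABLE = {"xxx": 0.9889, "porn": 0.99}  # the two values quoted from my corpus

def combine(probs):
    """Naive Bayes combination of independent word probabilities."""
    spam = prod(probs)
    return spam / (spam + prod(1.0 - p for p in probs))

def split_known(token, vocabulary):
    """Return known words that cover token exactly, or None if none do."""
    best = [None] * (len(token) + 1)
    best[0] = []
    for end in range(1, len(token) + 1):
        for start in range(end):
            piece = token[start:end]
            if best[start] is not None and piece in vocabulary:
                best[end] = best[start] + [piece]
                break
    return best[len(token)]

def token_probabilities(token, table):
    """Probabilities for a token, decomposing it if it is unknown."""
    if token in table:
        return [table[token]]
    parts = split_known(token, table)
    if parts is None:
        return [UNKNOWN]
    return [table[part] for part in parts]

print(token_probabilities("xxxporn", TABLE))
print(combine(token_probabilities("xxxporn", TABLE)))  # between .9998 and .9999
```

With a real vocabulary a token may split in several ways, and short words will match by accident, so a practical version would want a minimum piece length or some way of choosing among splits.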

209

It might be a good idea to have a cooperatively maintained list of urls promoted by spammers.

210

We'd need a trust metric of the type studied by Raph Levien to prevent malicious or incompetent submissions, but if we had such a thing it would provide a boost to any filtering software.

211

It would also be a convenient basis for boycotts.

212

Another way to test dubious urls would be to send out a crawler to look at the site before the user looked at the email mentioning it.

213

You could use a Bayesian filter to rate the site just as you would an email, and whatever was found on the site could be included in calculating the probability of the email being a spam.

214

A url that led to a redirect would of course be especially suspicious.

215

One cooperative project that I think really would be a good idea would be to accumulate a giant corpus of spam.

216

A large, clean corpus is the key to making Bayesian filtering work well.

217

Bayesian filters could actually use the corpus as input.

218

But such a corpus would be useful for other kinds of filters too, because it could be used to test them.

219

Creating such a corpus poses some technical problems. We'd need trust metrics to prevent malicious or incompetent submissions, of course.

220

We'd also need ways of erasing personal information (not just to-addresses and ccs, but also e.g. the arguments to unsubscribe urls, which often encode the to-address) from mails in the corpus.
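A first pass at such scrubbing might look like this Python sketch. It blanks the to: and cc: lines and drops the arguments from every url while keeping the host and path, since those are still useful evidence. The regular expressions are deliberately crude assumptions: they ignore folded header lines, would also rewrite a body line that happens to begin with "To:", and do nothing about an address encoded in the path of a url.

```python
import re
from urllib.parse import urlsplit, urlunsplit

URL_RE = re.compile(r"https?://[^\s<>\"']+")
ADDRESS_HEADER_RE = re.compile(r"^(To|Cc):.*$", re.IGNORECASE | re.MULTILINE)

def strip_url_arguments(url):
    """Keep scheme, host and path; drop the query string and fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

def scrub(raw_message):
    """Remove recipient addresses and url arguments from a raw message."""
    text = ADDRESS_HEADER_RE.sub(lambda m: m.group(1) + ": [removed]",
                                 raw_message)
    return URL_RE.sub(lambda m: strip_url_arguments(m.group(0)), text)

raw = ("To: me@example.com\nCc: you@example.com\nSubject: deal\n\n"
       "Click http://example.com/unsub?u=me%40example.com to leave.")
print(scrub(raw))
```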

221

If anyone wants to take on this project, it would be a good thing for the world.

186–191

One idea I haven't tried is filtering on word pairs or triples. "Offers" has probability .96, but "special offers" would be .99 while "approach offers" would be .1 or less. I haven't done it because single words already work — but it leaves room to tighten the filters. (Curiously, a word-pair filter is in effect a Markov text generator running in reverse.)

196–201

Recognizing nonspam features may matter more than spam ones, because false positives demand extraordinary measures. I'll probably add a second level of testing designed to avoid them: trip it and the mail is accepted even above the threshold. This level won't be Bayesian — it'll be ad hoc and based on guesses, which is just as well, since a backup shouldn't rely on the same technology as the primary.

205–208

Domain names often consist of several words stuck together, so it might be worth decomposing them. A filter that's never seen "xxxporn" gives it .4, whereas "xxx" and "porn" individually are .9889 and .99, combining to .9998 — and I expect this to matter more as spammers are forced to drop incriminating words.

215–221

The cooperative project I'd most like is a giant, clean corpus of spam — the key to making Bayesian filtering work well, and useful for testing other filters. It would need trust metrics against bad submissions and ways of erasing personal information, including the arguments to unsubscribe urls that often encode the to-address. If anyone takes this on, it would be a good thing for the world.

185–221

Future improvements: filtering on word pairs, treating spam features as virtual words, a second non-Bayesian level guarding against false positives, decomposing domain names, and a cooperatively maintained corpus and url list — all guarded by trust metrics.

223

Appendix: Defining Spam

224

I think there is a rough consensus on what spam is, but it would be useful to have an explicit definition.

225

We'll need to do this if we want to establish a central corpus of spam, or even to compare spam filtering rates meaningfully.

226

To start with, spam is not unsolicited commercial email.

227

If someone in my neighborhood heard that I was looking for an old Raleigh three-speed in good condition, and sent me an email offering to sell me one, I'd be delighted, and yet this email would be both commercial and unsolicited.

228

The defining feature of spam (in fact, its raison d'etre) is not that it is unsolicited, but that it is automated.

229

It is merely incidental, too, that spam is usually commercial.

230

If someone started sending mass email to support some political cause, for example, it would be just as much spam as email promoting a porn site.

231

I propose we define spam as unsolicited automated email.

232

This definition thus includes some email that many legal definitions of spam don't.

233

Legal definitions of spam, influenced presumably by lobbyists, tend to exclude mail sent by companies that have an "existing relationship" with the recipient.

234

But buying something from a company, for example, does not imply that you have solicited ongoing email from them.

235

If I order something from an online store, and they then send me a stream of spam, it's still spam.

236

Companies sending spam often give you a way to "unsubscribe," or ask you to go to their site and change your "account preferences" if you want to stop getting spam.

237

This is not enough to stop the mail from being spam.

238

Not opting out is not the same as opting in.

239

Unless the recipient explicitly checked a clearly labelled box (whose default was no) asking to receive the email, it is spam.

240

In some business relationships, you do implicitly solicit certain kinds of mail.

241

When you order online, I think you implicitly solicit a receipt, and notification when the order ships.

242

I don't mind when Verisign sends me mail warning that a domain name is about to expire (at least, if they are the actual registrar for it).

243

But when Verisign sends me email offering a FREE Guide to Building My E-Commerce Web Site, that's spam.

224–230

An explicit definition would help. Spam is not unsolicited commercial email: if a neighbor heard I wanted an old Raleigh three-speed and emailed to sell me one, I'd be delighted — yet that's both commercial and unsolicited. The defining feature of spam, its raison d'etre, is not that it's unsolicited but that it's automated. It's also incidental that spam is usually commercial: mass email for a political cause is just as much spam.

231–239

So I propose we define spam as unsolicited automated email. This catches mail that legal definitions — influenced presumably by lobbyists — exempt when a company has an "existing relationship" with you. But buying something doesn't imply you solicited ongoing email; an "unsubscribe" link doesn't make it not spam. Not opting out is not the same as opting in: unless you explicitly checked a clearly labelled box, defaulting to no, it is spam.

240–243

In some relationships you do implicitly solicit mail: ordering online, you solicit a receipt and a shipping notice. I don't mind when Verisign warns me a domain is about to expire — if they're the actual registrar. But when Verisign emails offering a FREE Guide to Building My E-Commerce Web Site, that's spam.

223–243

Spam isn't merely unsolicited or commercial mail — its defining feature is that it's automated. I propose defining spam as unsolicited automated email, which catches mail that "existing relationship" legal loopholes exempt: not opting out is not the same as opting in.

245

Notes:

246

[1] The examples in this article are translated into Common Lisp for, believe it or not, greater accessibility. The application described here is one that we wrote in order to test a new Lisp dialect called Arc that is not yet released.

247

[2] Currently the lowest rate seems to be about $200 to send a million spams. That's very cheap, 1/50th of a cent per spam. But filtering out 95% of spam, for example, would increase the spammers' cost to reach a given audience by a factor of 20. Few can have margins big enough to absorb that.

248

[3] As a rule of thumb, the more qualifiers there are before the name of a country, the more corrupt the rulers. A country called The Socialist People's Democratic Republic of X is probably the last place in the world you'd want to live.

249

Thanks to Sarah Harlin for reading drafts of this; Daniel Giffin (who is also writing the production Arc interpreter) for several good ideas about filtering and for creating our mail infrastructure; Robert Morris, Trevor Blackwell and Erann Gat for many discussions about spam; Raph Levien for advice about trust metrics; and Chip Coldwell and Sam Steingold for advice about statistics.

247

The lowest rate to send a million spams is about $200 — 1/50th of a cent each — but filtering out 95% would raise the spammer's cost to reach a given audience twentyfold, and few have margins big enough to absorb that.

248

As a rule of thumb, the more qualifiers before the name of a country, the more corrupt the rulers: a country called The Socialist People's Democratic Republic of X is probably the last place you'd want to live.

245–249

The code examples are translated into Common Lisp from Arc, a not-yet-released dialect the mail reader was built to test; sending spam costs about $200 per million, so filtering 95% raises a spammer's cost twentyfold; and the more qualifiers before a country's name, the more corrupt its rulers.

251

More Info:

252

Plan for Spam FAQ
Better Bayesian Filtering
Filters that Fight Back
Will Filters Kill Spam?
Probability
Spam is Different
Filters vs. Blacklists
Trust Metrics
Filtering Research
Microsoft Patent
Slashdot Article
The Wrong Way
LWN: Filter Comparison
CRM114 gets 99.87%

251–252

Further reading: the Plan for Spam FAQ, Better Bayesian Filtering, Filters that Fight Back, Will Filters Kill Spam, and related links on probability, trust metrics, and filter comparisons.