Statistical Spam Filter Review
Since October 2002 I have used a probability-based statistical
spam filter script, written in ksh and awk and inspired by
Paul Graham.
It is a personal email filter I wrote for Unix/Linux-type hosts, not a
mail server filter. The link above includes the script and how I think it
works. The spam filter is not Bayesian, but it seems to have results
similar to other probability-based spam filters I have seen reviewed,
with similar strengths and weaknesses.
A statistical approach by itself may misclassify
some good mail; for my filter, so far, that has been HTML-encoded mail.
However, combined with a "learning" type whitelist of good addresses, it
limits that problem to good email whose address is unknown
to the whitelist. For my mail mix that is a very small subset of mail.
The whitelist consists of addresses that I generate from my body of
"good" email when I train the spam filter.
Problems could occur with my whitelist approach. When the whitelist
passes an email on, it is "trusted" and no further
filtering happens. This could pass an email sent by a worm or virus
that has attacked the host of a whitelisted address, or an email with
a forged address that happens to be on the whitelist. I have tried to
remove the most commonly forged addresses, like my own, but I have still
noticed a few virus attacks get through this way, though only a very few.
The whitelist improves the overall performance of the filter, as it is
several orders of magnitude faster than the rest of the filtering script.
The filter runs the whitelist first, then the probability filter.
With the current set of training email it takes about 0.8 seconds to
filter an email on an old Sun Ultra-80 using NAWK when the sender's
address is not recognized by the whitelist. A 750 MHz AMD Linux box
using GAWK is much faster.
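The ordering looks roughly like this (a sketch only, not the actual
script; the whitelist file, mailbox path, and spamfilter command are
placeholders):

    #!/bin/ksh
    # Sketch of the two-stage check; the names below are made up.
    msg=$1

    # Cheap test first: is the sender already known to be good?
    addr=$(awk 'tolower($1) == "from:" {
            if (match($0, /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+/))
                print tolower(substr($0, RSTART, RLENGTH))
            exit
        }' "$msg")

    if [ -n "$addr" ] && grep -q -x -F "$addr" whitelist; then
        cat "$msg" >> "$HOME/Mail/inbox"   # trusted, no further filtering
    else
        spamfilter "$msg"                  # slow probability filter
    fi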
In my filter script most of the computing time is spent loading
the associative array with tens of thousands of tokens and spaminess values.
This happens for each email received. The bigger
this associative array, the slower the filter; it currently holds 40,000+
tokens. Lookups on an associative array are very fast, and the
average number of tokens looked up per email does not change much; it is
loading the array into memory that is the performance problem.
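The pattern looks something like this (a sketch assuming a token table
of "token probability" lines; the real format is in the script linked
above):

    # Sketch of the load-then-lookup pattern.
    awk '
        # First file: load 40,000+ token/spaminess pairs into memory.
        # This load dominates the run time for every single email.
        NR == FNR { prob[$1] = $2; next }

        # Second file: look up each token of the email; each lookup
        # in the associative array is cheap.
        {
            for (i = 1; i <= NF; i++)
                if ($i in prob)
                    print $i, prob[$i]
        }
    ' token.table message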
Changing languages and using DBM-type files instead of building a large
associative array in memory might speed performance, as GAWK does not
have a DBM interface. Running the filter as a daemon somewhere in the
processing of mail would also limit the performance problem to starting
the daemon, when the token-value associative array is created in memory,
but that would require a redesign and possibly privileges
a regular user does not have.
To stabilize current performance and space I limit the training set of emails.
The current mix of training email has 1,700 spam emails and 1,600 good emails.
Every so often I prune back the spam mail, and more rarely the good mail files.
As a result, the set of training mail does not reflect my real-world
80-90% spam rate but is closer to a 50-50 mix, which I am sure biases my
good-token spam scores. I have not tested what it does to my filter
effectiveness, and I may look at this later. 7-10 MB is all the space
I want to devote to the filter and its files of email and spam.
A test on one thousand mails not seen in the filter's training caught
93+% of the spam; this filter works well enough for me.
These results are low compared to some other probability-based filters,
but higher than most rule-based filters. I do not claim to be the best,
just OK. The test script is bundled with the filter script referenced above.
Real-world results may be better (or worse): the 1000 emails were run
through as a batch, so the filter was not retrained as they were
processed and no additional "learning" happened. I generally
train the filter every few days; the 1000 emails tested were about a
month's worth of mail. In the test set about 800 mails were spam and about
200 were good, and no good mail was misclassified in this test, but that
was just the luck of the draw in the sample.
LATER NOTE: I did look at the ratio of spam to good email in my training
set, and I found much better results when I changed to a spam/good ratio
closer to the actual mix of spam coming in the mail. I was able to cut
my training set to about 1,000 spam emails and 200-300 good emails,
and it has been a month since a spam got through.
The incidents of misclassifying good email as spam over the last year or so
were all the same type: HTML-encoded email. The incidents are rare,
about 1 in 1,000 good emails, but painful. An example was an HTML-encoded
rebate email from Dell. What a pain: the HTML encoding gave it a super high
spam rating, while the same message decoded to ASCII text had a very low one.
The spam email that does get through the filter mostly falls into the following
categories:
- Base64-encoded MIME.
I do not decode these, and there are very few tokens that the filter
finds "interesting" in the header. The long tokens in the
base64 message body are not common enough to be recognized by the filter.
Decoding and then filtering the email would remove this problem.
However, this type of spam has not been very common in my mail
over the last few months.
- Nigerian scams and other very long-winded spams.
These seem the most common to escape the filter, though fewer passed
as the training set of email got larger. These spams use
lots of tokens, burying the spam under a blizzard of words,
some of which counter the weight of the spam words. This may be
an effect of the way I calculate "spaminess", as an average of token
probability, not a typical Bayes-type calculation (see the sketch after
this list). See my link above to my script for my method of calculating
the spaminess of email.
- HTML comment-loaded text and fake HTML tags.
These bust the spammy tokens into pieces and load up the filter with
a lot of tokens. Sometimes they have a Base64-like effect, creating long
tokens that are not common enough to be recognized by the filter.
These have become much less common lately in my mail mix.
- Further work:
Decoding encoded emails (Base64 MIME and HTML)
would catch many of the problems this filter has with MIME- and
HTML-encoded spam, including the misclassification of good email. The
Nigerian scam type of spam would still be a problem; it is probably
an effect of the method I use to calculate my spam values (see my link
above for the lame method I use).
Multiple-token classification is another way of improving the
filter's accuracy. My current filter looks at single tokens, not
token stream subsets or token proximity. Token subsets like phrase analysis
are already used in probability-based classification systems.
CRM114 uses phrase analysis
techniques and has links that explain some of these methods in greater
detail.
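For the curious, the "average" scoring mentioned above works roughly like
this (a sketch, not the exact script; the token.table file, the 0.5
no-evidence default, and the 0.9 threshold are illustrative assumptions):

    # Sketch: spaminess as the plain mean of known-token probabilities.
    # A blizzard of neutral or "good" words drags the average down,
    # which is how the long-winded scams slip through.
    awk '
        NR == FNR { prob[$1] = $2; next }     # load token/probability table
        {
            for (i = 1; i <= NF; i++)
                if ($i in prob) { sum += prob[$i]; n++ }
        }
        END {
            score = (n > 0) ? sum / n : 0.5   # 0.5 = no evidence either way
            verdict = (score > 0.9) ? "SPAM" : "GOOD"
            print verdict, score
        }
    ' token.table message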
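And a hypothetical taste of the multiple-token idea: instead of scoring
single words, score adjacent pairs, so a word next to another word can
carry more weight than either alone. (CRM114's phrase analysis is far
more elaborate than this.)

    # Sketch: emit adjacent word pairs (bigrams) instead of single
    # tokens; the pairs would then be looked up in their own table.
    awk '
        {
            for (i = 1; i < NF; i++)
                print $i " " $(i + 1)
        }
    ' message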
Would I use unfiltered mail again? No. Even my unsophisticated, crude
statistical filter script works better than unfiltered email or rule-based
schemes. My results reflect the characteristics of my filter
and the mail I receive, but they are not far out of line with other
statistical filters I have seen reviewed.