TUNING BOGOFILTER'S ROBINSON-FISHER METHOD -- a HOWTO
(Greg Louis, April 2003)

CONTENTS

  Introduction
  Robinson's x
  Robinson's s
  The minimum deviation
  The spam and nonspam cutoffs
  How often to tune
  A note on training

INTRODUCTION

Currently, bogofilter offers a choice of three classification methods: the original as proposed by Paul Graham and implemented in bogofilter by Eric S. Raymond; a variation proposed by Gary Robinson and implemented in bogofilter by Greg Louis; and a further variation, also proposed by Gary Robinson, which uses Fisher's method (Fisher, R. A., 1950: Statistical Methods for Research Workers, pp. 99ff. London: Oliver and Boyd) of combining probabilities; for bogofilter, your author has implemented this one too.

Both of Gary Robinson's proposed classification methods work better than the original. For optimal results, they (and the original) require some tuning. As distributed, bogofilter attempts to supply good starting values for the tunable parameters. Since the optimum values depend on the size and content of the wordlists at _your_ installation, the best results can only be determined by some experimentation using _your_ wordlists. After several thousand each of spam and nonspam messages have been fed to bogofilter for training, this experimentation can be well worthwhile: you may be able to cut the number of spams that are still getting through to half or less.

The purpose of this document is to explain how to adjust bogofilter's parameters for best results with the Robinson-Fisher classification method. The author recommends using only this method, not Robinson-geometric-mean or Graham, for spam filtering. With Robinson's changes as implemented in bogofilter, there are four things to tune, and they're highly interdependent, as explained below.

ROBINSON'S x

First off, you should determine the value of x appropriate to your training database. The way bogofilter works, summarizing briefly, is that the message being classified is separated into "tokens" -- words, IP addresses and other logical units of information. Each token is looked up in the wordlists that make up the training database. The number of times it's been seen in a spam message is divided by the total number of times it's been seen, and that gives an indication of the likelihood that the token is in a spam message. The likelihood estimates for all the tokens are combined to give a score between 0 and 1 -- 0 means the message is not likely to be a spam, 1 means it is.

That's fine, but what happens if a token's never been seen before? That's where x comes in: it's a "first guess" at what the presence of an unknown token means, in terms of its contribution to the score. It's the value used as the likelihood estimate when a new token is found.

Obtaining a value for x is easy: bogoutil, a program that's supplied with bogofilter, will calculate it for you. Assuming your bogofilter wordlists are in ~/.bogofilter, run

  mkdir scratch
  cp ~/.bogofilter/*.db scratch
  bogoutil -R scratch

This will put a token .ROBX in scratch/spamlist.db that has the value calculated for x, multiplied by 1,000,000. To view it, do

  bogoutil -w scratch .ROBX

and you will see two lines something like

            spam   good
  .ROBX   425175      0

which means that the x value calculated for your training database was 0.425175. If it looks reasonable (not too far from 0.5), you can run

  bogoutil -R ~/.bogofilter

to install the calculated value so bogofilter will use it from then on.
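Before going on to how bogoutil arrives at x, here is a minimal Python sketch of the scoring idea just described (this is not bogofilter's code; the token counts are invented, and the x value is simply the example figure shown above):

  # Illustrative only: hypothetical token counts, not from any real wordlist.
  good_counts = {"meeting": 25, "offer": 12}   # times seen in nonspam training messages
  spam_counts = {"meeting": 1, "offer": 30}    # times seen in spam training messages
  x = 0.425175  # "first guess" for unknown tokens, e.g. as reported by bogoutil -R

  def naive_likelihood(token):
      """Naive spam likelihood: spam count over total count; x if never seen."""
      bad = spam_counts.get(token, 0)
      good = good_counts.get(token, 0)
      if bad + good == 0:
          return x                    # unknown token: fall back to the first guess
      return bad / (bad + good)

  print(naive_likelihood("offer"))     # 30/42, about 0.71
  print(naive_likelihood("meeting"))   # 1/26, about 0.04
  print(naive_likelihood("brandnew"))  # 0.425175, the x value

This is the unscaled version of the calculation; the text that follows refines it.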
The value of x that bogoutil calculates is just the average of

  p(w) = badcount / (goodcount + badcount)

for every word in your training set that appears at least 10 times in one or both wordlists (i.e. badcount + goodcount >= 10).

In the last few paragraphs I've oversimplified, for the sake of explaining the basic concept clearly: in real life, you have to scale the counts somehow. If you had exactly the same number of messages contributing to your good and bad wordlists, the formula for p(w) given above would be ok; but we actually have to use

  p(w) = (badcount / badlist_msgcount) / (badcount / badlist_msgcount + goodcount / goodlist_msgcount)

where badlist_msgcount is the number of spam messages that were fed into the training set, and goodlist_msgcount is the number of nonspams used in training. An equivalent way of calculating x, that's a little easier to read, is:

  scalefactor = badlist_msgcount / goodlist_msgcount
  p(w) = badcount / (badcount + goodcount * scalefactor)

In either case, x is the average of those p(w) values.

ROBINSON'S s

With a count of zero for a given token, we have only x to go on, so that's what we use as the likelihood estimate for that token. The question then arises: what if we have seen the token before, but only a few times? Statistical variation will make the ratio of two small numbers rather unreliable, so perhaps we ought to compromise between that ratio and our "first guess" value. This is what the Robinson method does.

The compromise works like this: a parameter s is defined that serves as a weighting factor; the larger the value of s, the more importance is given to x in the presence of low token counts. The token count is

  n = badcount + goodcount

and the p(w) value for the token is modified as follows:

  f(w) = (s * x + n * p(w)) / (s + n)

As you see, if n is zero, f(w) is just x, as we have been saying all along; but if n is nonzero and small, the s * x term will influence the f(w) value. Parameters s and x are only important when the counts are small, and the value of s reflects what we think of as "small." If s is large, then when counts are small we trust our x value more than we do the p(w); if s is small, we give more weight to p(w) and less to x.

So how big should s be? Gary Robinson suggested, on a theoretical basis, that we start with a value of 1; it might be worth trying values in the range of 0.01 to 10. Making s smaller than 0.01 or so is a bad idea, because of what happens when a token is encountered that's been seen in spam before, but never in nonspam, or vice versa. In that case, p(w) is exactly 1 or 0, and f(w) will vary greatly as the value of s diminishes. As an example, one spam that had about ten such tokens, out of 78 that contributed to the spam score, scored 0.999 with s set to 0.001, and was found to score 0.505 when s was 1.0e-8! Choosing a value for s is a matter of trial and error. Experience suggests that 0.1 might be a better starting value than 1, at least if your training database is moderately large.
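The two formulas above are easy to play with. Here is a small Python sketch (the message counts, token counts, s and x below are invented example values, not measurements from any real installation) showing the scaled p(w) and the smoothed f(w), and in particular how s pulls a spam-only token back toward x:

  # Example values only; substitute your own wordlist statistics.
  badlist_msgcount = 6000     # spam messages used in training
  goodlist_msgcount = 9000    # nonspam messages used in training
  x = 0.425175                # first-guess value, as installed by bogoutil -R
  s = 0.1                     # Robinson's s

  def p_w(badcount, goodcount):
      """Scaled per-token spam probability."""
      scalefactor = badlist_msgcount / goodlist_msgcount
      return badcount / (badcount + goodcount * scalefactor)

  def f_w(badcount, goodcount, s=s):
      """Robinson's smoothed estimate: a compromise between p(w) and x."""
      n = badcount + goodcount
      if n == 0:
          return x
      return (s * x + n * p_w(badcount, goodcount)) / (s + n)

  # A token seen twice in spam and never in nonspam:
  print(p_w(2, 0))            # exactly 1.0
  print(f_w(2, 0))            # about 0.97 with s = 0.1
  print(f_w(2, 0, s=1.0))     # about 0.81 with s = 1: pulled much closer to x

Lowering s much further (say to 1.0e-8) drives f(w) for such single-list tokens to essentially 0 or 1, which is what makes very small s values risky, as described above.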
THE MINIMUM DEVIATION

MIN_DEV is another thing you might need to vary. Paul Graham's original method was based on looking only at the fifteen tokens with p(w) values farthest from 0.5 (nearest to 0 or 1). We don't do it that way; instead, we set a minimum deviation from 0.5, and look at all the tokens with f(w) values farther away than that. If the minimum is set to zero, every token in the message contributes its spammishness value f(w) to the final calculation.

It might save time, and perhaps improve discrimination, to ignore tokens with f(w) values near 0.5, since those tokens obviously make less difference to the outcome of the calculation. It seems helpful, at least once the training database is a good size, to use a MIN_DEV value between 0.3 and 0.46. You might try 0.42 or 0.44 initially. Note that higher values of MIN_DEV accentuate the distortion caused by small s, because tokens appearing in only one of the two wordlists will (if present) be a higher proportion of the total number of tokens considered.

THE SPAM AND NONSPAM CUTOFFS

Most spam messages will have scores very close to 1 when the Fisher method of combining the likelihood estimates is applied, and most nonspam will have scores very close to 0. In between, there is a grey area, where messages will fall that have both spammish and nonspammish characteristics. The spam and nonspam cutoffs are thresholds: bogofilter classes messages with scores below the nonspam cutoff as nonspam, and those with scores at or above the spam cutoff as spam. Messages with scores between the two cutoff values are classed as uncertain. (Usually, mail administrators will want to deliver mail classed as uncertain, even though some of it may well be spam.)

The best value for the spam cutoff depends strongly on the values of s and MIN_DEV that are being used in the calculation. For that reason, the way to test a given pair of s and MIN_DEV values is the following:

  1. Determine a suitable level of false positives (nonspams classified as spam); this will probably need to fall somewhere in the range of 0.05 to 0.3 percent (the lower the better, of course, except that lowering the false-positive target too much gives too many false negatives).

  2. Apply the parameters to classify a number of known nonspam messages. From the distribution of scores, pick a spam cutoff value that will give the selected proportion of false positives (a sketch of this step appears at the end of this section).

  3. Apply the parameters and the chosen spam cutoff to classify a number of spam messages; the parameter set should be judged on how few false negatives (spam messages classed as uncertain or nonspam) are obtained.

The value for the nonspam threshold should be such that no more than about one in ten thousand spams is classified as nonspam. A value of 0.2 to 0.25 might be a good starting point for use with the recommended s and MIN_DEV (0.1 and 0.44 respectively).
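As an illustration of steps 2 and 3, here is a rough Python sketch (the score distributions below are synthetic stand-ins generated at random; real scores would come from running bogofilter, with the candidate s and MIN_DEV values, over your own collections of known nonspam and known spam):

  import random

  def pick_spam_cutoff(nonspam_scores, fp_target):
      """Pick a cutoff so that roughly fp_target of known nonspams score at or above it."""
      ranked = sorted(nonspam_scores, reverse=True)
      allowed = int(len(ranked) * fp_target)     # tolerated false positives
      if allowed >= len(ranked):
          return 0.0
      # Cutoff just above the highest score we still want classed as nonspam
      return min(1.0, ranked[allowed] + 1e-9)

  def false_negative_rate(spam_scores, cutoff):
      """Fraction of known spams that would NOT be classed as spam."""
      return sum(1 for sc in spam_scores if sc < cutoff) / len(spam_scores)

  random.seed(1)
  # Synthetic stand-ins for bogofilter scores: nonspam clustered near 0, spam near 1.
  nonspam_scores = [min(1.0, max(0.0, random.gauss(0.05, 0.10))) for _ in range(5000)]
  spam_scores = [min(1.0, max(0.0, random.gauss(0.95, 0.20))) for _ in range(5000)]

  cutoff = pick_spam_cutoff(nonspam_scores, fp_target=0.001)   # aim for 0.1% false positives
  print("spam cutoff:", round(cutoff, 3))
  print("false negatives: %.2f%%" % (100 * false_negative_rate(spam_scores, cutoff)))

Comparing the false-negative figure obtained this way for each candidate pair of s and MIN_DEV values shows which pair discriminates best at the chosen false-positive level.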
HOW OFTEN TO TUNE

It's probably wise to review the spam cutoff frequently; if the false-negative count is gratifyingly low, or if false positives are occurring, increasing the cutoff will reduce the false-positive rate. If, on the other hand, there are absolutely no false positives but the false-negative rate is high, lowering the cutoff a bit may improve discrimination.

How often to tune the other parameters depends on how fussy you are about optimizing performance. If you're eager to get the best discrimination you can, here are my recommendations: In the early stage, while the training database is still small and growing rapidly, it's probably wise to experiment with tuning x, s and MIN_DEV once a month or so. Once the training database is a good size (over 5000 spams and 5000 nonspams), this can be reduced to quarterly or half-yearly.

A NOTE ON TRAINING

Bogofilter's ability to distinguish accurately between spam and nonspam messages depends on the quality of its training database. Here is a way of maximizing that quality with relatively little effort.

When starting afresh, feed every spam and every nonspam you get into bogofilter. Do not use bogofilter's -u option to do this: there will be far too many errors when your training database is small. Instead, classify messages manually and train bogofilter with the -n and -s options appropriately. You can do it in batches: if you work with standard mbox files, use a mail reader to move spam and nonspam into separate files, then do "bogofilter -s