TUNING BOGOFILTER'S ROBINSON-FISHER METHOD -- a HOWTO
(Greg Louis, April 2003)

CONTENTS

  Introduction
  Robinson's x
  Robinson's s
  The minimum deviation
  The spam and nonspam cutoffs
  How often to tune
  A note on training

INTRODUCTION

Currently, bogofilter offers a choice of three classification methods: the original as proposed by Paul Graham and implemented in bogofilter by Eric S. Raymond; a variation proposed by Gary Robinson and implemented in bogofilter by Greg Louis; and a further variation, also proposed by Gary Robinson, which uses Fisher's method (Fisher, R. A., 1950: Statistical Methods for Research Workers, pp. 99ff. London: Oliver and Boyd) of combining probabilities; for bogofilter, your author has implemented this one too.

Both of Gary Robinson's proposed classification methods work better than the original. For optimal results, they (and the original) require some tuning. As distributed, bogofilter attempts to supply good starting values for the tunable parameters. Since the optimum values depend on the size and content of the wordlists at _your_ installation, the best results can only be determined by some experimentation using _your_ wordlists. After several thousand each of spam and nonspam messages have been fed to bogofilter for training, this experimentation can be well worthwhile: you may be able to cut the number of spams that are still getting through to half or less.

The purpose of this document is to explain how to adjust bogofilter's parameters for best results with the Robinson-Fisher classification method. The author recommends using only this method, not Robinson-geometric-mean or Graham, for spam filtering. With Robinson's changes as implemented in bogofilter, there are four things to tune, and they're highly interdependent, as explained below.

ROBINSON'S x

First off, you should determine the value of x appropriate to your training database. The way bogofilter works, summarizing briefly, is that the message being classified is separated into "tokens" -- words, IP addresses and other logical units of information. Each token is looked up in the wordlists that make up the training database. The number of times it's been seen in a spam message is divided by the total number of times it's been seen, and that gives an indication of the likelihood that the token is in a spam message. The likelihood estimates for all the tokens are combined to give a score between 0 and 1 -- 0 means the message is not likely to be a spam, 1 means it is.

That's fine, but what happens if a token's never been seen before? That's where x comes in: it's a "first guess" at what the presence of an unknown token means, in terms of its contribution to the score. It's the value used as the likelihood estimate when a new token is found.

Obtaining a value for x is easy: bogoutil, a program that's supplied with bogofilter, will calculate it for you. Assuming your bogofilter wordlists are in ~/.bogofilter, run

  mkdir scratch
  cp ~/.bogofilter/*.db scratch
  bogoutil -R scratch

This will put a token .ROBX in scratch/spamlist.db that has the value calculated for x, multiplied by 1,000,000. To view it, do

  bogoutil -w scratch .ROBX

and you will see two lines something like

            spam   good
  .ROBX   425175      0

which means that the x value calculated for your training database was 0.425175. If it looks reasonable (not too far from 0.5), you can run

  bogoutil -R ~/.bogofilter

to install the calculated value so bogofilter will use it from then on.
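Before going on to how bogoutil arrives at x, here is a minimal Python sketch of the scoring idea just described (this is not bogofilter's code; the token counts are invented, and the x value is simply the example figure shown above):

  # Illustrative only: hypothetical token counts, not from any real wordlist.
  good_counts = {"meeting": 25, "offer": 12}   # times seen in nonspam training messages
  spam_counts = {"meeting": 1, "offer": 30}    # times seen in spam training messages
  x = 0.425175  # "first guess" for unknown tokens, e.g. as reported by bogoutil -R

  def naive_likelihood(token):
      """Naive spam likelihood: spam count over total count; x if never seen."""
      bad = spam_counts.get(token, 0)
      good = good_counts.get(token, 0)
      if bad + good == 0:
          return x                    # unknown token: fall back to the first guess
      return bad / (bad + good)

  print(naive_likelihood("offer"))     # 30/42, about 0.71
  print(naive_likelihood("meeting"))   # 1/26, about 0.04
  print(naive_likelihood("brandnew"))  # 0.425175, the x value

This is the unscaled version of the calculation; the text that follows refines it.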
The value of x that bogoutil calculates is just the average of

  p(w) = badcount / (goodcount + badcount)

for every word in your training set that appears at least 10 times in one or both wordlists (i.e. badcount + goodcount >= 10).

In the last few paragraphs I've oversimplified, for the sake of explaining the basic concept clearly: in real life, you have to scale the counts somehow. If you had exactly the same number of messages contributing to your good and bad wordlists, the formula for p(w) given above would be ok; but we actually have to use

  p(w) = (badcount / badlist_msgcount) / (badcount / badlist_msgcount + goodcount / goodlist_msgcount)

where badlist_msgcount is the number of spam messages that were fed into the training set, and goodlist_msgcount is the number of nonspams used in training. An equivalent way of calculating x, that's a little easier to read, is:

  scalefactor = badlist_msgcount / goodlist_msgcount
  p(w) = badcount / (badcount + goodcount * scalefactor)

In either case, x is the average of those p(w) values.

ROBINSON'S s

With a count of zero for a given token, we have only x to go on, so that's what we use as the likelihood estimate for that token. The question then arises: what if we have seen the token before, but only a few times? Statistical variation will make the ratio of two small numbers rather unreliable, so perhaps we ought to compromise between that ratio and our "first guess" value. This is what the Robinson method does.

The compromise works like this: a parameter s is defined that serves as a weighting factor; the larger the value of s, the more importance is given to x in the presence of low token counts. The token count is

  n = badcount + goodcount

and the p(w) value for the token is modified as follows:

  f(w) = (s * x + n * p(w)) / (s + n)

As you see, if n is zero, f(w) is just x, as we have been saying all along; but if n is nonzero and small, the s * x term will influence the f(w) value. Parameters s and x are only important when the counts are small, and the value of s reflects what we think of as "small." If s is large, then when counts are small we trust our x value more than we do the p(w); if s is small, we give more weight to p(w) and less to x.

So how big should s be? Gary Robinson suggested, on a theoretical basis, that we start with a value of 1; it might be worth trying values in the range of 0.01 to 10. Making s smaller than 0.01 or so is a bad idea, because of what happens when a token is encountered that's been seen in spam before, but never in nonspam, or vice versa. In that case, p(w) is exactly 1 or 0, and f(w) will vary greatly as the value of s diminishes. As an example, one spam that had about ten such tokens, out of 78 that contributed to the spam score, scored 0.999 with s set to 0.001, and was found to score 0.505 when s was 1.0e-8! Choosing a value for s is a matter of trial and error. Experience suggests that 0.1 might be a better starting value than 1, at least if your training database is moderately large.
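The two formulas above are easy to play with. Here is a small Python sketch (the message counts, token counts, s and x below are invented example values, not measurements from any real installation) showing the scaled p(w) and the smoothed f(w), and in particular how s pulls a spam-only token back toward x:

  # Example values only; substitute your own wordlist statistics.
  badlist_msgcount = 6000     # spam messages used in training
  goodlist_msgcount = 9000    # nonspam messages used in training
  x = 0.425175                # first-guess value, as installed by bogoutil -R
  s = 0.1                     # Robinson's s

  def p_w(badcount, goodcount):
      """Scaled per-token spam probability."""
      scalefactor = badlist_msgcount / goodlist_msgcount
      return badcount / (badcount + goodcount * scalefactor)

  def f_w(badcount, goodcount, s=s):
      """Robinson's smoothed estimate: a compromise between p(w) and x."""
      n = badcount + goodcount
      if n == 0:
          return x
      return (s * x + n * p_w(badcount, goodcount)) / (s + n)

  # A token seen twice in spam and never in nonspam:
  print(p_w(2, 0))            # exactly 1.0
  print(f_w(2, 0))            # about 0.97 with s = 0.1
  print(f_w(2, 0, s=1.0))     # about 0.81 with s = 1: pulled much closer to x

Lowering s much further (say to 1.0e-8) drives f(w) for such single-list tokens to essentially 0 or 1, which is what makes very small s values risky, as described above.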
THE MINIMUM DEVIATION

MIN_DEV is another thing you might need to vary. Paul Graham's original method was based on looking only at the fifteen tokens with p(w) values farthest from 0.5 (nearest to 0 or 1). We don't do it that way; instead, we set a minimum deviation from 0.5, and look at all the tokens with f(w) values farther away than that. If the minimum is set to zero, every token in the message contributes its spammishness value f(w) to the final calculation.

It might save time, and perhaps improve discrimination, to ignore tokens with f(w) values near 0.5, since those tokens obviously make less difference to the outcome of the calculation. It seems helpful, at least once the training database is a good size, to use a MIN_DEV value between 0.3 and 0.46. You might try 0.42 or 0.44 initially. Note that higher values of MIN_DEV accentuate the distortion caused by small s, because tokens appearing in only one of the two wordlists will (if present) be a higher proportion of the total number of tokens considered.

THE SPAM AND NONSPAM CUTOFFS

Most spam messages will have scores very close to 1 when the Fisher method of combining the likelihood estimates is applied, and most nonspam will have scores very close to 0. In between, there is a grey area, where messages will fall that have both spammish and nonspammish characteristics. The spam and nonspam cutoffs are thresholds: bogofilter classes messages with scores below the nonspam cutoff as nonspam, and those with scores at or above the spam cutoff as spam. Messages with scores between the two cutoff values are classed as uncertain. (Usually, mail administrators will want to deliver mail classed as uncertain, even though some of it may well be spam.)

The best value for the spam cutoff depends strongly on the values of s and MIN_DEV that are being used in the calculation. For that reason, the way to test a given pair of s and MIN_DEV values is the following:

  1. Determine a suitable level of false positives (nonspams classified as spam); this will probably need to fall somewhere in the range of 0.05 to 0.3 percent (the lower the better, of course, except that lowering the false-positive target too much gives too many false negatives).

  2. Apply the parameters to classify a number of known nonspam messages. From the distribution of scores, pick a spam cutoff value that will give the selected proportion of false positives (a sketch of this step appears at the end of this section).

  3. Apply the parameters and the chosen spam cutoff to classify a number of spam messages; the parameter set should be judged on how few false negatives (spam messages classed as uncertain or nonspam) are obtained.

The value for the nonspam threshold should be such that no more than about one in ten thousand spams is classified as nonspam. A value of 0.2 to 0.25 might be a good starting point for use with the recommended s and MIN_DEV (0.1 and 0.44 respectively).
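As an illustration of steps 2 and 3, here is a rough Python sketch (the score distributions below are synthetic stand-ins generated at random; real scores would come from running bogofilter, with the candidate s and MIN_DEV values, over your own collections of known nonspam and known spam):

  import random

  def pick_spam_cutoff(nonspam_scores, fp_target):
      """Pick a cutoff so that roughly fp_target of known nonspams score at or above it."""
      ranked = sorted(nonspam_scores, reverse=True)
      allowed = int(len(ranked) * fp_target)     # tolerated false positives
      if allowed >= len(ranked):
          return 0.0
      # Cutoff just above the highest score we still want classed as nonspam
      return min(1.0, ranked[allowed] + 1e-9)

  def false_negative_rate(spam_scores, cutoff):
      """Fraction of known spams that would NOT be classed as spam."""
      return sum(1 for sc in spam_scores if sc < cutoff) / len(spam_scores)

  random.seed(1)
  # Synthetic stand-ins for bogofilter scores: nonspam clustered near 0, spam near 1.
  nonspam_scores = [min(1.0, max(0.0, random.gauss(0.05, 0.10))) for _ in range(5000)]
  spam_scores = [min(1.0, max(0.0, random.gauss(0.95, 0.20))) for _ in range(5000)]

  cutoff = pick_spam_cutoff(nonspam_scores, fp_target=0.001)   # aim for 0.1% false positives
  print("spam cutoff:", round(cutoff, 3))
  print("false negatives: %.2f%%" % (100 * false_negative_rate(spam_scores, cutoff)))

Comparing the false-negative figure obtained this way for each candidate pair of s and MIN_DEV values shows which pair discriminates best at the chosen false-positive level.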
HOW OFTEN TO TUNE

It's probably wise to review the spam cutoff frequently; if the false-negative count is gratifyingly low, or if false positives are occurring, increasing the cutoff will reduce the false-positive rate. If, on the other hand, there are absolutely no false positives but the false-negative rate is high, lowering the cutoff a bit may improve discrimination.

How often to tune the other parameters depends on how fussy you are about optimizing performance. If you're eager to get the best discrimination you can, here are my recommendations: In the early stage, while the training database is still small and growing rapidly, it's probably wise to experiment with tuning x, s and MIN_DEV once a month or so. Once the training database is a good size (over 5000 spams and 5000 nonspams), this can be reduced to quarterly or half-yearly.

A NOTE ON TRAINING

Bogofilter's ability to distinguish accurately between spam and nonspam messages depends on the quality of its training database. Here is a way of maximizing that quality with relatively little effort.

When starting afresh, feed every spam and every nonspam you get into bogofilter. Do not use bogofilter's -u option to do this: there will be far too many errors when your training database is small. Instead, classify messages manually and train bogofilter with the -n and -s options appropriately. You can do it in batches: if you work with standard mbox files, use a mail reader to move spam and nonspam into separate files, then do "bogofilter -s