README for bogotune version 0.3
(How to tune bogofilter with minimum effort)

This document describes a script called bogotune that will completely
automate the process of finding good parameters for bogofilter, provided
the user meets a few minimum requirements (see Prerequisites below).

There are six parameters you can tweak that may affect the performance,
accuracy or operating convenience of bogofilter.  They are:

1. The database cache size (performance)
2. Robinson's x parameter (accuracy)
3. The minimum deviation (accuracy)
4. Robinson's s (accuracy)
5. The spam cutoff (accuracy)
6. The nonspam cutoff (convenience)

The bogotune script will use your training database and some spam and
nonspam message files furnished by you to estimate the values you need
to assign to these six parameters.

Prerequisites:

1. You must have a bogofilter training database built from no fewer than
   2,000 spams and 2,000 nonspams (bigger is better), and the ratio of
   spams to nonspams must be between 0.2 and 5 (closer to 1 is better).

2. You must have at least 500 nonspams and 500 spams that have not been
   used in the training database.  This is a minimum, but results will
   be much more reliable (though the run will take longer) if you can
   use several thousand (or even one or two myriad) of each.

3. You must be using bogofilter version 0.13.7.1 or later, with the
   Robinson-Fisher algorithm.  Programs bogofilter, bogoutil and
   bogolexer must all be in your execution path.  If it's version
   0.13.6.3 you're using, you need to apply the patch supplied with
   bogotune; cd to the base of the bogofilter source tree, copy
   0.13.6.3.patch into that directory; then run the command
   "patch -p0 <0.13.6.3.patch" and then the command "make install".

4. You must have perl on your system.  If /usr/bin/perl is not the path
   to a valid perl executable, change the first line of bogotune
   accordingly.

5. You will need formail (supplied with procmail) to create
   message-count files from mbox-format ones.

Installation:

1. Copy bogotune and its helper utility, bogol, to somewhere in your
   execution path.  The bogol script is used by formail to create
   message-count files.

2. If you wish, copy the man pages to the appropriate location
   (typically /usr/man/man1 or /usr/share/man/man1).

Tuning:

Run the command

    bogotune -s /absolute/paths/to/one/or/more/spam/files \
             -n /absolute/paths/to/one/or/more/nonspam/files

The script needs to know where to find your training database.  The
default is ~/.bogofilter, or $BOGOFILTER_DIR if that's defined.
Otherwise, use the absolute path to the directory as the first argument
in the command line, e.g.

    bogotune /var/bogo -s ...

To see details of the results obtained during the scans, put -v in the
command line.  Then, instead of displaying a progress bar, bogotune will
print a line like

    0.0316 0.400 0.605 0.505781 46 274

for each combination of parameters tried, where the numbers are the
values of s, min_dev, x, the spam cutoff and the counts of false
positives and false negatives, respectively.

If you want a specific configuration file to be used by bogofilter
during the scan, put -c /path/to/that/file in the bogotune command line;
if you don't want any configuration file, use -C.

The message files may be in MH, mbox or msg-count format, but they must
all be of the same type; don't mix them.  If they're not in msg-count
format, you'll need enough free disk space so bogotune can create
msg-count files to use in the scans.  (I _think_ bogotune should work
with maildir format as well, but this has not been tested; feedback
would be much appreciated.)

If bogotune aborts, there may be leftover files named btxxx in the
directory from which bogotune ran; the xxx stands for some number of
more or less random digits (actually the pid of the bogotune process).

Running bogotune generally takes quite a while.
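
If you capture the -v output for later inspection, each line can be split
into its six fields in the order given above.  The following Python
sketch is only an illustration: the names ScanResult and parse_scan_line
are hypothetical and not part of bogotune, and it assumes every line
consists of exactly six whitespace-separated numbers, as in the sample
line shown earlier.

```python
from typing import NamedTuple


class ScanResult(NamedTuple):
    """One line of `bogotune -v` output (field names are illustrative)."""
    robs: float         # Robinson's s
    min_dev: float      # minimum deviation
    robx: float         # Robinson's x
    spam_cutoff: float  # spam cutoff used for this combination
    false_pos: int      # count of false positives
    false_neg: int      # count of false negatives


def parse_scan_line(line: str) -> ScanResult:
    """Split one verbose line into named fields (assumes six columns)."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    robs, min_dev, robx, cutoff = (float(f) for f in fields[:4])
    return ScanResult(robs, min_dev, robx, cutoff,
                      int(fields[4]), int(fields[5]))


result = parse_scan_line("0.0316 0.400 0.605 0.505781 46 274")
print(result.false_pos, result.false_neg)  # prints: 46 274
```
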
You will usually see output that ends like this:

    Recommendations:

    ---cut---
    db_cachesize=10
    robx=0.503238
    min_dev=0.040
    robs=0.0178
    spam_cutoff=0.81    # for 0.01% false positives; expect 6.79% false neg.
    #spam_cutoff=0.69   # for 0.05% false positives; expect 3.07% false neg.
    #spam_cutoff=0.611  # for 0.1% false positives; expect 2.81% false neg.
    #spam_cutoff=0.563  # for 0.2% false positives; expect 1.84% false neg.
    ham_cutoff=0.312
    ---cut---

    Tuning complete.

These can be pasted into your bogofilter.cf file (but choose only one
spam_cutoff value).  Normally, up to four possible spam cutoffs are
suggested, as shown above, to give you an opportunity to judge the
tradeoff between false positives and false negatives.  If there are too
few messages in your test files, some of the suggestions may be omitted.
Also, if it's not practicable to get a false-positive rate of 0.2% or
less, only one suggestion will be provided.

The values will, of course, differ from those shown here, depending on
the composition of your training and test database files.  Install the
parameter values, with your chosen spam cutoff, in your bogofilter.cf
file and enjoy.

When setting the cache size, be careful not to use a larger value than
bogotune suggests, unless you can afford to go to three times the
suggested value; in between, serious performance degradation may result.
If you want to use a smaller cache than suggested, that's ok; it will
slow bogofilter down, but not catastrophically (don't set it to zero,
though, if you're using the single-list version of bogofilter; that will
be a bit too slow).

If your current bogofilter parameters are giving near-optimal results,
bogotune may report something like:

    Recommended cache size is 17 Mbytes.
    Calculating false-positive target...
    Very few high-scoring nonspams in this data set.
    Use these settings (only min_dev may have changed):
    robx = 0.600000 (6.00e-01)
    robs = 0.021500 (2.15e-02)
    min_dev = 0.020000 (2.00e-02)
    ham_cutoff = 0.400000 (4.00e-01)
    spam_cutoff = 0.800000 (8.00e-01)

    Tuning aborted.

In that case, you just need to consider setting min_dev and the cache
size.

Usually, bogotune's estimates of false negatives to expect are on the
conservative side; you might find you do a bit better in production.

NOTES

1. When determining parameter values, bogotune usually needs to set a
   higher false-positive target than would be desirable in production.
   Don't panic: it will suggest a more appropriate value when it's
   finished scanning.

2. The quality of your training database and your spam and nonspam
   message files (i.e., their freedom from contamination with wrongly
   classified messages) can make a big difference to the outcome of the
   bogotune run.  Bogotune is not very robust when confronted with
   garbage input.
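
Since a run takes a long time, it may be worth checking the numeric
requirements from the Prerequisites section before starting.  The Python
sketch below does only that arithmetic.  The function check_prerequisites
and its arguments are hypothetical (bogotune does not provide them); you
supply the message counts yourself, and treating the limits 0.2 and 5 as
inclusive is an assumption.

```python
def check_prerequisites(train_spam, train_ham, test_spam, test_ham):
    """Return a list of problems; an empty list means the counts look ok.

    train_*: messages used to build the training database
    test_*:  messages NOT used in training, reserved for the bogotune run
    """
    problems = []

    # Training database: no fewer than 2,000 spams and 2,000 nonspams.
    if train_spam < 2000 or train_ham < 2000:
        problems.append("training database needs at least 2,000 spams "
                        "and 2,000 nonspams")

    # Spam-to-nonspam ratio must lie between 0.2 and 5 (assumed inclusive).
    if train_ham > 0:
        ratio = train_spam / train_ham
        if not 0.2 <= ratio <= 5:
            problems.append(f"spam/nonspam ratio {ratio:.2f} is outside "
                            "the range 0.2 to 5")

    # Test messages: at least 500 of each kind.
    if test_spam < 500 or test_ham < 500:
        problems.append("need at least 500 spams and 500 nonspams that "
                        "were not used for training")

    return problems


for problem in check_prerequisites(1000, 3000, 400, 600):
    print("problem:", problem)
```

With the counts in the usage example, two problems are reported: the
training database is too small, and there are too few test spams.  The
ratio (about 0.33) is within range.
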