This document is intended to answer frequently asked questions about bogofilter.
What is bogofilter?
Bogofilter is a fast Bayesian spam filter along the lines suggested by Paul Graham in his article A Plan For Spam. Bogofilter uses Gary Robinson's geometric-mean algorithm with the Fisher's method modification to classify email as spam or non-spam.
The bogofilter home page at SourceForge is the central clearinghouse for bogofilter resources.
Bogofilter was started by Eric S. Raymond on August 19, 2002. It gained popularity in September 2002, and a number of other authors have started to contribute to the project.
The NEWS file describes bogofilter's version history.
There are currently four mailing lists for bogofilter:
What does bogofilter's verbose output mean?
Bogofilter can be instructed to display information on the scoring of a message by running it with flags "-v", "-vv", "-vvv", or "-R".
How can I use SpamAssassin to train Bogofilter?
If you have a working SpamAssassin installation (or care to create one), you can use its return codes to train bogofilter. The easiest way is to create a script for your MDA that runs SpamAssassin, tests the spam/non-spam return code, and runs bogofilter to register the message as spam (or non-spam). The sample procmail recipe below shows one way to do this:
BOGOFILTER     = "/usr/bin/bogofilter"
BOGOFILTER_DIR = "training"
SPAMASSASSIN   = "/usr/bin/spamassassin"

:0 HBc
* ? $SPAMASSASSIN -e
#spam yields non-zero
#non-spam yields zero
| $BOGOFILTER -n -d $BOGOFILTER_DIR
#else (E)
:0Ec
| $BOGOFILTER -s -d $BOGOFILTER_DIR

:0fw
| $BOGOFILTER -p -e

:0:
* ^X-Bogosity:.Yes
spam

:0:
* ^X-Bogosity:.No
non-spam
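The same train-on-SpamAssassin's-verdict logic can also be sketched outside procmail. The Python below is only an illustration and not part of bogofilter: the function names are hypothetical, the paths and the "training" directory are copied from the recipe, and it assumes, as the recipe's comments state, that "spamassassin -e" exits non-zero for spam.

```python
import subprocess

# Paths and word list directory copied from the procmail recipe (adjust locally).
BOGOFILTER = "/usr/bin/bogofilter"
BOGOFILTER_DIR = "training"
SPAMASSASSIN = "/usr/bin/spamassassin"


def training_args(sa_exit_code, wordlist_dir=BOGOFILTER_DIR):
    """Build the bogofilter command that registers a message.

    A zero exit code from "spamassassin -e" means non-spam (-n);
    anything else is treated as spam (-s).
    """
    flag = "-n" if sa_exit_code == 0 else "-s"
    return [BOGOFILTER, flag, "-d", wordlist_dir]


def train_from_spamassassin(message_bytes):
    """Score one message with SpamAssassin, then register it with bogofilter."""
    verdict = subprocess.run([SPAMASSASSIN, "-e"], input=message_bytes,
                             stdout=subprocess.DEVNULL)
    subprocess.run(training_args(verdict.returncode), input=message_bytes,
                   check=True)
```

Only train_from_spamassassin() touches the installed programs; training_args() can be checked on its own, for example training_args(0) selects "-n".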
What can I do about Asian spam?
Many people receive unsolicited email in Asian-language character sets. Since they can't read the languages and don't know anyone there, they assume it's spam.
The good news is that bogofilter does detect them quite successfully. The bad news is that this can be expensive. You have basically two choices:
How do I manually query the database?
To find the spam and ham counts for a token (word) use bogoutil's '-w' option. For example, "bogoutil -w $BOGOFILTER_DIR example.com" gives the good and bad counts for "example.com".
If you want the spam score in addition to the spam and ham counts for a token (word), use bogoutil's '-p' option. For example, "bogoutil -p $BOGOFILTER_DIR example.com" gives the spam score along with the good and bad counts for "example.com".
To find out how many messages are in your wordlists query the special token .MSG_COUNT, i.e., run command "bogoutil -w $BOGOFILTER_DIR .MSG_COUNT" to see the counts for the spam and ham word lists.
To find out how many tokens are in your wordlists, pipe the output of bogoutil's dump command to "wc", i.e., use "bogoutil -d $BOGOFILTER_DIR/spamlist.db | wc -l" to display the count for the spamlist and "bogoutil -d $BOGOFILTER_DIR/goodlist.db | wc -l" to display the count for the goodlist.
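If you would rather drive these queries from a script, the commands above can be wrapped roughly as follows. This is a minimal Python sketch, not something shipped with bogofilter: the helper names are hypothetical, it assumes bogoutil is on your PATH, and it does not try to parse bogoutil's output.

```python
import subprocess


def query_argv(option, wordlist_dir, token):
    """Command line for "bogoutil -w" or "bogoutil -p" on one token."""
    if option not in ("-w", "-p"):
        raise ValueError("expected -w or -p")
    return ["bogoutil", option, wordlist_dir, token]


def count_tokens(dump_text):
    """Count newline-terminated lines, like piping the dump to "wc -l"."""
    return dump_text.count("\n")


def dump_wordlist(db_path):
    """Run "bogoutil -d" on one word list file and return its output as text."""
    result = subprocess.run(["bogoutil", "-d", db_path], capture_output=True,
                            encoding="utf-8", errors="replace", check=True)
    return result.stdout
```

For example, count_tokens(dump_wordlist("spamlist.db")) would give the same number as the "wc -l" pipeline, and query_argv("-w", wordlist_dir, ".MSG_COUNT") builds the message-count query.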
How can I tell if my word lists are corrupted?
If you think your word lists are hosed, you can see what BerkeleyDB thinks by running:
db_verify spamlist.db
db_verify goodlist.db
If there is a problem, you may be able to recover some (or all) of the tokens and their counts with the following commands:
bogoutil -d spamlist.db | bogoutil -l spamlist.db.new
db_dump -r spamlist.db > spamlist.txt
db_load spamlist.new < spamlist.txt
How do I get bogofilter working on Solaris, BSD, etc?
$ cd build_unix
$ sh ../dist/configure
$ make
# make install
Next, download a portable version of bogofilter.
Unpack it, and then do:
$ ./configure --with-db=/usr/local/BerkeleyDB.4.1
$ make
# make install-strip
You will either want to put a symlink to libdb.so in /usr/lib, or use a modified LD_LIBRARY_PATH environment variable before you start bogofilter.
Note that some make versions shipped with Solaris break when you try to build bogofilter outside of its source directory. Either build in the source directory (as suggested above) or use GNU make (gmake).
The FreeBSD ports and packages carry very recent versions of bogofilter. This approach uses the highly recommended portupgrade and cvsup software packages. To install these two fine pieces, type (you need to do this only once):
# pkg_add -r portupgrade cvsup
To install or upgrade bogofilter, just update your ports tree using cvsup, then type:
# portupgrade -N bogofilter
See the file doc/programmer/README.hp-ux in the source distribution.
Can I share word lists over NFS?
If you're just reading from them, there are no problems. When you're updating them, you need to use the correct file locking to avoid data corruption. When you compile bogofilter, verify that the configure script has set "#define HAVE_FCNTL 1" in your config.h file. Popular UNIX operating systems all support this. If you are running an unusual operating system, or an older version of one, make sure it supports fcntl(). If "#define HAVE_FCNTL 1" is set, then comment out "#define HAVE_FLOCK 1" so that the locking system uses fcntl() locking instead of the default flock() locking. If your system does not support fcntl(), then you will not be able to share word list files over NFS without the risk of data corruption.
Next, make sure you have NFS set up properly, with "lockd" running. Refer to your NFS documentation for more information about running "lockd" or "rpc.lockd". Most operating systems with NFS turn this on by default.
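The config.h check described above is easy to automate. The following Python sketch is purely illustrative: the function names and the wording of the returned messages are hypothetical, and it assumes the two macros appear in config.h as plain "#define NAME 1" lines when enabled.

```python
import re

# Matches an active "#define NAME 1" line. A line wrapped in /* ... */
# does not start with "#", so a commented-out define is not counted.
_DEFINE_1 = re.compile(r"^[ \t]*#[ \t]*define[ \t]+(\w+)[ \t]+1[ \t]*$",
                       re.MULTILINE)


def defined_macros(config_text):
    """Return the names of all macros actively defined to 1."""
    return set(_DEFINE_1.findall(config_text))


def nfs_locking_report(config_text):
    """Summarise whether this build is set up for fcntl() locking."""
    macros = defined_macros(config_text)
    if "HAVE_FCNTL" not in macros:
        return "no fcntl(): sharing word lists over NFS risks corruption"
    if "HAVE_FLOCK" in macros:
        return "comment out HAVE_FLOCK so that fcntl() locking is used"
    return "fcntl() locking will be used"
```

A typical use would be print(nfs_locking_report(open("config.h").read())) in the build directory after running configure.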
Why does bogofilter give return codes like 0 and 256 when it's run from inside a program?
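Most likely you are seeing the raw status value filled in by waitpid(2) rather than the exit code itself. The exit code is stored in the high-order byte of that value, so an exit code of 1 shows up as 256. Use WEXITSTATUS(status) from sys/wait.h, or a comparable macro, to get the correct value.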
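The decoding can be illustrated with a short Python sketch on a Unix-like system, where os.system() returns the same wait()-style status. The helper name decode_wait_status is hypothetical, and a shell command that exits with status 1 stands in for bogofilter so the example runs without it installed.

```python
import os


def decode_wait_status(status):
    """Return the child's exit code, or None if it did not exit normally."""
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return None


# Stand-in child process: the shell exits with code 1.
raw_status = os.system("exit 1")

# raw_status is the encoded wait status; decoding recovers the exit code.
exit_code = decode_wait_status(raw_status)
print("raw status:", raw_status, "exit code:", exit_code)
```

In C the equivalent is to test WIFEXITED(status) and then read WEXITSTATUS(status) on the value filled in by waitpid().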
Now that I've upgraded to 0.11 why are my scripts broken?
With version 0.11 bogofilter's options for registering mail as ham or spam have been changed. They now allow registering (or unregistering) messages in the ham and spam word lists. Prior to this, there was no way to unregister a message from a word list (without registering it in the other word list).
Bogofilter has four registration options - '-s', '-n', '-S', and '-N'. With the release of version 0.11 the meaning of '-S' and '-N' has been changed to allow unregistering messages from the word lists. Here's what the four options mean:
Prior to version 0.11, the '-S' option was used to move a message from the ham word list to the spam word list, i.e. there were two actions. Now with 0.11 each of the two actions is invoked by its own option. To get the same effect as the old '-S', you should use '-N -s' (or '-Ns' which means the same thing).
Similarly, the old '-N' option is now '-Sn' (or '-S -n').
MDA scripts typically use '-s' and '-n' and don't need to change. Other scripts which use '-S' and '-N' for fixing registration errors do need to be changed.
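For scripts that build their bogofilter command line programmatically, the translation described above can be sketched as a small lookup. This Python snippet is an illustration only: translate_args is a hypothetical helper, and it assumes '-S' and '-N' are passed as separate arguments rather than bundled with other flags.

```python
# Pre-0.11 option -> equivalent 0.11 options, as described in this answer.
OLD_TO_NEW = {
    "-S": ["-N", "-s"],  # old: move a message from the ham list to the spam list
    "-N": ["-S", "-n"],  # old: move a message from the spam list to the ham list
}


def translate_args(old_args):
    """Rewrite a pre-0.11 argument list so it behaves the same under 0.11.

    Arguments other than a standalone -S or -N pass through unchanged.
    """
    new_args = []
    for arg in old_args:
        new_args.extend(OLD_TO_NEW.get(arg, [arg]))
    return new_args
```

For instance, translate_args(["bogofilter", "-S"]) yields ["bogofilter", "-N", "-s"], while a command using only '-s' or '-n' is returned as it was.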