Bogofilter FAQ

Official Versions: In English or French
Maintainer: David Relson <relson@osagesoftware.com>

This document is intended to answer frequently asked questions about bogofilter.


What is bogofilter?

Bogofilter is a fast Bayesian spam filter along the lines suggested by Paul Graham in his article A Plan For Spam. Bogofilter uses Gary Robinson's geometric-mean algorithm with the Fisher's method modification to classify email as spam or non-spam.

The bogofilter home page at SourceForge is the central clearinghouse for bogofilter resources.

Bogofilter was started by Eric S. Raymond on August 19, 2002. It gained popularity in September 2002, and a number of other authors have started to contribute to the project.

The NEWS file describes bogofilter's version history.


Bogo-what?

Bogofilter is some kind of a bogometer or bogon filter, i.e., it tries to identify bogus mail by measuring the bogosity.


Mailing Lists

There are currently four mailing lists for bogofilter:

List Address                          Links                  Description
bogofilter-announce@aotto.com         [subscribe] [archive]  An announcement-only list where new versions are announced.
bogofilter@aotto.com                  [subscribe] [archive]  A discussion list where any conversation about bogofilter may take place.
bogofilter-dev@aotto.com              [subscribe] [archive]  A list for sharing patches, development, and technical discussions.
bogofilter-cvs@lists.sourceforge.net  [subscribe] [archive]  Mailing list for announcing code changes to the CVS archive.

What does bogofilter's verbose output mean?

Bogofilter can be instructed to display information on the scoring of a message by running it with the flags "-v", "-vv", "-vvv", or "-R".

  • Using "-v" causes bogofilter to generate the "X-Bogosity:" header line, e.g.
      X-Bogosity: No, tests=bogofilter, spamicity=0.500000
     
    
  • Using "-vv" causes bogofilter to generate a histogram, e.g.
      X-Bogosity: No, tests=bogofilter, spamicity=0.500000
              int  cnt    prob   spamicity  histogram
             0.00   29  0.000209  0.000052  #############################
             0.10    2  0.179065  0.003425  ##
             0.20    2  0.276880  0.008870  ##
             0.30   18  0.363295  0.069245  ##################
             0.40    0  0.000000  0.069245  
             0.50    0  0.000000  0.069245  
             0.60   37  0.667823  0.257307  #####################################
             0.70    5  0.767436  0.278892  #####
             0.80   13  0.836789  0.334980  #############
             0.90   32  0.984903  0.499835  ################################
     
    

    Each row shows an interval, the count of tokens with scores in that interval, the average spam probability for those tokens, the message's spamicity score (for those tokens and all lesser valued tokens), and a bar graph corresponding to the token count.

    In the above histogram there are a lot of low-scoring tokens and a lot of high-scoring tokens. They "balance" one another to give the spamicity score of 0.5000.

  • Using "-vvv" produces a list of all the tokens in the message with information on each one, e.g.
      X-Bogosity: No, tests=bogofilter, spamicity=0.500000
                            n    pgood     pbad      fw     U
      "which"              10  0.208333  0.000000  0.000041 +
      "own"                 7  0.145833  0.000000  0.000059 +
      "having"              6  0.125000  0.000000  0.000069 +
      ...
      "unsubscribe.asp"     2  0.000000  0.095238  0.999708 +
      "million"             4  0.000000  0.190476  0.999854 +
      "copy"                5  0.000000  0.238095  0.999883 +
      N_P_Q_S_s_x_md      138  0.00e+00  0.00e+00  5.00e-01
                               1.00e-03  4.15e-01  0.100
     
    
    The columns printed contain the following information:
    "…"
    the token in question
    n
    number of times this token was encountered in training
    pgood
    proportion of good messages that contained this token
    pbad
    proportion of spam messages that contained this token
    fw
    Robinson's weighted index, which combines pgood and pbad to give a value that will be close to zero if a message containing this token is likely to be non-spam and close to one if it's likely to be spam
    U
    '+' if this token contributes to the final bogosity value, '-' otherwise. A token is excluded when its score is closer to 0.5 than min_dev.

    The final lines show:

    • The cumulative results of the columns
    • The values of Robinson's s and x parameters and of min_dev
  • Using "-R" produces the "-vvv" output described above plus two additional columns:
    invfwlog
    logarithm of fw
    fwlog
    logarithm of (1-fw)

    The "-R" output is formatted for use with the R language for statistical computing. More information is available at The R Project for Statistical Computing.
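
As a rough illustration of the fields described above, the following shell sketch pulls the spamicity value out of an "X-Bogosity:" line and applies the min_dev rule to a single fw value. The function names spamicity_of and token_flag are hypothetical helpers, not part of bogofilter, and the handling of a score that lies exactly min_dev away from 0.5 is an assumption.

```shell
#!/bin/sh
# Hypothetical helpers for reading bogofilter's verbose output.

# Print the spamicity value found in an X-Bogosity header read from stdin.
spamicity_of() {
    sed -n 's/^X-Bogosity:.*spamicity=\([0-9.]*\).*/\1/p'
}

# Print '+' if a token's fw value is at least min_dev away from 0.5,
# '-' otherwise. Usage: token_flag FW MIN_DEV
token_flag() {
    awk -v fw="$1" -v min_dev="$2" 'BEGIN {
        dev = fw - 0.5
        if (dev < 0) dev = -dev
        flag = (dev >= min_dev + 0) ? "+" : "-"
        print flag
    }'
}

echo 'X-Bogosity: No, tests=bogofilter, spamicity=0.500000' | spamicity_of
token_flag 0.999708 0.100
token_flag 0.520000 0.100
```

The first token_flag call uses the fw value of "unsubscribe.asp" and the min_dev of 0.100 from the sample output; the second uses a made-up fw value that is too close to 0.5 to count.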

How can I use SpamAssassin to train Bogofilter?

If you have a working SpamAssassin installation (or care to create one), you can use its return codes to train bogofilter. The easiest way is to create a script for your MDA that runs SpamAssassin, tests the spam/non-spam return code, and runs bogofilter to register the message as spam (or non-spam). The sample procmail recipe below shows one way to do this:

  BOGOFILTER     = "/usr/bin/bogofilter"
  BOGOFILTER_DIR = "training"
  SPAMASSASSIN  = "/usr/bin/spamassassin"

  :0 HBc
  * ? $SPAMASSASSIN -e
  #spam yields non-zero
  #non-spam yields zero
  | $BOGOFILTER -n -d $BOGOFILTER_DIR
  #else (E)
  :0Ec
  | $BOGOFILTER -s -d $BOGOFILTER_DIR

  :0fw
  | $BOGOFILTER -p -e

  :0:
  * ^X-Bogosity:.Yes
  spam

  :0:
  * ^X-Bogosity:.No
  non-spam
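
For MDAs other than procmail, the same training decision can be written as a small shell function. This is a minimal sketch rather than a tested delivery script: the function name train_with_spamassassin is hypothetical, the default paths are the ones used in the recipe above, and it assumes, as the recipe does, that "spamassassin -e" exits non-zero for spam.

```shell
#!/bin/sh
# Paths follow the procmail recipe; override them in the environment if needed.
BOGOFILTER=${BOGOFILTER:-/usr/bin/bogofilter}
BOGOFILTER_DIR=${BOGOFILTER_DIR:-training}
SPAMASSASSIN=${SPAMASSASSIN:-/usr/bin/spamassassin}

# Register the message stored in file $1 according to SpamAssassin's verdict.
train_with_spamassassin() {
    msg=$1
    if "$SPAMASSASSIN" -e < "$msg" > /dev/null; then
        # Exit code zero: SpamAssassin considers the message non-spam.
        "$BOGOFILTER" -n -d "$BOGOFILTER_DIR" < "$msg"
    else
        # Non-zero exit code: SpamAssassin considers the message spam.
        "$BOGOFILTER" -s -d "$BOGOFILTER_DIR" < "$msg"
    fi
}
```

Note that any non-zero exit, including a failure to start SpamAssassin at all, ends up in the spam branch, so a real script would want to check that the program exists first.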

What can I do about Asian spam?

Many people get unsolicited email using Asian-language charsets. Since they don't know the languages and don't know people there, they assume it's spam.

The good news is that bogofilter does detect them quite successfully. The bad news is that this can be expensive. You have basically two choices:

  • You can simply let bogofilter handle it. Just train bogofilter with the Asian-language messages identified as spam. Bogofilter will parse the messages as best it can and will add tokens to the spam wordlist. The wordlist will contain many tokens which don't make sense to you (since the charset cannot be displayed), but bogofilter can work with them and successfully identify Asian spam.

    A second method is to use the "replace_nonascii_characters" config file option. This will replace high-bit characters, i.e. those between 0x80 and 0xFF, with question marks, '?'. This keeps the database much smaller. Unfortunately this conflicts with European languages, which have many accented vowels and consonants in the high-bit range.

  • If you are sure you will not receive any legitimate messages in those languages, you can kill them right away. This will keep the database smaller. You can do this with an MDA script.

    Here's a procmail recipe that will sideline messages written with Asian charsets:

    ## Sideline all Asian-language mail into the spam-unreadable folder
    UNREADABLE='[^?"]*big5|iso-2022-jp|ISO-2022-KR|euc-kr|gb2312|ks_c_5601-1987'
    :0:
    * 1^0 $ ^Subject:.*=\?($UNREADABLE)
    * 1^0 $ ^Content-Type:.*charset="?($UNREADABLE)
    spam-unreadable
    
    :0:
    * ^Content-Type:.*multipart
    * B ?? $ ^Content-Type:.*^?.*charset="?($UNREADABLE)
    spam-unreadable
    

    With the above recipe, bogofilter will never see the message.
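
To see why "replace_nonascii_characters" conflicts with accented text, the substitution described in the first choice can be imitated in the shell. This only mimics the replacement with tr; it is not bogofilter's own code, and the function name replace_high_bit is made up for the example.

```shell
#!/bin/sh
# Imitate the substitution: every byte from 0x80 to 0xFF becomes '?'.
replace_high_bit() {
    LC_ALL=C tr '\200-\377' '?'
}

# In ISO-8859-1, 0xE9 (e-acute) and 0xE8 (e-grave) are distinct letters,
# but after the substitution both words become the same token.
printf 'caf\351\n' | replace_high_bit
printf 'caf\350\n' | replace_high_bit
```

Both commands print "caf?", which shows how distinct accented words collapse into one token once the high-bit characters are replaced.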


How do I manually query the database?

To find the spam and ham counts for a token (word) use bogoutil's '-w' option. For example, "bogoutil -w $BOGOFILTER_DIR example.com" gives the good and bad counts for "example.com".

If you want the spam score in addition to the spam and ham counts for a token (word), use bogoutil's '-p' option. For example, "bogoutil -p $BOGOFILTER_DIR example.com" gives the good and bad counts and the spam score for "example.com".

To find out how many messages are in your wordlists, query the special token .MSG_COUNT, i.e., run the command "bogoutil -w $BOGOFILTER_DIR .MSG_COUNT" to see the counts for the spam and ham word lists.

To tell how many tokens are in your wordlists, pipe the output of bogoutil's dump command to the command "wc", i.e. use "bogoutil -d $BOGOFILTER_DIR/spamlist.db | wc -l" to display the count for the spamlist and use "bogoutil -d $BOGOFILTER_DIR/goodlist.db | wc -l" to display the count for the goodlist.
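
The two token counts can be wrapped in a small helper so both lists are reported at once. This is only a sketch built on the "bogoutil -d" command shown above; the function names count_tokens and report_wordlists are hypothetical, and the BOGOUTIL variable is an assumption added so the program path can be overridden.

```shell
#!/bin/sh
# Name of the bogoutil program; override it in the environment if needed.
BOGOUTIL=${BOGOUTIL:-bogoutil}

# Print the number of lines that "bogoutil -d" dumps for one word list file.
count_tokens() {
    "$BOGOUTIL" -d "$1" | wc -l | tr -d ' '
}

# Print one "name count" line for each of the two word lists in directory $1.
report_wordlists() {
    for list in spamlist.db goodlist.db; do
        echo "$list $(count_tokens "$1/$list")"
    done
}
```

For example, "report_wordlists $BOGOFILTER_DIR" prints one line for spamlist.db and one for goodlist.db.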


How can I tell if my word lists are corrupted?

If you think your word lists are hosed, you can see what BerkeleyDB thinks by running:

db_verify spamlist.db
db_verify goodlist.db

If there is a problem, you may be able to recover some (or all) of the tokens and their counts with the following commands:

bogoutil -d spamlist.db | bogoutil -l spamlist.db.new

or with

db_dump -r spamlist.db > spamlist.txt
db_load spamlist.new < spamlist.txt
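
The check and the first recovery method can be combined into one step. The sketch below assumes db_verify exits non-zero when it finds a problem; the function name check_wordlist and the BOGOUTIL and DB_VERIFY variables are hypothetical conveniences, not part of bogofilter or BerkeleyDB.

```shell
#!/bin/sh
# Program names; override them in the environment if needed.
BOGOUTIL=${BOGOUTIL:-bogoutil}
DB_VERIFY=${DB_VERIFY:-db_verify}

# Verify word list $1; if verification fails, dump whatever can still be
# read into "$1.new". Returns 0 if the list verified cleanly, 1 otherwise.
check_wordlist() {
    db=$1
    if "$DB_VERIFY" "$db"; then
        echo "$db: ok"
        return 0
    fi
    echo "$db: verification failed, writing $db.new"
    "$BOGOUTIL" -d "$db" | "$BOGOUTIL" -l "$db.new"
    return 1
}
```

Running "check_wordlist spamlist.db" leaves the original file untouched, so the recovered copy can be inspected before it replaces the damaged list.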

How do I get bogofilter working on Solaris, BSD, etc?

If you don't already have a v3.0+ version of BerkeleyDB, then download it, unpack it, and do these commands in the db directory:

$ cd build_unix
$ sh ../dist/configure
$ make
# make install

Next, download a portable version of bogofilter.

On Solaris

Unpack it, and then do:

$ ./configure --with-db=/usr/local/BerkeleyDB.4.1
$ make
# make install-strip

You will either want to put a symlink to libdb.so in /usr/lib, or set a modified LD_LIBRARY_PATH environment variable before you start bogofilter.

$ LD_LIBRARY_PATH=/usr/lib:/usr/local/lib:/usr/local/BerkeleyDB.4.1/lib
$ export LD_LIBRARY_PATH

Note that some make versions shipped with Solaris break when you try to build bogofilter outside of its source directory. Either build in the source directory (as suggested above) or use GNU make (gmake).

On FreeBSD

The FreeBSD ports and packages carry very recent versions of bogofilter. This approach uses the highly recommended portupgrade and cvsup software packages. To install these two fine pieces, type (you need to do this only once):

# pkg_add -r portupgrade cvsup
 

To install or upgrade bogofilter, just upgrade your ports tree using cvsup, then type:

# portupgrade -N bogofilter
 

On HP-UX

See the file doc/programmer/README.hp-ux in the source distribution.


Can I share word lists over NFS?

If you're only reading from them, there are no problems. When you're updating them, you need to use the correct file locking to avoid data corruption. When you compile bogofilter, verify that the configure script has set "#define HAVE_FCNTL 1" in your config.h file. All popular UNIX operating systems support this. If you are running an unusual or older operating system, make sure it supports fcntl(). If "#define HAVE_FCNTL 1" is set, then comment out "#define HAVE_FLOCK 1" so that the locking system uses fcntl() locking instead of the default flock() locking. If your system does not support fcntl(), then you will not be able to share word list files over NFS without risking data corruption.

Next, make sure you have NFS set up properly, with "lockd" running. Refer to your NFS documentation for more information about running "lockd" or "rpc.lockd". Most operating systems with NFS turn this on by default.
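
The config.h check can be scripted before building. The following sketch assumes the two defines appear exactly as quoted above, at the start of a line; the function name uses_fcntl_locking is hypothetical.

```shell
#!/bin/sh
# Succeed only if config.h ($1) enables fcntl() locking and leaves
# flock() locking disabled (commented out or absent).
uses_fcntl_locking() {
    grep -q '^#define HAVE_FCNTL 1' "$1" &&
        ! grep -q '^#define HAVE_FLOCK 1' "$1"
}
```

For example, "uses_fcntl_locking config.h || echo 'edit config.h before sharing over NFS'" prints the reminder when HAVE_FLOCK is still active or HAVE_FCNTL is missing.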


Why does bogofilter give return codes like 0 and 256 when it's run from inside a program?

Most likely the return codes are being reformatted by waitpid(2), which packs the exit code into a larger status word. Use WEXITSTATUS(status) from sys/wait.h, or a comparable macro, to get the correct value.
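
The arithmetic behind a value like 256 can be reproduced in the shell. This sketch assumes the traditional status layout used by most Unix systems (exit code in bits 8-15, signal number in bits 0-6), which is not guaranteed everywhere; C programs should use the macros rather than shifting by hand. The function name decode_wait_status is hypothetical.

```shell
#!/bin/sh
# Decode a raw wait status, assuming the traditional layout:
# exit code in bits 8-15, terminating signal number in bits 0-6.
decode_wait_status() {
    status=$1
    if [ $(( status & 127 )) -eq 0 ]; then
        # Normal exit: the exit code is the next byte up.
        echo $(( (status >> 8) & 255 ))
    else
        echo "killed by signal $(( status & 127 ))"
    fi
}

decode_wait_status 0      # prints 0
decode_wait_status 256    # prints 1
```

Under that layout, a raw status of 256 corresponds to an exit code of 1, and 0 stays 0.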


Now that I've upgraded to 0.11 why are my scripts broken?

With version 0.11 bogofilter's options for registering mail as ham or spam have been changed. They now allow registering (or unregistering) messages in the ham and spam word lists. Prior to this, there was no way to unregister a message from a word list (without registering it in the other word list).

Bogofilter has four registration options - '-s', '-n', '-S', and '-N'. With the release of version 0.11 the meaning of '-S' and '-N' has been changed to allow unregistering messages from the word lists. Here's what the four options mean:

-s
means to register the current message as spam, i.e. add its tokens to spamlist.db
-n
means to register the current message as ham, i.e. add its tokens to goodlist.db
-S
means to unregister the current message from the spam word list, i.e. remove its tokens from spamlist.db
-N
means to unregister the current message from the ham word list, i.e. remove its tokens from goodlist.db

Prior to version 0.11, the '-S' option was used to move a message from the ham word list to the spam word list, i.e. there were two actions. Now with 0.11 each of the two actions is invoked by its own option. To get the same effect as the old '-S', you should use '-N -s' (or '-Ns' which means the same thing).

Similarly, the old '-N' option is now '-Sn' (or '-S -n').

MDA scripts typically use '-s' and '-n' and don't need to change. Other scripts which use '-S' and '-N' for fixing registration errors do need to be changed.
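
When updating older scripts, the mapping described above can be captured in a small lookup. The function name new_flags is hypothetical; it only restates the old-to-new option translation given in this answer.

```shell
#!/bin/sh
# Translate a pre-0.11 registration option into its 0.11 equivalent.
new_flags() {
    case $1 in
        -S) printf '%s\n' "-N -s" ;;  # old -S: move from ham list to spam list
        -N) printf '%s\n' "-S -n" ;;  # old -N: move from spam list to ham list
        *)  printf '%s\n' "$1" ;;     # -s, -n and other options are unchanged
    esac
}

new_flags -S
new_flags -n
```

The first call prints "-N -s" and the second prints "-n", matching the rule that only '-S' and '-N' need to change.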