Manpage of BOGOFILTER

BOGOFILTER

Section: User Commands (1)

NAME

bogofilter - fast Bayesian spam filter  

SYNOPSIS

bogofilter [help options | classification options | registration
           options] [algorithm options] [general options]

where

help options are:

 [-h] [-V] [-Q]

classification options are:

 [-p] [-e] [-t] [-u] [-2] [-3] [-M] [-b] [-B filename ...] [-F] [-R] [algorithm
 options] [general options] [parameter options]

registration options are:

 [-s | -n] [-S | -N] [algorithm options] [general options]

general options are:

 [-c filename] [-C] [-d dir] [-k size] [-l] [-L tag] [-I filename] [-O filename]

algorithm options are:

 [-g | -r | -f]

parsing options are:

 [-Ph/-PH] [-Pi/-PI] [-Pt/-PT]

parameter options are:

 [-m [value] [,value][,value]] [-o [value] [,value]]

info options are:

 [-q] [-v] [-y date] [-D] [-x flags]

 

DESCRIPTION

Bogofilter is a Bayesian spam filter. In its normal mode of operation, it takes an email message or other text on standard input, does a statistical check against lists of "good" and "bad" words, and returns a status code indicating whether or not the message is spam. Bogofilter is designed with fast algorithms, uses the Berkeley DB for fast startup and lookups, is coded directly in C, and is tuned for speed, so it can be used in production by sites that process a lot of mail.

 

THEORY OF OPERATION

Bogofilter treats its input as a bag of tokens. Each token is checked against "good" and "bad" wordlists, which maintain counts of the number of times it has occurred in non-spam and spam mails. These numbers are used to compute the probability that a mail in which the token occurs is spam. After probabilities for all input tokens have been computed, a fixed number of the probabilities that deviate furthest from average are combined using Bayes's theorem on conditional probabilities. If the computed probability that the input is spam exceeds a cutoff determined at compile time (currently 0.95, for the Robinson-Fisher algorithm), bogofilter returns 0, otherwise 1.
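The scoring described above can be sketched in a few lines of Python. This is a simplified illustration, not bogofilter's C code: the function names, the clamping range 0.01..0.99, the neutral value 0.5 for unseen tokens, and the default of 15 retained tokens are all assumptions made for the example.

```python
from math import prod

def token_spam_probability(good_count, bad_count, good_msgs, bad_msgs):
    """Estimate P(spam | token) from the token's relative frequency in each wordlist."""
    good_freq = good_count / good_msgs
    bad_freq = bad_count / bad_msgs
    if good_freq + bad_freq == 0:
        return 0.5  # unseen token: treated as neutral (assumption of this sketch)
    p = bad_freq / (good_freq + bad_freq)
    # Clamp so that a single token can never force the result to exactly 0 or 1.
    return min(0.99, max(0.01, p))

def combine(probabilities, keep=15):
    """Combine the `keep` probabilities furthest from 0.5 using Bayes's theorem."""
    extreme = sorted(probabilities, key=lambda p: abs(p - 0.5), reverse=True)[:keep]
    spam = prod(extreme)
    ham = prod(1.0 - p for p in extreme)
    return spam / (spam + ham)
```

A caller would compare the result of combine() with the cutoff, for example treating the message as spam when combine(probs) > 0.95.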

While this method sounds crude compared to the more usual pattern-matching approach, it turns out to be extremely effective. Paul Graham's paper A Plan For Spam: http://www.paulgraham.com/spam.html is recommended reading.

This program substantially improves on Paul's proposal by doing smarter lexical analysis. In particular, hostnames and IP addresses are retained as recognition features rather than broken up. Various kinds of MTA cruft such as dates and message-IDs are discarded so as not to bloat the word lists. Lex's Swiss-army-knife nature rises again.

Another seeming improvement is that this program offers Gary Robinson's suggested modifications (S and f(w) but not g(w)) to the calculations. These modifications are described in Robinson's paper Spam Detection: http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html.
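Robinson's f(w) blends the observed probability for a token with a prior, weighted by how much data exists for that token. The sketch below assumes the form given in Robinson's paper, f(w) = (s*x + n*p(w)) / (s + n). The function name and the default values for s and x are illustrative placeholders, not bogofilter's defaults.

```python
def robinson_f(p_w, n, s=1.0, x=0.5):
    """Robinson's f(w): shrink the raw probability p_w toward the prior x.

    p_w -- raw spam probability computed for the token
    n   -- number of times the token has been seen in training
    s   -- strength given to the prior (placeholder default)
    x   -- probability assumed for a token never seen before (placeholder default)
    """
    return (s * x + n * p_w) / (s + n)
```

With n = 0 the result is simply x, and as n grows the result approaches p_w, so rarely seen tokens cannot dominate the score.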

Since then, Robinson and others have realized that the S calculation can be further optimized: if a vector of length n contains random, uniformly distributed probabilities p, then -2 * sum(ln(p)) is distributed as chi-squared with 2n degrees of freedom. This is believed to be the most sensitive test of the hypothesis that the vector of probabilities is, in fact, uniformly distributed. Bogofilter now offers the option of applying this test (known as Fisher's method) to yield P(spam) and P(not spam), and using the difference as the "spamicity" score.
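Fisher's method as described above can be tried out with a short Python sketch. It relies on the closed form of the chi-squared survival function for an even number of degrees of freedom. The function names are hypothetical, and rescaling the difference into the range 0..1 with (1 + difference) / 2 is an assumption of the sketch rather than something stated in this page.

```python
from math import exp, log

def chi2_sf_even(x, dof):
    """P(X >= x) for a chi-squared variable with an even number of degrees of freedom."""
    if dof <= 0 or dof % 2:
        raise ValueError("dof must be a positive even integer")
    m = x / 2.0
    term = total = exp(-m)
    # For dof = 2n the survival function is exp(-m) * sum_{i<n} m**i / i!
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def fisher_spamicity(probs):
    """Score token probabilities (each strictly between 0 and 1) with Fisher's method."""
    dof = 2 * len(probs)
    # Close to 1 when the token probabilities are high, close to 0 when they are low.
    spam_indicator = chi2_sf_even(-2.0 * sum(log(p) for p in probs), dof)
    # The mirror image, computed from the complementary probabilities.
    ham_indicator = chi2_sf_even(-2.0 * sum(log(1.0 - p) for p in probs), dof)
    return (1.0 + (spam_indicator - ham_indicator)) / 2.0
```

Token probabilities clustered near 1 push the score toward 1, probabilities clustered near 0 push it toward 0, and evenly balanced evidence yields a score near 0.5.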

The input may be one message or many. Messages are broken up on "From " lines. The algorithm is relatively insensitive to message miscounts.
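The splitting rule can be illustrated with a small Python sketch. The function name is hypothetical, and the sketch does only what the text says: it starts a new message at every line that begins with "From " and performs no further mbox validation.

```python
def split_messages(text):
    """Split input into messages at lines that begin with 'From ' (note the space)."""
    messages, current = [], []
    for line in text.splitlines(keepends=True):
        # A "From " line closes the message collected so far, if there is one.
        if line.startswith("From ") and current:
            messages.append("".join(current))
            current = []
        current.append(line)
    if current:
        messages.append("".join(current))
    return messages
```

A "From:" header does not trigger a split because it lacks the trailing space, while input with no "From " line at all comes back as a single message.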

 

OPTIONS

Without command-line options, bogofilter returns 1 if the message is non-spam, 0 if it is spam. The non-spam wordfile is created if absent.

HELP OPTIONS

The -h option prints the help message and exits.

The -V option prints the version number and exits.

The -Q (query) option prints bogofilter's configuration, i.e. registration parameters, parsing options, bogofilter directory, etc.

CLASSIFICATION OPTIONS

The -p (passthrough) option writes a copy of the input mail to the output with an X-Bogosity header (in the style of SpamAssassin) inserted. The header will begin with "Yes" or "No" according to whether the mail is judged to be spam or non-spam respectively. Note: the memory consumption depends on whether the input file is regular and allows seek operations. If so, the file will be rewound and read a second time, without using much memory. If, however, the input file is not regular (for example, a pipeline or socket), then bogofilter will cache a copy of the entire mail in memory.

The -e (embed) option tells bogofilter to exit with code 0 even if the mail is not spam. This simplifies using bogofilter from procmail or maildrop.

The -t (terse) option tells bogofilter to print an abbreviated spamicity message containing 1 letter and the score. The letter will be "Y" to indicate spam and "N" to indicate non-spam.

The -u option tells bogofilter to register the message's text after classifying it as spam or non-spam. A spam message will be registered on the spamlist and a non-spam message on the goodlist. If using the Robinson-Fisher method and the classification is "unsure", the message will not be registered. Effectively this option runs bogofilter with the -s or -n flag, as appropriate. (Caution is urged in the use of this capability, as any classification errors bogofilter may make will be preserved and accumulated until corrected with the -Sn and -Ns option combinations.)

The -2 option tells bogofilter to binary classify the message as either ham or spam, and never as unsure. When this option is used with -u, a wordlist is always updated.

The -3 option tells bogofilter to use tristate classification for the message, i.e. classify the message as ham, spam, or unsure. This option is effective only if ham_cutoff is non-zero.

The -M option tells bogofilter to process its input as an mbox-formatted file. If the -v or -t option is also given, a spamicity line will be printed for each message.

The -b (streaming bulk mode) option tells bogofilter to classify multiple messages whose names are read from stdin. If the -v or -t option is also given, bogofilter will print a line giving file name and classification information for each file.

The -B filename ... (bulk mode) option tells bogofilter to classify multiple messages named as files on the command line. If the -v or -t option is also given, bogofilter will print a line giving file name and classification information for each file.

The -F (force) option ignores threshold values when printing spamicity statistics.

The -R option tells bogofilter to output an R data frame in text form on the standard output. See the section on integration with R, below, for further detail.

REGISTRATION OPTIONS

The -s option tells bogofilter to register the text presented on standard input as spam. The spam wordfile is created if absent.

The -n option tells bogofilter to register the text presented on standard input as non-spam.

Bogofilter doesn't detect if a message is registered twice. If you do this by accident, the token counts will be off by 1 from what you really want and the corresponding spam scores will be slightly off. Given a large number of tokens and messages in the wordlists, this doesn't matter. The problem _can_ be corrected by using the -S option or the -N option.

The -S option tells bogofilter to undo a prior registration of the same message as spam. If a message was incorrectly entered in the spam wordfile by '-s' or '-u' and you want to remove it from the spam wordfile and enter it in the non-spam wordfile, use options '-Sn'. If '-S' is used for a message that wasn't registered as spam, the counts will still be decremented.

The -N option tells bogofilter to undo a prior registration of the same message as non-spam. If a message was incorrectly entered in the non-spam wordfile by '-n' or '-u' and you want to remove it from the non-spam wordfile and enter it in the spam wordfile, then use '-Ns'. If '-N' is used for a message that wasn't registered as non-spam, the counts will still be decremented.
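The bookkeeping behind the registration options can be pictured with plain dictionaries. This is a toy model, not bogofilter's database code: the function names are hypothetical, counting each token once per message is an assumption chosen to match the "off by 1" remark above, and clamping counts at zero is likewise an assumption of the sketch.

```python
def register(wordlist, tokens):
    """Like -s or -n: add one to the count of every distinct token in the message."""
    for token in set(tokens):
        wordlist[token] = wordlist.get(token, 0) + 1

def unregister(wordlist, tokens):
    """Like -S or -N: subtract one from every distinct token, whether or not it was registered."""
    for token in set(tokens):
        wordlist[token] = max(0, wordlist.get(token, 0) - 1)  # clamp at 0 (assumption)

def move_registration(source, destination, tokens):
    """Like -Sn (spam to non-spam) or -Ns (non-spam to spam)."""
    unregister(source, tokens)
    register(destination, tokens)
```

Registering the same message twice leaves every one of its tokens one count too high, and a single unregister() call restores the intended counts.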

GENERAL OPTIONS

The -c filename option tells bogofilter to read the named config file.

The -C option prevents bogofilter from reading configuration files.

The -d dir option allows you to set the directory under which wordlists will be found to dir. If omitted, the default directory will be $BOGOFILTER_DIR if BOGOFILTER_DIR is set and $HOME/.bogofilter otherwise.

The -k size option sets the cache size for the BerkeleyDB subsystem. Properly sizing the cache improves bogofilter's performance. Run the bogotune script to determine the recommended size.

The -l option writes an informational line to the system log each time bogofilter is run. The information logged depends on how bogofilter is run.

The -L tag option configures a tag which can be included in the information being logged by the -l option, but it requires a custom format that includes the %l string for now. This option implies -l.

The -I filename option tells bogofilter to read its input from the specified file, rather than from stdin.

The -O filename option tells bogofilter where to write its output in passthrough mode. Note that this only works when -p is explicitly given.

ALGORITHM OPTIONS

The Robinson-Fisher method is the default algorithm used for computing a message's spamicity score, unless bogofilter has been compiled without it, by using the --disable-robinson-fisher option to the configure script. The method to be used can be specified on the command line or in the configuration file.

The -g option selects the original Graham form of the calculation method.

The -r option selects the Robinson modifications to the calculation method.

The -f option selects the Robinson-Fisher modifications to the calculation method.

The configure script has options --disable-graham-method, --disable-robinson-method, and --disable-robinson-fisher so that bogofilter can be built to support a subset of the available methods.

PARSING OPTIONS

Bogofilter has three special parsing options which can be enabled (or disabled) at the user's discretion. The options are of the form -Px and -PX where x designates an option letter. For the parsing options, a lower case letter enables the option and an upper case letter disables it.

Options -Ph and -PH are for header line markup, i.e. whether to create special tags for header lines. When enabled, tokens in "To:", "From:", "Return-Path:", and "Subject:" lines will be given special prefixes. Enabling this option increases bogofilter's accuracy.

Options -Pi and -PI are for ignoring case, i.e. whether to map upper case to lower case (or not). Disabling this option increases bogofilter's accuracy.

Options -Pt and -PT are for tokenizing the innards of 3 html tags, i.e. <a>, <img>, and <font>. Tokenizing these tags adds urls and font names to the message's tokens. Enabling this option increases bogofilter's accuracy.

PARAMETER OPTIONS

The -m [value][,value][,value] option allows setting the min_dev value and, optionally, the robs and robx values. If one value is supplied, then min_dev is set. If a comma followed by one value is supplied, then robs is set. With two values, both min_dev and robs are set; with three, min_dev, robs and robx are set; and other combinations of values and commas behave as one would expect. Note that the syntax is misleading: at least one of the values MUST be present, and the commas determine which value(s) will be set. Note: spaces are not allowed after the comma.

The -o [value][,value] option allows setting the spam_cutoff value and, optionally, the ham_cutoff value. If one value is supplied, then spam_cutoff is set. If a comma followed by one value is supplied, then ham_cutoff is set. With two values, both spam_cutoff and ham_cutoff are set. Note that the syntax is misleading: at least one of the values MUST be present, and the comma determines whether the spam or the ham cutoff is set. Note: spaces are not allowed after the comma.
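The positional comma syntax shared by -m and -o can be made concrete with a small Python sketch. It illustrates the rules stated above and is not bogofilter's own argument parser; the function name parse_values is hypothetical.

```python
def parse_values(arg, names):
    """Map a 'value,value,...' argument onto parameter names by position.

    Empty positions are skipped, so ',0.1' sets only the second name.
    """
    fields = arg.split(",")
    if len(fields) > len(names):
        raise ValueError("too many values")
    settings = {}
    for name, field in zip(names, fields):
        if field != field.strip():
            raise ValueError("spaces are not allowed around values")
        if field:
            settings[name] = float(field)
    if not settings:
        raise ValueError("at least one value must be present")
    return settings
```

For example, parse_values("0.1", ("min_dev", "robs", "robx")) sets only min_dev, while parse_values(",0.1", ("spam_cutoff", "ham_cutoff")) sets only ham_cutoff.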

INFO OPTIONS

The -q (quiet) option suppresses warning messages.

The -v option produces a report to standard output on bogofilter's analysis of the input. Each additional v will increase the verbosity of the output, up to a maximum of 4. With -vv, the report lists the tokens with highest deviation from a mean of 0.5 association with spam.

The -y date option specifies the date to give to tokens that don't have dates.

The -D option redirects debug output to stdout.

The -x flags option allows setting of debug flags for printing debug information.

 

ENVIRONMENT

Bogofilter will initialize its database directory to $BOGOFILTER_DIR if BOGOFILTER_DIR is set. If it is not set, bogofilter will use $HOME/.bogofilter instead. If neither BOGOFILTER_DIR nor HOME is set, the -d dir option must be present.
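The lookup order can be summarized in a few lines of Python. This sketch mirrors the rules stated here and under the -d option; the function name is hypothetical and the code is not taken from bogofilter.

```python
import os

def resolve_wordlist_dir(cli_dir=None, environ=None):
    """Pick the wordlist directory: -d dir, then $BOGOFILTER_DIR, then $HOME/.bogofilter."""
    env = os.environ if environ is None else environ
    if cli_dir:
        return cli_dir
    if env.get("BOGOFILTER_DIR"):
        return env["BOGOFILTER_DIR"]
    if env.get("HOME"):
        return env["HOME"] + "/.bogofilter"
    raise RuntimeError("neither BOGOFILTER_DIR nor HOME is set; -d dir is required")
```

Passing a dictionary as environ makes the function easy to exercise without touching the real environment.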

 

CONFIGURATION

The bogofilter command line allows setting of many options that determine how bogofilter operates. File /etc/bogofilter.cf can be used to set additional parameters that affect its operation. File /etc/bogofilter.cf.example has samples of all of the parameters. Status and logging messages can be customized for each site (see /etc/bogofilter.cf.example).

 

RETURN VALUES

0 for spam; 1 for non-spam; 2 for I/O or other errors.

If both -p and -e are used, the return values are: 0 for spam or non-spam; 2 for I/O or other errors.

Error 2 usually means that the wordlist files bogofilter wants to read at startup are missing or the hard disk has filled up in -p mode.
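A calling program can turn these return values into a verdict along the following lines. This is a sketch under stated assumptions: the function names are hypothetical, bogofilter is assumed to be on the PATH and is run without options, and any status other than 0 or 1 is treated as an error.

```python
import subprocess

VERDICTS = {0: "spam", 1: "non-spam"}

def interpret_exit_code(code):
    """Translate bogofilter's exit status (without -e) into a verdict string."""
    if code in VERDICTS:
        return VERDICTS[code]
    # 2 is documented as I/O or other errors; anything else is unexpected.
    raise RuntimeError("bogofilter failed with exit status %d" % code)

def classify(message_bytes, executable="bogofilter"):
    """Feed one message to bogofilter on standard input and return the verdict."""
    result = subprocess.run([executable], input=message_bytes)
    return interpret_exit_code(result.returncode)
```

Raising on status 2 keeps a missing wordlist or a full disk from being silently mistaken for a classification.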

 

INTEGRATION WITH OTHER TOOLS

Use with Procmail

The following procmail rule will take mail on stdin and direct it to Mail/spam if bogofilter thinks it's spam:

:0HB:
* ? bogofilter
Mail/spam


This similar rule also registers the tokens in the mail according to bogofilter's classification:

:0HB:
* ? bogofilter -u
Mail/spam


 

If bogofilter fails (returning 2) the message will be treated as non-spam.

The following recipe (a) spam-bins anything that bogofilter rates as spam, (b) adds the words in messages rated as spam to the spam wordlist, and (c) adds the words in messages rated as non-spam to the non-spam wordlist. With this in place, it will normally only be necessary for the user to intervene (with -Ns or -Sn) when bogofilter miscategorizes something.


  
# filter mail through bogofilter, tagging it as spam and
# updating the word lists

:0fw
| bogofilter -u -e -p


# if bogofilter failed, return the mail to the queue, the MTA will
# retry to deliver it later
# 75 is the value for EX_TEMPFAIL in /usr/include/sysexits.h

:0e
{ EXITCODE=75 HOST }


# file the mail to spam-bogofilter if it's spam.

:0:
* ^X-Bogosity: Yes, tests=bogofilter
spam-bogofilter


The following recipe is for maildrop, which automatically defers the mail and retries later when the xfilter command fails. Use it in your ~/.mailfilter:

xfilter "bogofilter -u -e -p"
if (/^X-Bogosity: Yes, tests=bogofilter/)
{
  to "spam-bogofilter"
}

The following .muttrc lines will create mutt macros for dispatching mail to bogofilter.

macro index d "<enter-command>unset wait_key\n\
<pipe-entry>bogofilter -n\n\
<enter-command>set wait_key\n\
<delete-message>" "delete message as non-spam"
macro index \ed "<enter-command>unset wait_key\n\
<pipe-entry>bogofilter -s\n\
<enter-command>set wait_key\n\
<delete-message>" "delete message as spam"

Integration with Mail Transport Agent (MTA)

bogofilter can also be integrated into an MTA to filter all incoming mail. While the specific implementation is MTA dependent, the general steps are as follows:

1. Install bogofilter on the mail server.
2. Prime the bogofilter databases with a spam and non-spam corpus. Since bogofilter will be serving a larger community, it is important to prime it with a representative set of messages.
3. Set up the MTA to invoke bogofilter on each message. While this is an MTA specific step, you'll probably need to use the -p, -u, and -e options.
4. Set up a mechanism for users to register spam/nonspam messages, as well as to correct mis-classifications. The most generic solution is to set up alias email addresses to which users bounce messages.
5. See the doc and contrib directories for more information.

Use of R to verify Bogofilter calculations

The -R option tells bogofilter to generate an R data frame. The data frame contains one row per token analysed. Each such row contains the token, the sum of its database "good" and "spam" counts, the "good" count divided by the number of non-spam messages used to create the training database, the "spam" count divided by the spam message count, Robinson's f(w) for the token, the natural logs of (1 - f(w)) and f(w), and an indicator character (+ if the token's f(w) value exceeded the minimum deviation from 0.5, - if it didn't). There is one additional row at the end of the table that contains a label in the token field, followed by the number of words actually used (the ones with + indicators), Robinson's P, Q, S, s and x values and the minimum deviation.

The R data frame can be saved to a file and later read into an R session (see the R project website: http://cran.r-project.org for information about the mathematics package R). Provided with the bogofilter distribution is a simple R script (file bogo.R) that can be used to verify bogofilter's calculations. Instructions for its use are included in the script in the form of comments.

 

LOG MESSAGES

Bogofilter writes messages to the system log when the -l option is used. What is written depends on which other flags are used.

A classification run will generate (we are not showing the date and host part here):


bogofilter[1412]: X-Bogosity: No, spamicity=0.000227
bogofilter[1415]: X-Bogosity: Yes, spamicity=0.998918

Using '-u' to classify a message and update a wordlist will produce (on a single line):


bogofilter[1426]: X-Bogosity: Yes, spamicity=0.998918,
  register -s, 329 words, 1 messages
    

Registering words ('-l' and '-s', '-n', '-S', or '-N') will produce:


bogofilter[1440]: register-n, 255 words, 1 messages
    

 

A registration run (using '-s', '-n', '-N', or '-S') will generate messages like:


bogofilter[17330]: register-n, 574 words, 3 messages
bogofilter[6244]: register-s, 1273 words, 4 messages
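Log lines of the shapes shown in this section can be picked apart with regular expressions. The sketch below is based only on the sample lines above; the function and field names are hypothetical, and a site that customizes its log format would need different patterns.

```python
import re

# Patterns derived from the sample log lines in this section only.
CLASSIFY_RE = re.compile(r"bogofilter\[(\d+)\]: X-Bogosity: (Yes|No), spamicity=([0-9.]+)")
REGISTER_RE = re.compile(r"register ?(-[snSN]), (\d+) words, (\d+) messages")

def parse_log_line(line):
    """Return the fields found in one bogofilter log line as a dictionary."""
    entry = {}
    match = CLASSIFY_RE.search(line)
    if match:
        entry["pid"] = int(match.group(1))
        entry["spam"] = match.group(2) == "Yes"
        entry["spamicity"] = float(match.group(3))
    match = REGISTER_RE.search(line)
    if match:
        entry["flag"] = match.group(1)
        entry["words"] = int(match.group(2))
        entry["messages"] = int(match.group(3))
    return entry
```

A line produced with -u yields both the classification fields and the registration fields, since both patterns match it.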

 

FILES

/etc/bogofilter.cf
System configuration file.

~/.bogofilter.cf
User configuration file.

~/.bogofilter/goodlist.db
List of good tokens.

~/.bogofilter/spamlist.db
List of spam tokens.

 

BUGS

Bogofilter counts messages on input by looking for "From " lines. As a special case, a single message without a "From " line is counted correctly. Multiple messages without intervening "From " lines will be counted as one message.

Bogofilter does not canonicalize the transport encoding or character set, sacrificing precision. We used to believe that spam with enclosures invariably gives itself away through cues in the headers and non-enclosure parts, but this is not true. This will be fixed in a future version.

 

AUTHOR

Eric S. Raymond <esr@thyrsus.com>.

For updates, see the bogofilter project page: http://bogofilter.sourceforge.net/.


 
