NOTE: Please also see the CHANGES-0.13 document for a detailed summary.

With release 0.13, bogofilter's parsing has changed.  As background,
Paul Graham has done work to improve the results of his Bayesian filter
and has published it in "Better Bayesian Filtering" at
http://www.paulgraham.com/better.html.  He found the following
definition of a token to be beneficial:

  1. Case is preserved.
  2. Exclamation points are constituent characters.
  3. Periods and commas are constituents if they occur between two
     digits.  This keeps IP addresses and prices intact.
  4. A price range like $20-25 yields two tokens, $20 and $25.
  5. Tokens that occur within the To, From, Subject, and Return-Path
     lines, or within URLs, get marked accordingly.

Bogofilter has always done #3 and has tagged Subject lines for a while.
Its parser now does all of these things.  Several command line switches
and config file options have been added to allow enabling or disabling
them.  Here are the new switches and options:

  -Pi/-PI    ignore_case           default - disabled
  -Ph/-PH    header_line_markup    default - enabled
  -Pt/-PT    tokenize_html_tags    default - enabled

The options can be enabled using the lower case switch or disabled
using the upper case switch.

When header_line_markup is enabled, tokens in To:, From:, Subject:, and
Return-Path: lines are prefixed by "to:", "from:", "subj:", and "rtrn:"
respectively.

When tokenize_html_tags is enabled, tokens in A, IMG, and FONT tags are
scored while classifying the message.

NOTE: To take full advantage of these changes, additional training of
bogofilter is necessary.  Here's why:

Since bogofilter now distinguishes upper and lower case, the wordlists
won't match as many words as before.  For example, "From" and "from"
both used to match "from", but this is no longer the case.  As
additional training is done, words like these will be added to the
wordlists and bogofilter will have a larger number of distinct tokens
to use when classifying messages.  This will improve its classification
accuracy.

Similarly, with header_line_markup enabled, "Subject: great p0rn site"
is tokenized as "subj:great", "subj:p0rn", and "subj:site".  At first
these tokens won't be recognized, so bogofilter won't use them to score
the message.  Once it has been trained, bogofilter will have these
additional tokens to aid in classification.
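
To make the new token definition more concrete, here is a minimal
Python sketch of the header-line markup described above.  It is purely
illustrative: bogofilter's actual parser is implemented in C, and the
regular expression below only approximates points 1-3 (for instance, it
does not split "$20-25" into "$20" and "$25").

  # Illustrative sketch only -- not bogofilter's real lexer.
  import re

  # Prefixes applied when header_line_markup is enabled.
  HEADER_PREFIXES = {
      "To": "to:",
      "From": "from:",
      "Subject": "subj:",
      "Return-Path": "rtrn:",
  }

  # A token is a run of letters, digits, '$', or '!' (exclamation points
  # are constituents).  A '.' or ',' stays inside the token only when the
  # next character is a digit, which keeps strings like "192.168.0.1" and
  # "$19.95" intact (a rough stand-in for point 3).
  TOKEN_RE = re.compile(r"[\w$!]+(?:[.,](?=\d)[\w$!]+)*")

  def tokenize_header_line(line, header_line_markup=True):
      """Tokenize one header line, e.g. 'Subject: great p0rn site'."""
      field, _, body = line.partition(":")
      prefix = HEADER_PREFIXES.get(field.strip(), "") if header_line_markup else ""
      # Case is preserved; an ignore_case variant would lower-case the tokens here.
      return [prefix + tok for tok in TOKEN_RE.findall(body)]

  print(tokenize_header_line("Subject: great p0rn site"))
  # -> ['subj:great', 'subj:p0rn', 'subj:site']

With header_line_markup disabled, the same line would yield the plain
tokens "great", "p0rn", and "site", which is why retraining is needed
before the prefixed forms start contributing to classification.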