NOTE: Please also see the CHANGES-0.13 document for a detailed summary.

With release 0.13, bogofilter's parsing has changed.  As background,
Paul Graham has done work to improve the results of his Bayesian filter
and has published it in "Better Bayesian Filtering" at
http://www.paulgraham.com/better.html.  He found the following
definition of a token to be beneficial:

  1. Case is preserved.
  2. Exclamation points are constituent characters.
  3. Periods and commas are constituents if they occur between two
     digits.  This keeps IP addresses and prices intact.
  4. A price range like $20-25 yields two tokens, $20 and $25.
  5. Tokens that occur within the To, From, Subject, and Return-Path
     lines, or within URLs, get marked accordingly.

Bogofilter has always done #3 and has tagged Subject lines for a while.
Its parser now does all of these things.  Several command line switches
and config file options have been added to allow enabling or disabling
them.  Here are the new switches and options:

  -Pi/-PI    ignore_case           default - disabled
  -Ph/-PH    header_line_markup    default - enabled
  -Pt/-PT    tokenize_html_tags    default - enabled

The options can be enabled using the lower case switch or disabled
using the upper case switch.

When header_line_markup is enabled, tokens in To:, From:, Subject:, and
Return-Path: lines are prefixed by "to:", "from:", "subj:", and "rtrn:"
respectively.

When tokenize_html_tags is enabled, tokens in A, IMG, and FONT tags are
scored while classifying the message.

NOTE: To take full advantage of these changes, additional training of
bogofilter is necessary.  Here's why:

Since bogofilter now distinguishes upper and lower case, the wordlists
won't match as many words as before.  For example, "From" and "from"
both used to match "from", but this is no longer the case.  As
additional training is done, words like these will be added to the
wordlists and bogofilter will have a larger number of distinct tokens
to use when classifying messages.  This will improve its classification
accuracy.

Similarly, with header_line_markup enabled, "Subject: great p0rn site"
is tokenized as "subj:great", "subj:p0rn", and "subj:site".  At first
these tokens won't be recognized, so bogofilter won't use them to score
the message.  Once it has been trained, bogofilter will have these
additional tokens to aid in classification.
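
To make the new token definition more concrete, here is a minimal
Python sketch of the header-line markup described above.  It is purely
illustrative: bogofilter's actual parser is implemented in C, and the
regular expression below only approximates points 1-3 (for instance, it
does not split "$20-25" into "$20" and "$25").

  # Illustrative sketch only -- not bogofilter's real lexer.
  import re

  # Prefixes applied when header_line_markup is enabled.
  HEADER_PREFIXES = {
      "To": "to:",
      "From": "from:",
      "Subject": "subj:",
      "Return-Path": "rtrn:",
  }

  # A token is a run of letters, digits, '$', or '!' (exclamation points
  # are constituents).  A '.' or ',' stays inside the token only when the
  # next character is a digit, which keeps strings like "192.168.0.1" and
  # "$19.95" intact (a rough stand-in for point 3).
  TOKEN_RE = re.compile(r"[\w$!]+(?:[.,](?=\d)[\w$!]+)*")

  def tokenize_header_line(line, header_line_markup=True):
      """Tokenize one header line, e.g. 'Subject: great p0rn site'."""
      field, _, body = line.partition(":")
      prefix = HEADER_PREFIXES.get(field.strip(), "") if header_line_markup else ""
      # Case is preserved; an ignore_case variant would lower-case the tokens here.
      return [prefix + tok for tok in TOKEN_RE.findall(body)]

  print(tokenize_header_line("Subject: great p0rn site"))
  # -> ['subj:great', 'subj:p0rn', 'subj:site']

With header_line_markup disabled, the same line would yield the plain
tokens "great", "p0rn", and "site", which is why retraining is needed
before the prefixed forms start contributing to classification.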