Spam filtering


When an incoming e-mail message arrives at the BMRB SMTP server, the server passes it to a Mail Delivery Agent (MDA) program for delivery to the user's mailbox. The MDA can do a number of things during delivery, including piping the message through a spam filtering program, delivering to sub-folders, forwarding to a different address, etc.

At BMRB we use the maildrop MDA and SpamAssassin for spam filtering. SpamAssassin runs a number of tests on the message, calculates its “spam score”, and adds that score to the headers of the message. You can then use maildrop's filtering capabilities to do something with the message based on its spam score.

Spam filtering should be on by default: the sysadmin sets it up when creating your account. Unless he forgets.

Enable filtering

You should have the .mailfilter file in your home directory with the following content:

# ~/.mailfilter file for maildrop-2.x
# must be mode 600 (rw-------)
import SENDER
xfilter "/usr/bin/spamc -f ${SENDER}"

if( /^X-Spam-Level: \*{7,}/ )
    to "/dev/null"
  • this instructs maildrop to run SpamAssassin on each message and to silently discard messages with a spam score of 7 or more. Everything else is delivered to your Inbox (./Maildir).

To see if the file exists and what's in it, use a GUI file manager (you may have to enable “show hidden files”) or open a terminal (shell) window and run:

cd ~
cat .mailfilter

If the output is “No such file or directory”, then the file doesn't exist.

If the file doesn't exist, run in the terminal window:

cd ~
touch .mailfilter
chmod 600 .mailfilter

then edit .mailfilter with a text editor and paste in the lines above. Available text editors include nedit (GUI) and vi (shell).
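The steps above can also be rolled into one small shell function. This is only a sketch: install_mailfilter is a made-up helper name (not part of maildrop), the file content is the filter shown earlier on this page, and the function refuses to overwrite a file that already exists.

```shell
# install_mailfilter [FILE]: create the filter file with mode 600 if missing.
# Hypothetical helper for illustration; FILE defaults to ~/.mailfilter.
install_mailfilter() {
    target="${1:-$HOME/.mailfilter}"
    if [ -e "$target" ]; then
        echo "$target already exists, leaving it alone" >&2
        return 1
    fi
    # umask 077 in a subshell: the file is never readable by anyone else
    (
        umask 077
        cat > "$target" <<'EOF'
# ~/.mailfilter file for maildrop-2.x
# must be mode 600 (rw-------)
import SENDER
xfilter "/usr/bin/spamc -f ${SENDER}"

if( /^X-Spam-Level: \*{7,}/ )
    to "/dev/null"
EOF
    )
    chmod 600 "$target"
}
```

Calling install_mailfilter with no argument targets ~/.mailfilter; pass another path first if you want to look at the result before using it.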

SpamAssassin scores

SpamAssassin adds these headers to every message:

  • X-Spam-Flag: YES if message scores required_score (see below) or more
  • X-Spam-Level: spam score as asterisks (*), rounded down (e.g. 5 asterisks for spam score 5.5)
  • X-Spam-Status: contains detailed report including required score and the tests run on the message.

You'd normally filter the messages based on one of the first two.
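If you want to see what SpamAssassin decided about a message you already have on disk, something like the following should do. It is a sketch, not part of SpamAssassin: spam_level and spam_flagged are hypothetical helper names, and they assume the message is a plain file with Unix line endings, as in a Maildir.

```shell
# spam_level FILE: print the number of asterisks in the X-Spam-Level header.
spam_level() {
    # sed '/^$/q' stops at the first blank line, so only headers are examined
    stars=$(sed '/^$/q' "$1" | sed -n 's/^X-Spam-Level: \(\**\).*$/\1/p' | head -n 1)
    printf '%s\n' "${#stars}"
}

# spam_flagged FILE: succeed if the headers contain "X-Spam-Flag: YES".
spam_flagged() {
    sed '/^$/q' "$1" | grep -q '^X-Spam-Flag: YES'
}
```

A message with no X-Spam-Level header at all is reported as level 0, the same as one that scored below 1.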

simple example

Discard everything SpamAssassin thinks is spam.

Put in the .mailfilter:

if( /^X-Spam-Flag: YES/ )
    to "/dev/null"

(that is based on X-Spam-Flag and required_score, see below).

  • /dev/null is the Unix bitbucket. Delivering to /dev/null simply deletes the message; it never even gets to your mailbox.

slightly less simple example

Discard everything with a SpamAssassin score of 7 or above, and deliver messages with scores from 5 up to 7 to a subfolder (for review, in case they are “false positives”).
Create a subfolder spam in your Inbox, then change your .mailfilter to:

if( /^X-Spam-Level: \*{7,}/ )
    to "/dev/null"
if( /^X-Spam-Level: \*{5,}/ )
    to "Maildir/.spam"
  • message processing ends at the first to command that runs. Messages with a score of 7 or more never reach the second if line because they have already been handled by
    to "/dev/null"
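Before trusting a filter that throws mail away, you may want to see what the two rules would do to messages you already have. The function below imitates them in plain shell. It is a sketch under assumptions: classify is a hypothetical name, it does not run maildrop itself, and it only looks at headers that SpamAssassin has already added.

```shell
# classify FILE: print discard, spam or inbox, mirroring the two rules above.
classify() {
    headers=$(sed '/^$/q' "$1")    # headers only, up to the first blank line
    if printf '%s\n' "$headers" | grep -Eq '^X-Spam-Level: \*{7,}'; then
        echo discard    # first rule would match:  to "/dev/null"
    elif printf '%s\n' "$headers" | grep -Eq '^X-Spam-Level: \*{5,}'; then
        echo spam       # second rule would match: to "Maildir/.spam"
    else
        echo inbox      # no rule matches: default delivery
    fi
}

# Example use (not run here):
#   for f in ~/Maildir/cur/*; do printf '%s %s\n' "$(classify "$f")" "$f"; done
```

The order of the tests matters for the same reason as in .mailfilter: a message with 7 or more asterisks also matches the 5-or-more pattern, so the stricter test has to come first.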

Many mail client programs can be configured to use SpamAssassin's scores. E.g. in Thunderbird you can use the junk mail control settings dialog to have it automatically move messages marked as junk by SpamAssassin to the spam folder. The maildrop way above is more efficient because the mail is filtered on the mail server and is delivered directly to spam. With Thunderbird's filter the message first gets delivered to Inbox, then Thunderbird has to fetch it from the server to run its junk mail controls, and then it tells the server to move the message from Inbox to spam.

Tuning up SpamAssassin

By default, “spam” means messages with a score of 5 or higher. That is controlled by the required_score setting in .spamassassin/user_prefs. The basic rule is: a higher required_score lets more spam through, but there's less chance of a legitimate message being marked as spam (“false positive”).

required_score 5
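To check which threshold your own preferences file sets, a small function like this can help. It is a sketch: required_score as a function name is just for illustration, it assumes a later required_score line overrides an earlier one, and it falls back to 5 when the file has no such line. It does not look at any site-wide configuration the sysadmin may have set.

```shell
# required_score [FILE]: print the last required_score set in FILE, or 5 if none.
required_score() {
    prefs="${1:-$HOME/.spamassassin/user_prefs}"
    score=""
    if [ -r "$prefs" ]; then
        # commented-out lines start with "#", so their first field never matches
        score=$(awk '$1 == "required_score" { s = $2 } END { if (s != "") print s }' "$prefs")
    fi
    printf '%s\n' "${score:-5}"
}
```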

Other frequently used settings in .spamassassin/user_prefs:

  • whitelist_from somebody@somewhere: use if legitimate mail from somebody@somewhere gets mis-identified as spam. You can have as many whitelist_from lines as you need
  • use_bayes 1: enable Bayesian filter (see below)
  • bayes_ignore_from somebody@somewhere: like whitelist_from above, only specific to Bayesian filter.
  • ok_locales en: mark messages in non-English character sets as spam. Only one ok_locales line is allowed, but it can contain a list: ok_locales en ja ru (allow western, Japanese, and Russian character sets), or all (ok_locales all is the default). List of locales:
    en - Western character sets in general
    ja - Japanese character sets
    ko - Korean character sets
    ru - Cyrillic character sets
    th - Thai character sets
    zh - Chinese (both simplified and traditional) character sets

(All ISO-8859-* character sets, and Windows code page character sets, are always permitted by default.)
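Since whitelist_from lines tend to pile up over time, it can be handy to add them with a helper that skips addresses already listed. The following is a sketch: whitelist_sender is a hypothetical name, and the default path is the .spamassassin/user_prefs file mentioned above.

```shell
# whitelist_sender ADDRESS [FILE]: append "whitelist_from ADDRESS" unless present.
whitelist_sender() {
    addr="$1"
    prefs="${2:-$HOME/.spamassassin/user_prefs}"
    mkdir -p "$(dirname "$prefs")"
    touch "$prefs"
    # -x matches the whole line, -F treats the address as a fixed string
    if ! grep -qxF "whitelist_from $addr" "$prefs"; then
        printf 'whitelist_from %s\n' "$addr" >> "$prefs"
    fi
}
```

Running it twice with the same address leaves a single line in the file.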

Tuning up Bayesian filter

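The Bayesian filter works by running statistical analysis on the body of the message. Before it can be used, it must be trained on a representative sample of spam messages. SpamAssassin also needs to have learned from legitimate messages (“ham”) before it starts applying the filter.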

Collect a few hundred spam messages in a subfolder in your Inbox (e.g. train). Then log on to the mail server (as of the time of this writing, cowfish) and run the training program:

  ssh cowfish
  cd ~/Maildir/.train
  sa-learn --spam --showdots cur

Then delete messages from train.

Note the leading dot in the folder name: .train. This is how mail folders are stored in the Unix filesystem: Inbox is ~/Maildir, train is ~/Maildir/.train
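The folder naming rule is easy to get wrong on the command line, so here is a sketch that turns a folder name into its directory and counts the messages in it, e.g. to check that train really holds a few hundred messages before you run sa-learn. folder_dir and count_messages are hypothetical names, and only top-level subfolders of Inbox are handled.

```shell
# folder_dir NAME [MAILDIR]: print the directory that holds mail folder NAME.
folder_dir() {
    maildir="${2:-$HOME/Maildir}"
    case "$1" in
        Inbox|INBOX) printf '%s\n' "$maildir" ;;        # Inbox is the Maildir itself
        *)           printf '%s\n' "$maildir/.$1" ;;    # subfolders get a leading dot
    esac
}

# count_messages NAME [MAILDIR]: count the files in the folder's cur directory.
count_messages() {
    dir=$(folder_dir "$1" "${2:-$HOME/Maildir}")
    find "$dir/cur" -type f 2>/dev/null | wc -l | tr -d ' '
}
```

For example, count_messages train prints the number of files in ~/Maildir/.train/cur, which is the directory given to sa-learn above.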

See also