HOWTO: Bayesian filtering with SpamAssassin, courier-imap and maildrop

Document version: 1.1; created 2004-02-07; last modified 2005-07-10.

Introduction
What's needed?
Patch the source (or download my binary)
Configure SpamAssassin
Creating mail folders
Configure Maildrop
Configure the IMAP daemon
Configure the MTA
Done

Introduction

Do you know bayesian spam filtering? It is a filtering method that "learns" from the user whether a message is spam or ham (ham == non-spam). Once you have marked enough messages, the filter can estimate the probability that a new message is spam or ham. The accuracy depends on how many messages you have marked before, and on how consistently you marked them. If you want to learn more, check out Paul Graham's website.

So I thought about a user-friendly interface that works with most email clients and lets the user mark messages. SpamAssassin, already an excellent spam filter, has had a built-in bayes filter since version 2.50. Maildrop (part of the Courier mail suite) can sort messages into Maildir folders by regexp matching, and these folders can be read by some IMAP daemons. In a Maildir, each message is one file and vice versa. SpamAssassin contains the tool sa-learn to mark messages (or files containing messages) as spam or ham.

So the best thing would be: let the filters of SpamAssassin do the spam rating and write the spam level into the mail header. Maildrop can then pre-sort the messages, depending on the X-Spam-Level mail header, into the mail folders INBOX, AssumedSpam and Spam. If there are false positives (or negatives), the user sorts the remaining messages into INBOX or Spam, emptying AssumedSpam.

Now the sa-learn program needs to be called to train the bayes filter. A cron job over all mailboxes is inefficient, because we do not want all messages re-fed to the filter every night or so (it just creates load on the server, since SpamAssassin keeps track of which messages it has already learned). And if the user has not sorted his mail yet, the wrong messages are fed into sa-learn. Last, but not least, I watched some users move spam into the spam folder and then immediately delete the contents of the spam folder (you'll figure out why that may be A Bad Idea).

Now I thought: the IMAP daemon needs to call sa-learn to mark the messages. That way, messages are only marked when the user actively moves (or copies) them. But I did not find any IMAP daemon that supports calling an external program when a message is copied or moved.

Consequently, I enhanced (patched) the Courier IMAP daemon to call external programs. Now, if the user moves a message from one folder to another (which is achieved by "copy and delete" in the IMAP protocol), the imapd looks in its configuration for a rule that says what to call.

The disadvantage of this spam filtering system is a long waiting time when moving or copying a message from one IMAP folder to another. On my Pentium/166MHz server, the filter needs about 6 seconds for learning one message. On my P4/2GHz server, it is about 1 second per message.

What's needed?

You need a working mail environment in which the users' mailboxes are organized in the "Maildir" format. You also need a working IMAPD up and running.

Since I run a Debian GNU/Linux woody system, I patched the Courier IMAPD 1.4.3-2.3. Additionally, I installed spamassassin from Norbert Tretkowski's backports repository, because he provides a version >2.60 rather than the 2.20 in the official Debian release. I also installed courier-authdaemon and courier-maildrop. Any MTA (mail transport agent) that supports Maildir should suffice; I tested courier-mta and qmail (the former uses .courier, the latter .qmail as config files; the content of these files is the same in this howto).

It shouldn't be too hard to patch another version of the Courier IMAPD (or maybe another IMAPD), since I only had to edit one function in one file.

Patch the source (or download my binary)

Over time, I have used and patched quite a few versions of courier-imap, but all of them on Debian GNU/Linux. So if you use another distribution or version, you'll have to adapt the instructions.

Get the source and unpack it. Then edit the function copy_message() in imap/storeinfo.c. You may have a look at my storeinfo.c (see below). I only inserted two blocks of code, marked by the comments //DAVE and //EVAD.

WARNING: Do not simply copy my storeinfo.c into the source tree you just unpacked, unless you have unpacked the debian courier-0.37.3-2.3 source package. Instead, find out where I inserted my code and insert just that.

The file /etc/courier/imapd needs a few changes. The new settings are not yet documented in the file itself; they are only documented further down in this document.

Files:

storeinfo.c 1.4.3-2.3maligree1
storeinfo.c 1.7.3-9.maligree.1
storeinfo.c 3.0.2-1.backports.org.1+maligree.2

storeinfo.c diff 1.4.3-2.3maligree1
storeinfo.c diff 1.7.3-9.maligree.1
storeinfo.c diff 3.0.2-1.backports.org.1+maligree.2

For woody without backports:
courier-imap_1.4.3-2.3maligree1_i386.deb
For woody with early backports:
courier-imap_1.7.3-9.maligree.1_i386.deb
courier-imap-ssl_1.7.3-9.maligree.1_i386.deb
For woody with later backports:
courier-imap_3.0.2-1.backports.org.1+maligree.2_i386.deb
courier-imap-ssl_3.0.2-1.backports.org.1+maligree.2_i386.deb
For sarge:
courier-imap_3.0.8-4.maligree.1_i386.deb
courier-imap-ssl_3.0.8-4.maligree.1_i386.deb

Notice: Version 3.0.2-1.backports.org.1+maligree.1 was online between 2003-04-01 and 2003-04-03. That release contains a major bug in my patch; if you downloaded it, please use the new version (...+maligree.2)!

Configure SpamAssassin

First, you should have a look at the /etc/default/spamassassin file (this file may only exist on a Debian system). Make sure ENABLED is set to 1. That way, spamd will be started and you gain performance, because the config files are only read once.

You do want to use spamc and spamd here for performance. So do not forget to restart SpamAssassin after configuration.

All configuration in this section is done in /etc/spamassassin/local.cf.

In order to let maildrop do some sorting, you have to configure SpamAssassin to always insert the X-Spam-Level header and do it in an easily parseable way. I use '-' as the spam level indicator character instead of an asterisk, so I do not have to escape that char in the maildrop configuration below.

always_add_headers 1
spam_level_stars 1
spam_level_char -

SpamAssassin 2.5x has a "safety zone" (2.60 and higher don't have this) that prevents messages from being learned automatically if the score is within 4 points of "required_hits". Additionally, autolearning takes place before the bayes filter adds its score. So we do not get as much of a learning effect as we want, since some messages simply aren't learned. Conclusion: we deactivate autolearning and do the learning explicitly in maildrop.

auto_learn 0

We do not need to set required_hits, so you may set it as you wish.

required_hits 2.0

Finally, we do not want SpamAssassin to change the mail apart from adding a few headers:

rewrite_subject 0
report_safe 0

Maybe you want to configure SpamAssassin a bit further? See perldoc Mail::SpamAssassin::Conf for further options.
Oh, and here are my additional lines to prevent the bayes filter from analyzing the headers put into the mail by other spam checking systems:

bayes_ignore_header X-purgate
bayes_ignore_header X-purgate-ID
bayes_ignore_header X-purgate-Ad
bayes_ignore_header X-GMX-Antispam
bayes_ignore_header X-Antispam
bayes_ignore_header X-Spamcount
bayes_ignore_header X-Spamsensitivity

You should exclude any headers known to you that may confuse the bayes filter. For example, I receive a "spam trap list" -- a mailing list connected to a spamtrap. Any headers associated with this list should be ignored by the bayes filter. You do not have to worry about re-feeding the headers that SpamAssassin itself generates into the bayes filter. The bayes filter knows these headers and automagically skips them.

In my experience, the folder "AssumedSpam" will be filled mostly with spam. Maybe a threshold of 5.0 is a bit high -- SpamAssassin without bayes is really good when used with a threshold of 2.5 and a human supervisor who reviews the mails rated as spam. You may want to decrease that threshold as soon as the bayes filter actually rates some mail (see the section "Done" on this topic).

Creating mail folders

You need to create the mail folders before anything can be put into them. So become root and create them.

The following example assumes that you have all your user homes in /home and their Maildirs in $HOME/Maildir. If you have them elsewhere, you have to adapt the script or do it manually. Do not forget: some users do not like it if you mess around in their homes. Sometimes it is better to build a script and advertise it in /etc/motd. That way, the user can choose. This note also applies to the other sections.

# cd /home
# for a in *;
> do su $a -c "maildirmake /home/$a/Maildir/.Spam";
> done
# for a in *;
> do su $a -c "maildirmake /home/$a/Maildir/.AssumedSpam";
> done

You could also mark the folders as "subscribed", so that the mail clients automatically show them:

# for a in *;
> do su $a -c "printf 'INBOX.Spam\nINBOX.AssumedSpam\n' >>/home/$a/Maildir/courierimapsubscribed";
> done

You could also create the mail folders in /etc/skel, so that a new user will have them automatically.

(I assume that you already have a working mail system, so creating the Maildir itself should already be done.)
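
If you would rather let the users opt in themselves, a small script along the following lines could be advertised in /etc/motd. This is only a sketch: the function name setup_spam_folders is made up for this example, and it assumes that maildirmake is in the PATH and that the Maildir lives in $HOME/Maildir unless another path is given as the first argument.

```shell
#!/bin/sh
# Sketch of a per-user opt-in helper. The name setup_spam_folders is
# hypothetical; it is not part of Courier or SpamAssassin.
setup_spam_folders() {
    # Assumption: the Maildir is $HOME/Maildir unless a path is passed in.
    maildir=${1:-$HOME/Maildir}
    for folder in Spam AssumedSpam; do
        # Create the folder only if it does not exist yet.
        [ -d "$maildir/.$folder" ] || maildirmake "$maildir/.$folder" || return 1
        # Subscribe the folder, but do not add the same line twice.
        grep -qxF "INBOX.$folder" "$maildir/courierimapsubscribed" 2>/dev/null ||
            echo "INBOX.$folder" >>"$maildir/courierimapsubscribed"
    done
}
```

A user would call setup_spam_folders without arguments after sourcing the file. Running it a second time should be harmless, because existing folders and existing subscription lines are skipped.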

Configure Maildrop

Think about the spam level. At which level should maildrop put messages into INBOX, and at which level should they go into the spam folder? I'd use <1 for INBOX and >=5 for the spam folder. Everything in between goes into "AssumedSpam". Messages that go into INBOX should be learned as ham (the user can correct mistakes); messages that go into the spam folder should be learned as spam. Therefore, we need a "cc" statement that pipes the message through sa-learn. Maildrop stops parsing the config file after a "to" statement is executed, so the order is crucial!

Configuring Maildrop is easy. Just edit /etc/courier/maildroprc (the file does not exist by default):

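# Five or more dashes: learn as spam and deliver to the Spam folder.
if (/^X-Spam-Level: *-----.*$/)
{
    cc "|/usr/bin/sa-learn --single --spam"
    to "Maildir/.Spam"
}

# One to four dashes: deliver to AssumedSpam without learning.
if (/^X-Spam-Level: *-.*$/)
{
    to "Maildir/.AssumedSpam"
}

# Everything else: learn as ham and deliver to INBOX.
cc "|/usr/bin/sa-learn --single --ham"
to "Maildir"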

All messages with a spam level higher than or equal to 5 (five dashes or more) should go into the spam folder. All remaining messages with a spam level higher than or equal to 1 should go into the "AssumedSpam" folder. The others should go into INBOX.
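
To check the thresholds without delivering anything, the decision made by the rules above can be reproduced in a few lines of shell. This is just a sketch for testing: target_folder is a hypothetical helper and not part of maildrop, and it only looks at the first X-Spam-Level header found in the header block of a message read from stdin.

```shell
#!/bin/sh
# Hypothetical test helper: prints the folder that the maildrop rules
# in this howto would choose for the message on stdin.
target_folder() {
    # Read only the header block (up to the first empty line) and keep
    # the first X-Spam-Level header, if there is one.
    level=$(sed -n '/^$/q; /^X-Spam-Level:/{p;q;}')
    # Strip the header name and any spaces, leaving only the dashes.
    dashes=$(printf '%s\n' "$level" | sed 's/^X-Spam-Level: *//')
    case "$dashes" in
        -----*) echo "Maildir/.Spam" ;;         # level >= 5
        -*)     echo "Maildir/.AssumedSpam" ;;  # level 1 to 4
        *)      echo "Maildir" ;;               # level < 1 or no header
    esac
}
```

For example, piping a saved message through spamc and then through target_folder shows where that message would have been delivered.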

Configure the IMAP daemon

The Courier IMAPD uses environment variables for configuration. Therefore, I put my own config variables into the standard imapd config file /etc/courier/imapd. I named them ON_COPY_TO_BOX_folder, where folder is the name of the mailbox. Every time a message is copied or moved to a folder, the value of the matching environment variable is executed as a command. The string %s in the value will be replaced with the filename. Put %s in single quotes ('%s'), because folder names could contain special chars, and you do not want them to be interpreted by the shell.

You only need to install the patched version of the imapd and append some lines to the config file. These lines are for courier-imap 1.4.x (that means a plain woody system without any backports):

ON_COPY_TO_BOX_Spam="/usr/bin/sa-learn --spam --file '%s'"
ON_COPY_TO_BOX_Trash=""
ON_COPY_TO_BOX_AssumedSpam=""
ON_COPY_TO_NOT_LISTED="/usr/bin/sa-learn --ham --file '%s'"

...and these lines for 1.7.x:

ON_COPY_TO_BOX_Spam="source /etc/profile; /usr/bin/sa-learn --spam \`pwd\`/'%s'"
ON_COPY_TO_BOX_Trash=""
ON_COPY_TO_BOX_AssumedSpam=""
ON_COPY_TO_NOT_LISTED="source /etc/profile; /usr/bin/sa-learn --ham \`pwd\`/'%s'"

...and these for 3.0.8 (plain sarge system):

ON_COPY_TO_BOX_Spam="source /etc/profile; /usr/bin/sa-learn --spam \`pwd\`/'%s' --dbpath \`pwd\`/../.spamassassin"
ON_COPY_TO_BOX_Trash=""
ON_COPY_TO_BOX_AssumedSpam=""
ON_COPY_TO_NOT_LISTED="source /etc/profile; /usr/bin/sa-learn --ham \`pwd\`/'%s' --dbpath \`pwd\`/../.spamassassin"

Every message copied into "Spam" will be learned as spam. Every message copied into "Trash" or "AssumedSpam" will be left alone. Every message copied into any other folder (including those the user has created) will be learned as ham. Be careful with the folder names; they are case sensitive. Every character in the folder name that is not in [A-Za-z_] has to be replaced with an underscore.
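
The naming rule is easy to get wrong, so here is a small sketch that derives the variable name for a given folder and shows which command would be picked. Both function names are made up for illustration, and the lookup only mirrors the behaviour described in this section (a listed folder uses its own variable, even if it is empty; every other folder falls back to ON_COPY_TO_NOT_LISTED). How the patched imapd really does it is in storeinfo.c.

```shell
#!/bin/sh
# Hypothetical helpers that mirror the naming rule described in this howto.

# Print the config variable name that belongs to a folder name.
var_for_folder() {
    # Every character outside [A-Za-z_] becomes an underscore.
    printf 'ON_COPY_TO_BOX_%s\n' "$(printf '%s' "$1" | sed 's/[^A-Za-z_]/_/g')"
}

# Print the command configured for a folder (may be empty).
command_for_folder() {
    name=$(var_for_folder "$1")
    # eval is acceptable here because $name only contains [A-Za-z_].
    if eval "[ \"\${$name+set}\" = set ]"; then
        eval "printf '%s\n' \"\$$name\""
    else
        # Assumption: folders without their own variable use the fallback.
        printf '%s\n' "$ON_COPY_TO_NOT_LISTED"
    fi
}
```

With the settings from this section loaded into the shell, command_for_folder Spam prints the spam-learning command, command_for_folder Trash prints an empty line, and any other folder name prints the ham-learning command.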

If you want to log the output of sa-learn, you could append |/usr/bin/logger -t sa-learn, e.g.:

ON_COPY_TO_BOX_Spam="source /etc/profile; /usr/bin/sa-learn --spam \`pwd\`/'%s' |/usr/bin/logger -t sa-learn"
ON_COPY_TO_BOX_Trash=""
ON_COPY_TO_BOX_AssumedSpam=""
ON_COPY_TO_NOT_LISTED="source /etc/profile; /usr/bin/sa-learn --ham \`pwd\`/'%s' |/usr/bin/logger -t sa-learn"

Something's fishy with sa-learn? Try the -D (debug) switch. It gives you a whole lot of information in the system log.

Configure the MTA

Now the MTA has nothing more to do than to pipe the mails through spamc and let maildrop do the sorting. You can configure this system-wide or per-user.

Per-user filtering (with courier-mta or qmail)

You need to create a per-user config file in the user's home. With the Courier MTA, it is called .courier; with qmail, it is .qmail. The syntax is the same:

|/usr/bin/spamc|/usr/bin/maildrop

Do not forget to alter the path, if necessary :-)

If you want to create them for all your users, do the following as root:

# cd /home
# for a in *;
> do su $a -c "echo \"|/usr/bin/spamc|/usr/bin/maildrop\" >/home/$a/.courier && chmod 644 /home/$a/.courier";
> done

For qmail, use .qmail instead of .courier. You could also create the file in /etc/skel/.

System-wide filtering (with the Courier MTA)

Edit /etc/courier/courierd and set the DEFAULTDELIVERY variable as follows:

DEFAULTDELIVERY="|/usr/bin/spamc|/usr/bin/maildrop"

System-wide filtering (with qmail)

This section was contributed by Raymond den Ouden.

For this to work you'll have to install safecat: http://budney.homeunix.net:8080/users/budney/linux/software/safecat.html or http://www.ibiblio.org/pub/Linux/utils/file/safecat-1.11.tar.gz.

If you change in /var/qmail/control/defaultdelivery the line |./.maildir/ to either

| /usr/bin/spamassassin -P | maildir ./.maildir/

for direct scanning or

| /usr/bin/spamc | maildir ./.maildir/

for scanning with the spamd daemon, you will have a system-wide qmail spam filter.

Done

So you've got your auto-learning spam filter up and running. Perhaps you are looking into the headers of your mails, curious about what SpamAssassin has done and looking for some sign of the bayes filter. At first, the bayes filter gives no sign. It first has to learn both spam and ham, namely 200 mails each. After that, you will see a new test in the tests list of the X-Spam-Status header: BAYES. If you find that word in the tests list, the bayes filter is finally active and rating your mails.
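
Instead of reading the headers by eye, you can let the shell look for the BAYES test. The following is a minimal sketch: bayes_active is a made-up helper name, and it assumes that the bayes tests show up in the X-Spam-Status header with names starting with BAYES_ (for example BAYES_99). It reads one message from stdin and looks only at the X-Spam-Status header, including its continuation lines.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit status 0) if the X-Spam-Status
# header of the message on stdin lists a BAYES_ test.
bayes_active() {
    awk '
        /^$/ { exit }                          # end of the header block
        /^X-Spam-Status:/ { in_status = 1 }    # header starts here
        /^[^ \t]/ && !/^X-Spam-Status:/ { in_status = 0 }  # next header
        in_status && /BAYES_/ { found = 1 }    # also matches folded lines
        END { exit !found }
    '
}
```

Called as bayes_active <somefile && echo "bayes is rating", where somefile is one message file from a Maildir, it tells you whether that message was already rated by the bayes filter.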

Now there's only one thing left: Drop me a mail if you were successful. I'd like to know if my work also works for other people.

Copyright 2003 by Dave Kliczbor. Feedback in English or German welcome. Original location of this document: http://da.andaka.org/Doku/imapspamfilter.html
Free distribution of this document is allowed.

Part of this document was contributed by Raymond den Ouden (r dot denouden at psychotek dot com).