Spam Processing Setup
Note: this documentation is fairly old, and is slightly out-of-date. For instance, I no longer run my own mail server. However, the general strategy described below is still accurate.
Background
Due to mailing list traffic, I get a lot of mail in a day. I get even more spam than legitimate mail, on the order of hundreds to thousands of unsolicited mails per day. This gets really difficult to manage, especially when you consider that I need to pay attention to at least some of my unsolicited mails, in order to adequately support my Debian and Cedar Solutions users.
Over the course of the past few years, I've come up with a setup that works fairly well for me. It's kind of complicated, but it does a good job of helping me balance the competing goals of minimizing time spent dealing with email while maximizing the chance that legitimate email gets dealt with properly. I doubt anyone will want to copy my setup directly, but knowing how it works might be useful to people looking for ideas.
Summary of my Setup
All of my mail is received on and sent from a Debian GNU/Linux box, specifically from daystrom. Daystrom acts as its own SMTP server (accepting SMTP traffic directly on port 25) but sends most mail through my ISP's SMTP server. (Almost no one accepts mail directly from an SMTP server with a dynamic IP address any more.) Daystrom's MTA (mail transfer agent) is Exim 3, installed through Debian's exim package. I really should move to Exim 4, but I'm lazy and loath to change something that's not broken.
On daystrom, I use mutt for my MUA (mail user agent). I also use gmail, and some mail (discussed more below) is forwarded from daystrom to gmail.
Mutt is a terminal-based mail client. It isn't as "pretty" as other mail clients, but does a better job of meeting my needs than anything else I've ever used. Mutt is extremely flexible and easy to customize.
The core of my spam-processing setup is a filtering engine. I use procmail, because it's well-supported under Linux and is very flexible. I use procmail both to categorize mail and to deliver that mail into appropriate folders. I categorize mail in three ways: based on the sender or source; based on characteristics of the email itself; and based on a spam rating provided by a specialized spam classification engine.
The Greybox Strategy
As I mentioned above, I have designed my spam processing setup to meet two competing goals: to minimize the time I spend dealing with email of any sort, and to maximize the chance that legitimate mail gets dealt with properly and in a timely manner. In order to do this, I rely on what I think of as a "greybox" strategy.
The greybox strategy uses three different kinds of mailboxes: a single primary inbox, which contains emails sent directly to me from trusted senders; a set of mailing list folders, which contain emails related to mailing lists I subscribe to; and a set of greylist folders, which contain emails sent directly to me from untrusted sources. Spam processing is only applied to emails which will be filtered to either mailing list folders or greylist folders.
(Actually, there is also a fourth kind of folder, the spam folder, that holds mail which has been positively identified as spam. However, I never look in that folder unless I think I have lost a message.)
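The routing decision described above can be sketched roughly as follows. This is only an illustration: the function name `route_message`, its three yes/no arguments, and the folder names `INBOX` and `INBOX-list` are hypothetical, and the real setup makes this decision with procmail recipes rather than a shell function.

```shell
# Hypothetical sketch of the greybox routing decision.
# Arguments are yes/no flags: whitelisted, spam verdict, mailing-list mail.
route_message() {
    whitelisted=$1
    is_spam=$2
    is_list=$3
    if [ "$whitelisted" = yes ]; then
        echo "INBOX"            # trusted sender: never classified
    elif [ "$is_spam" = yes ]; then
        echo "SPAM"             # positively identified as spam
    elif [ "$is_list" = yes ]; then
        echo "INBOX-list"       # hypothetical mailing list folder name
    else
        echo "INBOX-GREYLIST"   # direct mail from an untrusted source
    fi
}
```

For example, `route_message no no no` prints `INBOX-GREYLIST`. The whitelist check comes first because trusted mail skips spam classification entirely.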
When I am short on time, I only check the primary inbox. It will contain the emails which are most important to me, usually personal or professional emails from people I know and trust. When I have more time, I periodically check the mailing list folders or the greylist folders. The primary inbox will generally not contain any spam (with a few caveats), while both the mailing list folders and greylist folders will likely contain at least some spam (although hopefully not too much if the spam classification engine is doing its job).
My strategy for what to forward to gmail varies. Right now, anything that goes to my inbox is forwarded to gmail, as well as greylisted items that are marked as ham. The greylisted emails go to a special address (the +greylist address) at gmail, so I still sequester them there.
This strategy is not foolproof (there are lots of fools sending spam), but it has helped me get to the point where dealing with spam is just mildly annoying, not exceedingly painful like it used to be.
Classification based on Sender
My primary method of sender-based classification is the whitelist. Each night, I run a cron job to create a list of trusted sender addresses based on four sources:
- the addresses in my addressbook
- the addresses I have sent mail to recently (based on my sent-mail folder)
- a simple list of other addresses that I want to treat as trusted even if they don't show up in my sent mail or in my addressbook
- and a gmail whitelist compiled from my gmail contacts and some of the addresses I have sent mail to recently from gmail rather than locally from daystrom.
If you want, you can reference the whitelist and gmailwhitelist scripts at http://cedar-solutions.com/software.html .
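The merge step of such a nightly job might look something like the sketch below. It is not the actual whitelist script: the function name `build_whitelist` is hypothetical, and it assumes each source has already been reduced to a plain file with one address per line.

```shell
# Merge several address lists (one address per line) into one whitelist.
# Usage: build_whitelist OUTPUT SOURCE...   (all file names are hypothetical)
build_whitelist() {
    out=$1
    shift
    cat "$@" \
        | tr '[:upper:]' '[:lower:]' \
        | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
        | grep '@' \
        | sort -u > "$out.tmp" && mv "$out.tmp" "$out"
}
```

Writing to a temporary file and renaming it means a mail delivery that reads the whitelist while the job is running never sees a half-written file.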
Emails sent from addresses in my whitelist (based on the From email header) are automatically routed into my primary inbox, and I don't waste any processing time asking the spam classification engine to look at them. These emails are assumed to be legitimate because I know the sender.
I also do some more complicated routing based on sender, for instance routing bank and credit card emails to my wife so that we both get a copy, or routing all mails from a certain domain (say, work) so that I don't have to track people individually, or routing system emails so they don't get lost. This is usually done through a set of one-off procmail rules, and emails matching these rules are effectively whitelisted, although their source addresses might not show up explicitly in a whitelist file.
There are also a few rules in my procmail configuration which effectively constitute a blacklist. For instance, I have become so annoyed with some people who post to Debian mailing lists that I just send all mail from them directly to /dev/null. (My threshold for this is pretty high, though.)
All of this usually works pretty well, with one major caveat: I get at least a few emails per day with forged source addresses that coincidentally match addresses that I trust (even a few per day from "myself"). I haven't yet decided how to deal with these messages, but for the time being I get few enough that it's only annoying.
Classification based on Other Characteristics
Besides filtering on the From email header, I also filter emails based on other characteristics, mostly on the Subject header and on headers used by various mailing list managers or automated email systems.
For instance, I have found that emails with certain subjects are bogus even if they aren't classified as spam. Examples include mail from idiots who can't remember to type a subject ("Subject: Unidentified subject!"); mail from idiots who don't know how to read instructions ("Subject: unsubscribe") and/or spell properly ("Subject: unsusribe"); or mail from idiot virus checking systems who don't care that the idiot who sent them the virus was not really me ("Subject: Virus Warning Message"). These messages are dumped directly to the spam folder and aren't ever passed to the spam classification engine.
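As an illustration of this kind of subject test, here is a small shell sketch. The function name `is_junk_subject` is hypothetical, the pattern list only covers the examples mentioned above, and the real filtering is done with procmail recipes.

```shell
# Return success (0) when a Subject value matches a known-junk pattern.
# The comparison is case-insensitive; patterns come from the examples above.
is_junk_subject() {
    subject=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case $subject in
        "unidentified subject!"|unsubscribe|unsusribe|"virus warning message"*)
            return 0 ;;
        *)
            return 1 ;;
    esac
}
```

Note that the match is against the whole subject, so a legitimate thread titled "Re: unsubscribe policy" is not caught.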
I also filter mails related to the Debian BTS directly into my primary inbox, so I don't ever lose bug tracking information. These emails are identified by certain content in the Resent-Sender, Resent-From, Resent-To and Resent-CC headers. These emails are effectively whitelisted, and as such are never passed to the spam classification engine.
Mailing list traffic is treated differently. I don't filter this mail until after it has been passed to the spam classification engine. Once I know the mail is legitimate, I filter on headers such as Resent-From, Resent-To, List-Id or X-Been-There and route the mail to the various mailing list folders.
Spam Classification
Bogofilter
I use bogofilter for spam classification. Bogofilter is a mail filter that classifies mail as spam or ham (non-spam) by a statistical analysis of the message's header and content (body).
The statistical technique used by bogofilter is known as the Bayesian technique, and was originally described by Paul Graham in his article A Plan for Spam. If you are interested in a summary of how bogofilter works, check out the theory of operation page on the Bogofilter website. Suffice it to say that a Bayesian filter uses a database of known, classified emails and applies statistical techniques to compute a probability that a given email is either spam or ham.
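To give a feel for the arithmetic, the sketch below combines per-word spam probabilities using the simple formula from A Plan for Spam: the product of the probabilities divided by that product plus the product of their complements. The function name `combine_spam_probability` is hypothetical, and bogofilter's own calculation may well differ from this naive version.

```shell
# Combine per-token spam probabilities into one score (naive Graham formula).
# Assumes every argument is strictly between 0 and 1.
combine_spam_probability() {
    printf '%s\n' "$@" | LC_ALL=C awk '
        BEGIN { spam = 1; ham = 1 }
        { spam *= $1; ham *= (1 - $1) }
        END { printf "%.4f\n", spam / (spam + ham) }'
}
```

Two words that each look 90% spammy combine to a much stronger verdict: `combine_spam_probability 0.9 0.9` prints `0.9878`.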
There are a number of good open source Bayesian filters. I chose bogofilter because it has minimal dependencies and is written in C with an eye toward performance. Although my experience with Bogofilter is somewhat limited, I have found no reason yet to consider looking at other filters.
Training
Before using Bogofilter, you must "train" it. Initial training is done by providing Bogofilter with a corpus of emails with a known classification - either ham or spam - which bogofilter then uses as the basis for its future classification efforts. Once initial training is done, you typically continue to train the filter on an ongoing basis, by telling it when you find messages which have been misclassified.
I trained bogofilter using all of my saved off mail (including sent mail), as well as a corpus of spam that I had been saving off literally for years.
Before actually handing off the mail to Bogofilter, I had to clean it up to remove content (particularly mail headers) which might confuse bogofilter. For instance, I did not want bogofilter to analyze existing spam processing or whitelist headers, just the actual content of the mail. (See the strip-special script.) Once I was done stripping headers out of all of my saved-off mail, I combined it all into two huge (40-50 MB each) folders named ham and spam.
Once I had created my ham and spam folders, I used the script bogominitrain.pl (included with the Bogofilter distribution) to do the initial training. I ran something like this:
bogominitrain.pl -fnv ~/.bogofilter ham.mbox spam.mbox '-o 0.9,0.3'
This script enters a loop and continues training and re-training bogofilter until it classifies each and every sample email correctly. The process takes a while, but the result is quite good.
Applying the filter
As I mentioned above, I only pass certain messages to Bogofilter for classification. This is actually a holdover from when I was running my mail system on really slow (P-90) hardware which could barely keep up, but it's still a good idea. Why bother adding load to the system when it's not needed?
Basically, I ask Bogofilter to classify any email which is not either whitelisted or blacklisted. When bogofilter classifies an email, it modifies the email and adds a special header, the X-Bogosity header. I use a procmail rule to filter on the value of this header. If the header's value is either "Yes" or "Spam", the email goes directly to the spam folder. Otherwise, the email falls through to other rules and is placed into some other folder.
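The same header test can be sketched outside procmail. The function name `has_spam_verdict` is hypothetical; the sketch reads one message on standard input and looks only at the header block, so a forged X-Bogosity line in the body is ignored.

```shell
# Read one message on stdin; succeed if its X-Bogosity header says spam.
# sed stops at the first blank line, which ends the header block.
has_spam_verdict() {
    sed '/^$/q' \
        | grep -Ei '^X-Bogosity: (Yes|Spam)' >/dev/null
}
```

It could be used as `has_spam_verdict < message && echo "goes to the spam folder"`, where `message` is a file holding a single email.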
I continue to train bogofilter on pretty much any mail which ends up in a greylist folder. Any mail in one of these folders is either legitimate (in which case I positively identify it as ham and put it into my primary inbox) or is not (in which case I positively identify it as spam and save it in a spam folder in case I might need it later). I do this using a few mutt macros which you can find below.
Integrating it all
Summary
I can't possibly give you every piece of my setup. In part, this is because I don't want everyone in the world to see my filtering rules, and in part, it's also because not everything I've written is really suitable for public distribution. However, I will try to provide some examples that you can work from.
Useful mutt keybindings
These are the bogofilter-specific keybindings I use in mutt. Using these keybindings, CTRL-h can be used to classify a mail as ham and CTRL-p can be used to classify a mail as spam.
# Bogofilter bindings for use with Bayesian filter
macro index \Ch "|/usr/bin/bogofilter -n -v" # classify as Ham
macro pager \Ch "|/usr/bin/bogofilter -n -v" # classify as Ham
macro index \Cp "|/usr/bin/bogofilter -s -v" # classify as sPam
macro pager \Cp "|/usr/bin/bogofilter -s -v" # classify as sPam
These are very simplistic key bindings. After bogofilter runs, you'll have to "Press enter to continue". Also, you'll have to save off the spam to some other folder on your own, etc.
I've lately modified the bindings to look like this. Note that each binding should be on a single line, although I've formatted them for readability here in the wiki.
# Bogofilter bindings to tag a message as Ham
macro index \Ch "<enter-command>unset wait_key\n
<pipe-entry>/usr/bin/bogofilter -n -v\n
<enter-command>set wait_key\n"
macro pager \Ch "<enter-command>unset wait_key\n
<pipe-entry>/usr/bin/bogofilter -n -v\n
<enter-command>set wait_key\n"
# Bogofilter bindings to tag a message as sPam and save to the =SPAM folder
macro index \Cp "<enter-command>set auto_tag\n
<enter-command>unset wait_key\n
<pipe-entry>/usr/bin/bogofilter -s -v\n
<enter-command>set wait_key\n
<save-message>=SPAM\n
<enter-command>unset auto_tag\n"
macro pager \Cp "<enter-command>unset wait_key\n
<pipe-entry>/usr/bin/bogofilter -s -v\n
<enter-command>set wait_key\n
<save-message>=SPAM\n"
The ham macros are basically equivalent to the original ones above, except they eliminate the "press any key" annoyance. The spam macros are different, however. They also eliminate the "press any key" annoyance, and then automatically save the message to the =SPAM folder.
The index spam macro is slightly different from the pager spam macro. This is because auto_tag must be enabled before calling <save-message> so that the command gets applied to all tagged messages (if there are any). The index macro therefore sets auto_tag first and unsets it again afterwards.
General procmail notes
Remember that procmail recipes are executed from top-to-bottom within your .procmailrc file. For instance, it would do no good to filter "unsubscribe" mailing list emails into =SPAM at the bottom of the rules, since they would have already been filtered into their respective mailing list folders earlier up in the rules.
My procmail rules file contains the following sections:
- Environment setup
- Remove special headers
- Filter special emails
- Filter job-related emails
- Throw away bad messages
- Identify whitelisted emails
- Identify spam
- Deliver to spam folder
- Deliver to mailing list folder(s)
- Deliver to greylist folder(s)
Also, keep in mind a few things about procmail recipes. First, note that "0c" (versus just "0") says to continue on after executing the rule, and "0B" (versus just "0") says to grep the body, not the header. Second, note that the forward rules don't have a terminating ":" (they are ":0c" rather than ":0c:"). This is so procmail doesn't use a lockfile for those rules. There's no need for a lockfile when invoking sendmail.
If you need more information about how to use procmail, I suggest checking out the procmail website. What I have put here can't substitute for an understanding of how procmail works.
Removing spam-processing headers
One of the first things I do in my procmail rules is to remove existing spam headers from each mail I receive. This way, later rules can't get confused (and spammers can't forge something and get past my defenses).
:0 fwh
* ^X-Whitelist
| formail -I"X-Whitelist"

:0 fwh
* ^X-Bogosity
| formail -I"X-Bogosity"

:0 fwh
* ^X-Spambayes-Classification
| formail -I"X-Spambayes-Classification"
Filtering mails based on email headers
Here are some examples of filtering rules based on arbitrary email headers. These examples are for Debian-related emails.
# Always put Debian bug reports in inbox
:0:
* Resent-Sender: Debian BTS
$DEFAULT

# Sometimes, things sent to XXXXXX-submitter@b.d.o get lost.
# This seems to be in them.
:0:
* Resent-From: Debian BTS
$DEFAULT

# Filter Debian build-related emails
:0:
* ^From:.*installer@ftp-master.debian.org
INBOX-debian-build

# Filter Debian build-related emails
:0:
* ^From:.*katie@ftp-master.debian.org
INBOX-debian-build
Sending automatic carbon copies
Here's an example of how to send automatic carbon copies to the wife and also whitelist the mail into your primary inbox.
:0c
* From:.*bank.*
! wife@mydomain.com

:0:
* From:.*bank.*
$DEFAULT
Identifying whitelisted emails
This is the rule I use to identify whitelisted emails. Remember, the .whitelist file is just a list of email addresses, one per line.
:0
* ? formail -x"From" -x"From:" -x"Sender:" -x"Reply-To:" -x"Return-Path:" \
| sed 's/^.*<//' \
| sed 's/>.*$//' \
| sed 's/^ *//' \
| sed 's/ .*$//' \
| egrep -is -f ${HOME}/.whitelist
{
:0 fwh
| formail -a"X-Whitelist: Yes"
}
The formail line first extracts all of the interesting fields out of the mail, and then tries to format them such that they can be used to look for matches in the whitelist file. A typical result might be something like this:
pronovic@cedar-solutions.com Sat Jan 14 15:20:55 2006
Kenneth Pronovici <kenneth.pronovici@cedar-solutions.com>
Kenneth Pronovici <pronovic@ieee.org>
Kenneth Pronovici <pronovic@cedar-solutions.com>
The first line is the "From" header, and the others are the "From:", "Reply-To:" and other headers. After the various sed statements, what gets passed to the egrep command is something like this:
pronovic@cedar-solutions.com
kenneth.pronovici@cedar-solutions.com
pronovic@ieee.org
pronovic@cedar-solutions.com
The egrep command exits with a success status (0) if any of these addresses is found in the whitelist file, which is what we want.
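The address clean-up can be tried on its own, without procmail or formail, using a sketch like the one below. The function names `extract_addresses` and `in_whitelist` are hypothetical. The sed expressions are the same ones used in the rule above; the match step is a variation that uses `grep -F -x` so each whitelist entry is compared literally against a whole address (with egrep, a "." in an entry matches any character, and an entry can match part of a longer address).

```shell
# Reduce formail-style header values to bare addresses, one per line.
extract_addresses() {
    sed -e 's/^.*<//' -e 's/>.*$//' -e 's/^ *//' -e 's/ .*$//'
}

# Succeed if any address on stdin appears in the whitelist file ($1).
# -i ignores case, -F treats entries literally, -x requires a whole-line match.
in_whitelist() {
    extract_addresses | grep -i -F -x -f "$1" >/dev/null
}
```

For example, piping the line ` Kenneth Pronovici <pronovic@ieee.org>` through `extract_addresses` prints `pronovic@ieee.org`.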
Identifying spam
This is the rule I use to identify spam using Bogofilter. Note that I don't bother to run Bogofilter against any mail which has been whitelisted.
:0fw
* !^X-Whitelist: Yes
| bogofilter -e -p
:0e
{
EXITCODE=$?
}
Filtering to spam folder
These are the rules I use to classify spam based on Bogofilter's header. Normally, this would be done after running Bogofilter but before routing mail to various mailing list folders.
:0:
* ^X-Bogosity: Yes, tests=bogofilter
SPAM

:0:
* ^X-Bogosity: Spam, tests=bogofilter
SPAM
Filtering to greylist folders
Finally, these are the rules I use to classify mail as "legitimate" or "suspect" based on the whitelist header. Anything which does not have the whitelist header will be considered "suspect" and will be placed in a greylist folder. Anything else will be considered "legitimate" and will be placed into my primary inbox (this happens by "falling off" the end of .procmailrc).
The example below shows more than one greylist inbox, one for emails to cedar-solutions.com and another for all other emails. I find this useful because it helps me mentally prioritize and sort the garbage that does get through (I actually do it for all of my different email addresses).
:0:
* !^X-Whitelist: Yes
* ^To:.*cedar-solutions\.com.*
INBOX-GREYLIST-CEDAR

:0:
* !^X-Whitelist: Yes
INBOX-GREYLIST