Procmail/GMail-based spam filtering

By Ben Okopnik

In the past month or so, I've had a chance to develop and polish a simple but highly effective anti-spam system. My requirements, and my reason for developing it, form a rather narrow and specific niche - my network connectivity situation is quite unusual, rather similar to what a lot of road warriors encounter - but the solution is nicely generalized and usable by anyone with a GMail account and running Linux. It's very fast and not at all CPU-intensive (unlike most anti-spam solutions), and so far, it has an excellent track record for accuracy (zero false positives, very small number of false negatives once past the initial test cycle.) At this point, it looks stable enough that I feel like sharing it with the Linux community is worthwhile; it also has enough flexibility for any experimentation that you may want to do.

First Steps

Some time ago, I posted a question on the Linux Gazette's Answer Gang list; I was looking for a solution to my somewhat unusual situation which had left me stumped for a number of years. In essence, it came down to this: given that I travel and move around a lot, and thus have unpredictable and often fragile and/or slow connectivity, how do I filter spam effectively?

The general gist of that conversation came down to exactly what I had learned and expected over the years of struggling with this topic:

The Base State

I had been using SpamAssassin for several years - but in the last couple of months, the frequency of spam mails that it let through became intolerable, despite the best tuning I could do. In addition, it filtered out a number of valid emails - i.e., false positives - which was a much worse problem, with a much greater hassle attached to it: every few days, I had to do a visual scan of my spam inbox hoping to spot valid emails before my eyes glazed over from zooming over thousands of messages (and I'm convinced that I've lost at least a few due to those factors.) All of this added up to a simple imperative: I had to either change my spam-filtering approach or resign myself to my email becoming progressively less useful and less reliable. The latter was not an option, since most of my business is either done via, or at least partially involves, email.

The Makeover

Initially, I started experimenting with a challenge-response system. The basic premise of such a system is that it depends on two lists containing email addresses - a whitelist (i.e., all emails from that address are accepted) and a blacklist (all emails from that address are discarded.) Anything in between gets tagged and held, while a one-time confirmation message is sent back to the sender's address; if they reply, that address gets added to the whitelist and their message is released.

This was an OK solution - but I was unhappy about the additional load that it generated, both in the number of extra emails and in the demand on the sender's time. The latter, by the way, is usually given as the standard reason for not implementing C-R more widely: it is ostensibly "offensive" to people to get and answer confirmation messages. The standard scenario portrays the outraged receiver deleting the confirmation email in a huff (or perhaps printing it out, throwing it on the ground, and jumping up and down on it in a rage until it's all shredded or they have a stroke due to apoplexy.) Personally, I strongly disagree with that so-called "reasoning" and find it offensive: anyone who does not consider communicating with me to be of enough value to hit 'Reply' once is unwelcome in my mailbox in the first place. However, as a personal preference, I dislike adding to anyone's workload - no matter how minuscule the addition - without a good reason, and if it can be avoided at no cost to me, and with no reduction of functionality in my spam filtering, I'll be happy to avoid it.

In that light, one of the responses from The Answer Gang really piqued my interest: Steve Brown's idea of using GMail as an external filter (thanks, Steve!) I decided to combine the best features of C-R and external filtering to create my ultimate solution, which eliminated the response requirement. Although it took quite a bit of experimenting initially, the results have been excellent.

For comparison purposes, here's how the new system stacked up against my tweaked-to-the-max SpamAssassin system. It may be relevant to note that, as the Editor-in-Chief of LG, I'm in a rather exposed position, spam-wise: my email address is out there in hundreds of thousands of places, and I usually make no attempt to disguise it. As a result, I get ~1000 messages per day, with 98-99% of those being spam. Pretty ugly... but on the other hand, it makes for a great test bed: either my solutions work really well, or they fail abysmally. That's a test environment that's really meaningful!


                                 False positives                  False negatives
                                 (real emails treated as spam)    (spam emails treated as real)

Procmail/GMail                   4/0                              22/6
(first week/subsequent weeks)

SpamAssassin                     1-2                              36-70
(recent weekly averages)

Again, this system has only been in operation for a little over a month - but the results, once I was done tuning it, have been rock-stable. For myself, I'm pretty excited about it: the countless hours that I've spent tuning and retuning SpamAssassin and looking through the spam bucket to see if it mis-identified something are now just a bad memory. I still check the "all incoming mail" list from my current system once in a while (more and more rarely as time goes on), just to confirm that I'm not tossing any valid emails - but given the mechanism that's in use, I feel pretty secure about it not discarding any email without me having explicitly asked it to do so. That's a very, very good feeling.

Setting it all up

The initial part of setting up the system, whether C-R or otherwise, consists of creating all the relevant files - primarily, the whitelist and the blacklist. In the configuration section of the .procmailrc file that I'll make available at the end of this article, you can call them whatever you like; for myself, I used '~/.mail-accept-list' and '~/.mail-deny-list', respectively. I also created a list of symlinks to all the relevant files so I could look at them easily right from my home directory:

MAIL-ACCEPT-LIST	->	~/.mail-accept-list
MAIL-DENY-LIST		->	~/.mail-deny-list
MAIL_PROCMAILRC		->	~/.procmailrc
MAIL_PROCMAIL_LOG	->	/var/log/procmail
MAIL_SAVE_ALL		->	~/.mail_save_all

The names are, I hope, obvious indicators of the function of each file. If you're not familiar with "procmail", it is a very powerful and commonly-used email processor written by Stephen R. van den Berg. It uses '~/.procmailrc' as its configuration file; this is composed of "recipes" that determine how to process mail. My system is constructed of those recipes, plus a few external files and system utilities.

Before we go on to that, though, we'll need to populate the whitelist and the blacklist. If, like me, you've been saving your email - and I've got more than 20 years of mail archives - that's not too hard; all we need to do for the initial whitelist is extract the addresses of anyone who has ever written to me as well as those to whom I've written. (Yes, it's possible that some of those will need to be blacklisted later - but that's so simple that it's not worth worrying about.) I used a combination of shell scripting, "formail", and Perl to do the extraction [1]. Since I've learned over the years that various mail clients do some really ugly things to mail headers, I use extreme caution and circumspection in processing them; in most cases, this means a "belt-and-suspenders" sort of an approach. In this case, I'm using "formail" to concatenate ('-c') continued fields in the header and split ('-s') the mboxes into individual emails, and Perl to extract either the 'From:' address (preferred) or, failing that, the 'Return-Path:' address.

#!/bin/bash
# Created by Ben Okopnik on Mon Jun 28 15:31:08 EDT 2010

# 'cd' to your mail directory; give up if it isn't there
cd ~/Mail || exit 1

for file in *
do
	# Ignore all directories and the "Sent_mail" file (we'll process that later)
	[ "$file" == "Sent_mail" -o -d "$file" ] && continue
	echo "Processing '$file'"
	# Split the mbox into messages; print one sender address per message
	formail -cs \
		perl -wlne'$f=$1 if /^(?:return-path|from):.*?([\w\.=\-]+@[\w\.=\-]+\w+)/i;print $f and last if /^$/' \
		< "$file" >> /tmp/whitelist
done

# Process the mail that I've sent; this time, we'll extract the 'To:' headers
echo "Processing the 'Sent_mail' file"
formail -cs \
	perl -wlne'print $1 and last if /^To:.*?([\w\.=\-]+@[\w\.=\-]+\w+)/i' \
	< Sent_mail >> /tmp/whitelist

# Sort the result in place and drop duplicate addresses
sort -u /tmp/whitelist -o /tmp/whitelist

So there it is; a list of all my "validated" email addresses collected into a single file (/tmp/whitelist). Note the last line: this produces a list of sorted addresses with no repeats. Not all that complex, right?
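Before trusting the generated file, it may be worth a quick look for lines that don't resemble an address at all. The following is only a minimal sketch: the function name 'suspicious_entries' is hypothetical, and the pattern is a deliberately loose assumption about what a "bare" address looks like, not a full address validator.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): print every line
# of a list file that does not look like a bare "user@host.tld" address,
# so that it can be reviewed by hand.
# $1 = list file
suspicious_entries() {
	grep -vE '^[[:alnum:]_.=-]+@[[:alnum:]_.=-]+\.[[:alpha:]]+$' "$1"
}

# Usage:
#   suspicious_entries /tmp/whitelist
```

Anything that this prints is a candidate for manual cleanup before the file is copied into place as the whitelist.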

The blacklist is even less complicated. Since we're going to stamp all our outgoing email with a special header that identifies it as really being from us, the first thing we'll put into the blacklist is... all our valid email addresses. No fooling. Seems a bit counterintuitive, but that's exactly what we need to do - because spammers very often send their stuff with it being marked as coming from the same address they're sending it to. This approach gets rid of that very large category, painlessly and safely. You'll see precisely how this works as we go through the .procmailrc file.
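Seeding the blacklist can be as simple as appending your own addresses to the file. Here is a small sketch of that step; the function name 'seed_deny_list' is hypothetical, and the guard against being called with no addresses is an assumption added so that no empty line ever ends up in the list.

```shell
#!/bin/bash
# Hypothetical helper: add each given address to a list file, keeping the
# file sorted and free of duplicates.
# $1 = list file, remaining arguments = addresses to add
seed_deny_list() {
	local list=$1
	shift
	# With no addresses given, do nothing (avoids writing an empty line)
	[ $# -gt 0 ] || return 0
	touch "$list"
	printf '%s\n' "$@" >> "$list"
	sort -u "$list" -o "$list"
}

# Usage (substitute your own addresses):
#   seed_deny_list ~/.mail-deny-list ben@okopnik.com okopnik@gmail.com
```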

Next, let's take a look at the .procmailrc file itself. Mine has a few things in it besides the anti-spam system, so I'll highlight just the bits that we're discussing. Let's take a look (ignore the line numbers; they're not part of the code, and are there just so I can refer to a given line):

001	PATH
002	SHELL=/bin/sh
003	MAILDIR=/var/spool/mail
004	DEFAULT=$LOGNAME
005	LOGFILE=/var/log/procmail
006	# VERBOSE=on
007	
008	# This gives you the 'From:' address if it's available, or the 'Return-Path:' address otherwise.
009	:0 hw
010	FROM=|/usr/bin/perl -wlne'$f=$1 if /^(?:return-path|from):.*?([\w\.=\-]+@[\w\.=\-]+\w+)/i;print $f and last if /^$/'

The first six lines just set up the procmail variables. The only bits to note are that you may not necessarily want your procmail logfile to be in /var/log (in fact, you'd need root permissions to set that up); also, 'VERBOSE=on' is currently commented out but still there in case you want to enable it for troubleshooting. When enabled, it produces a lot of output in the logfile, and can be very useful. Line 10 is, of course, the sender address extractor that we used to such good effect earlier.

Now, let's jump right to the spam filter:

011	#************* GMAIL-BASED ANTI-SPAM SYSTEM **************
012	#
013	# Customize all these constants as necessary:
014	MY_EMAIL=ben@okopnik.com
015	MY_GMAIL=okopnik@gmail.com
016	# Spam-Kill stamp; use some unique string without spaces
017	SPAM_KILL=74d04eab1341a01117de96f2
018	# "Secret word" for email control messages
019	SECRET=Funky
020	
021	FORMAIL=/usr/bin/formail
022	GREP=/bin/grep
023	SENDMAIL=/home/ben/bin/bssmtp
024	
025	DB=$HOME/.mail-accept-list
026	DENY_DB=$HOME/.mail-deny-list
027	NOTIFY=$HOME/Mail/000-notify
028	NDNS=$HOME/Mail/000-ndns
029	TRASH=/dev/null
030	SAVE_ALL=$HOME/.mail_save_all
031

This is the configuration section - pretty straightforward stuff. You'll need to put in your email address and your GMail address; you'll also need to come up with a couple of unique strings (don't worry; these aren't the real ones that I use. :) You could, of course, use the same string - but $SECRET should be something that's easy to type out on, say, your Blackberry whenever you want to validate someone on the spot (we'll see how this works in a moment.)
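If you need a unique string for $SPAM_KILL, one possible way to produce one is to read a few random bytes and print them as hexadecimal. This is just a sketch under the assumption that /dev/urandom, 'od', and 'tr' are available (they are on any ordinary Linux system); the function name 'make_stamp' is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print 24 random hexadecimal characters
# (12 bytes read from /dev/urandom), with no spaces in the result.
make_stamp() {
	head -c 12 /dev/urandom | od -An -tx1 | tr -d ' \n'
	echo
}

# Usage:
#   make_stamp
```

Paste the result into both the SPAM_KILL line of your .procmailrc and whatever adds the header in your mail client, since the two must match exactly.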

$DB is your whitelist; $DENY_DB is the blacklist. $NOTIFY - assuming you want to set that up - is mail that you regularly receive (say, monthly notifications from your listbots) but don't want to read; archiving is good enough. $NDNS are Non-Delivery Notifications; for now, I'm collecting those, looking through them monthly, and then tossing them. In another month or so, I'll just trash them, but for now, I'm still in a testing phase. $SAVE_ALL is another testing phase sort of thing: it saves all received email, just so I can go over it and check that everything is getting filtered correctly. Sooner or later, it too is going to disappear.

033	# Immediately deliver anything containing my verification string (the
034	# header is added to all outgoing email via my .muttrc). You should now add
035	# all your email addresses to the blacklist, since anything "from you" that
036	# fails this test is spam.
037	:0:
038	* $ X-Spam-Kill: $SPAM_KILL
039	${DEFAULT}

This is the gadget that delivers all the real email that comes from us; since I use "mutt" for my email client, I simply set it up to add a header with that stamp - i.e., 'X-Spam-Kill: ' followed by my $SPAM_KILL string. This bypasses pretty much all the tests and goes right into my inbox.

041	# This should be either empty, or a regex that matches any addresses from
042	# which you get lots of mail that you want to archive but not read:
043	BOTS=(mailman-owner@list1.com|mailman-owner@list2.com)

Right - this is what we'll be archiving without reading.

045	# This should be a regex that matches all domains from which you know you
046	# won't get spammed:
047	KNOWN_DOMAINS=(safedomain1.com|safedomain2.com|safedomain3.com)$
048	
049	# This should be either empty, or a regex that matches the To: headers of
050	# any mailing lists you're on:
051	LISTS=(list1@lists.net|lists2@lists.net|list@yahoo.com|list@lists.mail.org)

Another rather obvious one. If you use Mutt, like I do, simply copy your 'lists' line here and modify it so that it becomes a valid regular expression, like the above.

All right, here comes the meat of the "program" itself:

053	####################################################################
054	# Don't change anything below unless you know why you're doing it! #
055	####################################################################
056	
057	:0 c
058	$SAVE_ALL

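This recipe saves a copy of every incoming message into the file we defined earlier; the 'c' flag means that the original carries on through the rest of the recipes.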

060	# You can email yourself to whitelist an address; note use of "secret word" in
061	# subject
062	:0
063	* $ ^Subject: ${SECRET}-approve \/.*
064	* ? echo $MATCH >> $DB
065	${TRASH}
066	
067	# You can email yourself to blacklist an address; note use of "secret word" in
068	# subject
069	:0
070	* $ ^Subject: ${SECRET}-deny \/.*
071	* ? sed -i '/^'"$MATCH"'$/d' $DB
072	* ? echo $MATCH >> $DENY_DB
073	${TRASH}

These two recipes allow you to whitelist or blacklist an address by mail: just send yourself an email whose subject is the secret word that you defined above, followed by a dash and either the word 'approve' or 'deny', then a space and the email address in question (e.g., 'Funky-deny joe@example.com' with the sample $SECRET shown earlier.) Nice little feature - not that I use it much.
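To try out the effect of these two recipes on the list files without sending any mail, their side effects can be imitated from the command line. This is a sketch, not the recipes themselves: the function names are hypothetical, and the removal step swaps in 'grep -vxF' (exact, literal match) for the 'sed -i' used above, so that dots in an address are not treated as regex wildcards.

```shell
#!/bin/bash
# Hypothetical stand-ins for the side effects of the two recipes.

# $1 = address, $2 = whitelist file
approve_address() {
	echo "$1" >> "$2"
}

# $1 = address, $2 = whitelist file, $3 = blacklist file
deny_address() {
	local tmp
	tmp=$(mktemp) || return 1
	touch "$2"
	# Keep every whitelist line except exact, literal matches of the
	# address; grep exits non-zero if nothing is left, which is fine here
	grep -vxF -e "$1" "$2" > "$tmp" || true
	cat "$tmp" > "$2"
	rm -f "$tmp"
	echo "$1" >> "$3"
}

# Usage:
#   deny_address joe@example.com ~/.mail-accept-list ~/.mail-deny-list
```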

075	# If message is from a blacklisted sender, dump it
076	:0 h
077	# * ? $GREP -i ^$FROM $DENY_DB
078	* ? echo $FROM|$GREP -f $DENY_DB
079	${TRASH}

Other than the "whitelist/blacklist by email" functionality, note this recipe that takes precedence over everything else: if someone is blacklisted, they're gone. Doesn't matter if they're on a whitelisted mailing list that you're subscribed to or anything else; once they earn a place in that file, you'll never see them again.

Incidentally, note the commented-out line (#77): originally, I used the email address as the "grep" search string and the file as the source, and if the string was found in the file, then that was the end of it. However, I discovered that there were times when I wanted to block an entire domain, or use a regular expression to define exactly what I wanted to block - but this was not possible with that recipe! After that, I changed my approach to the one on line #78: I pipe the address into "grep" and use the content of $DENY_DB as the list of regular expressions to check against that string. This allows me to put in, e.g., '@spammer.org' and block that whole domain, or 'joe_slick' and block all addresses containing that string. Do be careful, though: every line of that file is a pattern, so if you accidentally add something that matches any address - an empty line, for instance - you'll throw away all email!
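The pattern-file approach is easy to try by hand before trusting it with real mail. The sketch below mirrors the test on line #78 in a shell function and adds a check for empty lines, since an empty pattern given to "grep -f" matches every input line. Both function names are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper mirroring the recipe's test: succeed if the address
# matches any of the regular expressions listed in the deny file.
# $1 = sender address, $2 = deny file
is_denied() {
	echo "$1" | grep -q -f "$2"
}

# Hypothetical safety check: succeed if the deny file contains an empty
# line, i.e., a pattern that would match (and discard) everything.
# $1 = deny file
has_blank_pattern() {
	grep -q '^$' "$1"
}

# Usage:
#   is_denied someone@spammer.org ~/.mail-deny-list && echo "would be trashed"
#   has_blank_pattern ~/.mail-deny-list && echo "WARNING: empty line in deny list"
```

Running the second check after every manual edit of the deny list is a cheap way to catch the "throw away everything" mistake early.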

081	# If message is from a bot, archive it
082	:0
083	* BOTS ?? (.)
084	* $ FROM ?? $BOTS
085	${NOTIFY}
086	
087	# If message is a Non-Delivery Notification, archive it
088	:0
089	* MAILER-DAEMON
090	${NDNS}
091	
092	# If message is from a known domain, deliver it
093	:0
094	* KNOWN_DOMAINS ?? (.)
095	* $ FROM ?? $KNOWN_DOMAINS
096	${DEFAULT}
097	
098	# If message is to a list we're on, deliver it
099	:0
100	* LISTS ?? (.)
101	* $ ^TO_$LISTS
102	${DEFAULT}

No surprises there, hopefully; we just distribute the mail to the boxes that we defined according to the rules that we set up for them.

104	# If the message has the "been-filtered-by-Google" stamp, deliver it.
105	# This clause implies that we trust Gmail, but not so much that we'll
106	# auto-whitelist anybody that it passes. If you want to do that as well,
107	# just uncomment the 'echo $FROM' line.
108	:0
109	* $ ^X-Gstamp: $SPAM_KILL
110	# * ? echo $FROM >> $DB
111	${DEFAULT}

As the comment says, this is for all emails that have been validated by GMail. Anything with the 'X-Gstamp:' header (which we add in the next recipe) simply gets delivered.

113	# If sender isn't in the DB, add an X-Gstamp and forward it to GMail for filtering
114	:0 f
115	* $ ! ^X-Loop: $MY_EMAIL
116	* ! ? $GREP -i ^$FROM $DB
117	|$FORMAIL -A"X-Gstamp: $SPAM_KILL"
118	
119	:0 A
120	! $MY_GMAIL
121	
122	#********** END OF GMAIL-BASED ANTI-SPAM SYSTEM **********

If an email has made it through all of the above recipes without being dumped or delivered, then we don't know what it is (ham or spam) - so we'll let GMail decide for us. In theory, this minimizes our privacy exposure: only mail from unknown senders ever goes to GMail, and we should have already whitelisted the people who are likely to send us anything private or important. Best of all worlds!
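The order of precedence among the main recipes can be summarized in a few lines of shell. This is a simplified, hypothetical model for reasoning about the flow, not a replacement for the recipes: it leaves out the bot, NDN, known-domain, and list rules as well as the control messages, and it assumes that a whitelisted sender who falls off the end of the filter section simply ends up in the inbox. The function name 'classify' and its yes/no arguments are inventions for illustration.

```shell
#!/bin/bash
# Simplified, hypothetical model of the filter's order of precedence.
# $1 = sender address
# $2 = "yes" if the message carries our X-Spam-Kill stamp
# $3 = "yes" if the message carries our X-Gstamp (it came back from GMail)
# $4 = whitelist file, $5 = blacklist file
classify() {
	if [ "$2" = yes ]; then
		echo deliver            # really from us
	elif echo "$1" | grep -q -f "$5"; then
		echo trash              # blacklisted sender
	elif [ "$3" = yes ]; then
		echo deliver            # already filtered by GMail
	elif grep -qi "^$1" "$4"; then
		echo deliver            # whitelisted sender, never sent to GMail
	else
		echo forward-to-gmail   # unknown sender: stamp it and forward it
	fi
}

# Usage:
#   classify stranger@example.net no no ~/.mail-accept-list ~/.mail-deny-list
```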

Again, the average .procmailrc file will have other things in it - perhaps header fixups for friends with seriously broken email clients, or logic to decide which listmail should go into which mailboxes. If you know how to write procmail recipes, this is all still usable: filters (such as the header fixups) would go just below the procmail variable definitions (say, just below line 10), and list distribution recipes might replace the simple "list delivery" recipe (98-102). If you don't know how, it's relatively simple - and the documentation that comes with procmail is excellent and detailed (see 'man procmail', 'man procmailrc', and 'man procmailex' for lots of good examples and explanations.)

Usage notes

I use "fetchmail" for mail retrieval, so setting that up was pretty trivial: I just grab the mail from my mailhost and from GMail via POP (the latter requires changing the settings at GMail, which is pretty simple.) Since I use Mutt as my mail client, I've added a convenient shortcut to it which allows me to blacklist spam instantly; in fact, it replaced the "spam, not ham" shortcut that I had been using for SpamAssassin. Here are the necessary entries in ~/.muttrc, in case you happen to be using Mutt yourself:

macro index \cb |"/home/ben/bin/blacklist^M"
macro pager \cb |"/home/ben/bin/blacklist^M"

So, if I ever do run across a spam that managed to make it through GMail, all I have to do is hit 'Ctrl-B' - and that address is gone forever. The script that it invokes is a pretty simple one:

#!/bin/bash
# Created by Ben Okopnik on Tue May 11 23:32:58 EDT 2010

FROM=$(perl -wlne'print $1 and last if /^From:\s*.*?([\w\.\-]+@[\w\.\-]+\w+)/')
if [ -n "$FROM" ]
then
	sed -i '/^'"$FROM"'$/d' ~/.mail-accept-list
	echo "$FROM" >> ~/.mail-deny-list
fi

Note that if that entry exists in the whitelist, it'll be removed from there. Oh, one more thing for .muttrc: there's also the 'X-Spam-Kill:' header that marks the email as actually coming from me.

# No, this is still not my real X-Spam-Kill string. :)
send-hook ~A 'my_hdr X-Spam-Kill: 74d04eab1341a01117de96f2'

Wrap-up

Taken all together, this forms an easy to use, effective spam killer; I've recovered a number of hours that I used to waste in dealing with spam, and have reduced the wear-and-tear on my nerves caused by finding the occasional business email in my spambox. All in all, I'm really glad that I've spent the time developing and implementing this system.

Feel free to download my .procmailrc file and experiment. I've got to say that I'm pretty excited about this whole system: previously, while retrieving email in the morning, I used to watch my poor little netbook bogging down as SpamAssassin overloaded its tiny brain. In addition, processing even a hundred emails took at least five minutes. Now, when I try to watch my mail log via 'tail -f /var/log/mail.info', the emails fly through the processing so fast that I'd have to be a speed reader to catch them all. The major delay factor in retrieving them is simply the bandwidth/latency of whatever connection I happen to have.

In the near future, once I'm completely satisfied with all the testing, I'm going to try moving this setup off my local system and onto my mail server - given its nature, it's certainly flexible enough to work that way. This will mean using the whitelist/blacklist-by-mail feature and adapting the "blacklist" script to work over the network, or perhaps simply synchronizing the local and the remote lists via a cronjob - but it will also mean much less traffic between my local machine and that mailhost, since all the blacklisted mail will get dumped without me ever downloading it. The GMail-bound traffic will also be sent off from there, meaning that my system will never have to do that round-trip transaction either, so the only thing I'll see is whitelist-validated and GMail-filtered stuff - perhaps a 100-to-1 reduction in volume. I'm really looking forward to that.

Overall, this experiment has made large, positive, time-saving changes in my life; a huge improvement over my previous spam-handling method. Hurrah for Linux and the ability to tweak, play, and experiment!


[1] I could have done this with Perl alone, but I have an additional purpose here: the Perl one-liner that I used is also a nice tool that we can re-use in our .procmailrc - we definitely need to extract the address from each email, right? - so we might as well start using it here.


Ben is the Editor-in-Chief for Linux Gazette and a member of The Answer Gang.

Ben was born in Moscow, Russia in 1962. He became interested in electricity at the tender age of six, promptly demonstrated it by sticking a fork into a socket and starting a fire, and has been falling down technological mineshafts ever since. He has been working with computers since the Elder Days, when they had to be built by soldering parts onto printed circuit boards and programs had to fit into 4k of memory (the recurring nightmares have almost faded, actually.)

His subsequent experiences include creating software in more than two dozen languages, network and database maintenance during the approach of a hurricane, writing articles for publications ranging from sailing magazines to technological journals, and teaching on a variety of topics ranging from Soviet weaponry and IBM hardware repair to Solaris and Linux administration, engineering, and programming. He also has the distinction of setting up the first Linux-based public access network in St. Georges, Bermuda as well as one of the first large-scale Linux-based mail servers in St. Thomas, USVI.

After a seven-year Atlantic/Caribbean cruise under sail and passages up and down the East coast of the US, he is currently anchored in northern Florida. His consulting business presents him with a variety of challenges such as teaching professional advancement courses for Sun Microsystems and providing Open Source solutions for local companies.

His current set of hobbies includes flying, yoga, martial arts, motorcycles, writing, Roman history, and playing with his Ubuntu-based home network, in which he is ably assisted by his wife, son and daughter; his Palm Pilot is crammed full of alarms, many of which contain exclamation points.

He has been working with Linux since 1997, and credits it with his complete loss of interest in waging nuclear warfare on parts of the Pacific Northwest.


Copyright © 2010, Ben Okopnik. Released under the Open Publication License unless otherwise noted in the body of the article. Linux Gazette is not produced, sponsored, or endorsed by its prior host, SSC, Inc.

Published in Issue 176 of Linux Gazette, July 2010
