Asynchronous Spam Filtering
Goals
The following is an explanation of a program I wrote to filter mail after it is delivered to the user's mailbox.
- In the search for spam, this script iterates through the SQL database, not the file system.
- Spam is quarantined or deleted based on its spam level.
- It creates missing folders automatically.
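The quarantine-or-delete decision can be sketched as a small function. This is only an illustration, not the code from scanmailboxes.py: the name classify, the delete_margin parameter, and the idea that deletion starts a fixed margin above the user's limit are all assumptions.

```python
def classify(score, limit, delete_margin=10.0):
    """Return 'deliver', 'quarantine', or 'delete' for one message.

    score         -- spam rating reported for the message
    limit         -- the user's permissible value (from the database)
    delete_margin -- hypothetical: how far above the limit a message
                     must score before it is deleted outright
    """
    if score <= limit:
        return "deliver"
    if score > limit + delete_margin:
        return "delete"
    return "quarantine"
```

With a limit of 3.9, a score of 5.0 would be quarantined under these assumptions, while a score of -1.0 would be delivered.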
Operation
This script is run from cron just like the first incarnation:
*/4 * * * * nice -n 10 /usr/local/bin/scanmail >> /var/log/scan.log 2>&1
The incoming mail server needs to deliver mail to an IMAP folder called .Queue. When scanmailboxes.py runs, it moves the mail into the user's INBOX. In the source, see the definitions of queuefolder, mailfolder, and junkfolder. My Exim configuration for delivering this mail looks like this:
virtual_delivery:
  driver = appendfile
  directory = /var/mail/$domain/$local_part/.Queue
  maildir_format
  user = vmail
  group = vmail
  mode = 0660
  directory_mode = 0770
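Moving a message out of the queue amounts to a rename within the same maildir tree. The sketch below is an assumption about how that step might look, not the actual code: move_message is a hypothetical helper, and the default values given for queuefolder and mailfolder are guesses at what the source defines.

```python
import os

def move_message(mailbox, filename, queuefolder=".Queue", mailfolder=""):
    """Move one message from the queue into another folder's new/ directory.

    mailbox     -- the user's maildir root
    queuefolder -- where the MTA delivers (assumed value)
    mailfolder  -- destination; "" means the INBOX, i.e. the maildir root
                   itself (assumed value); a junk folder name works too
    """
    src = os.path.join(mailbox, queuefolder, "new", filename)
    dst = os.path.join(mailbox, mailfolder, "new", filename)
    # rename is atomic as long as both folders live on one file system
    os.rename(src, dst)
    return dst
```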
Maildir++ compliance was achieved by following a few simple rules:
- The IMAP directory starts with a period.
- An empty file named maildirfolders is created. This is used to indicate a child relationship to maildir readers.
- Enforce proper mailbox ownership!
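The rules above can be sketched as a helper that creates a missing folder. This is a minimal sketch under stated assumptions, not the code from scanmailboxes.py: ensure_folder is a hypothetical name, the marker file name is taken from the text above, and the modes mirror the Exim configuration. Ownership is left as a comment because changing it needs root.

```python
import os

def ensure_folder(mailbox, name):
    """Create the folder `name` under `mailbox` if it is missing."""
    if not name.startswith("."):
        name = "." + name  # rule 1: the IMAP directory starts with a period
    folder = os.path.join(mailbox, name)
    for sub in ("cur", "new", "tmp"):
        os.makedirs(os.path.join(folder, sub), mode=0o770, exist_ok=True)
    # rule 2: an empty marker file signals the child relationship
    marker = os.path.join(folder, "maildirfolders")
    open(marker, "a").close()
    os.chmod(marker, 0o660)
    # rule 3: ownership; when running as root, something like
    # os.chown(path, vmail_uid, vmail_gid) would be applied to each path
    return folder
```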
The log file will look similar to this:
Tue, 02 Dec 2003 13:27:48
/home/vmail/teisprint.net/sysadmin
/home/vmail/teisprint.net/ericradman
/home/vmail/teisprint.net/joni
1070376261.VbI19ac4.smtp-gw.teisprint.com -1.0 < 3.9
1070382214.VbI19ddc.smtp-gw.teisprint.com -4.6 < 3.9
1070379942.VbI19d5e.smtp-gw.teisprint.com
Deleting 1070379942.VbI19d5e.smtp-gw.teisprint.com
1070383285.VbI19e08.smtp-gw.teisprint.com 5.0 > 3.9
1070385124.VbI19fae.smtp-gw.teisprint.com 2.3 < 3.9
1070381076.VbI19db3.smtp-gw.teisprint.com 6.4 > 3.9
1070382381.VbI19df4.smtp-gw.teisprint.com 8.6 > 3.9
1070377027.VbI19b23.smtp-gw.teisprint.com 7.9 > 3.9
1070378886.VbI19d4a.smtp-gw.teisprint.com 7.0 > 3.9
1070383012.VbI19fa2.smtp-gw.teisprint.com 7.0 > 3.9
1070383060.VbI19fa3.smtp-gw.teisprint.com -4.9 < 3.9
1070385317.VbI19fd4.smtp-gw.teisprint.com 2.7 < 3.9
1070385506.VbI19fd9.smtp-gw.teisprint.com 6.6 > 3.9
On each message line, the first number is the spam rating that SpamAssassin gave the message, and the second number is the permissible value for that user, extracted from the SQL database.
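For anyone who wants to summarize the log, a message line can be picked apart with a regular expression. This is a sketch based only on the sample above; parse_score_line is a hypothetical helper, and lines that do not carry a score (mailbox paths, the date, "Deleting ...") simply yield None.

```python
import re

# message-id, score, comparison sign, per-user limit
LINE = re.compile(
    r"^(?P<msg>\S+)\s+(?P<score>-?\d+(?:\.\d+)?)\s+"
    r"(?P<op>[<>])\s+(?P<limit>-?\d+(?:\.\d+)?)$"
)

def parse_score_line(line):
    """Return (message, score, limit, over_limit) or None."""
    m = LINE.match(line.strip())
    if m is None:
        return None
    return (
        m.group("msg"),
        float(m.group("score")),
        float(m.group("limit")),
        m.group("op") == ">",
    )
```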
I added the following parameters to /etc/mail/spamassassin/local.cf in order to prevent SpamAssassin from rewriting the message and from checking RBLs (real-time blackhole lists):
rewrite_subject 0
report_safe 0
Of course spamd must be running for the spamc analyzer to function.
$ ps -x | grep spam
6442 ? S 0:00 \ /usr/bin/perl /usr/bin/spamd -d -r /var/run/spamd.pid -a -c
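One way to obtain a score from a running spamd is spamc's check mode, which prints the score and threshold separated by a slash and exits with status 1 for spam. Whether scanmailboxes.py calls spamc this way is an assumption; score_message and parse_check_output are hypothetical names. The threshold spamc reports is the server-wide one, so a per-user limit would still be compared separately.

```python
import subprocess

def parse_check_output(text):
    """Parse the 'score/threshold' line printed by `spamc -c`."""
    score, threshold = text.strip().split("/", 1)
    return float(score), float(threshold)

def score_message(path):
    """Ask spamd (through spamc) to rate the message stored at `path`."""
    with open(path, "rb") as msg:
        # exit status 1 means "spam", so it is not treated as a failure
        result = subprocess.run(
            ["spamc", "-c"], stdin=msg, stdout=subprocess.PIPE, check=False
        )
    return parse_check_output(result.stdout.decode("ascii", "replace"))
```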
Statistics on the filtering operation are kept in the database:
system# SELECT username,spamlimit FROM users WHERE spamlimit<5;
----------+-----------
 username | spamlimit
----------+-----------
 sysadmin | 3.9
 joni     | 3.9
----------+-----------
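Looking up one user's limit could be sketched as follows. The table and column names come from the query above; everything else is an assumption. The text does not say which database server is used, so sqlite3 stands in for it here, and spam_limit and its fallback default of 5.0 are hypothetical.

```python
import sqlite3

def spam_limit(conn, username, default=5.0):
    """Return the user's permissible spam value, or `default` if unknown."""
    row = conn.execute(
        "SELECT spamlimit FROM users WHERE username = ?", (username,)
    ).fetchone()
    return row[0] if row is not None else default

def demo_connection():
    """Build an in-memory stand-in for the real users table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (username TEXT, spamlimit REAL)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("sysadmin", 3.9), ("joni", 3.9)],
    )
    return conn
```

With the stand-in table, spam_limit(demo_connection(), "joni") returns the 3.9 seen in the log above.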
The Code
PID Detection
Steuben Technologies contributed this wrapper script, which runs the task and makes sure that no more than one copy runs at a time:
#!/bin/sh
# Run scanmailboxes.py unless a previous run is still active.
PIDFILE=/tmp/scanmail.pid

if [ -f "$PIDFILE" ]
then
    # A PID file exists; check whether that process is still alive.
    if ps -p "$(cat "$PIDFILE")" >/dev/null 2>&1
    then
        echo "Scanmail is already running, exiting"
        exit 1
    else
        # The process is dead; start another one.
        echo $$ >"$PIDFILE"
        /usr/local/bin/scanmailboxes.py
    fi
else
    # The PID file doesn't even exist; start the process.
    echo "Pid file not found"
    echo "Running scanmail"
    echo $$ >"$PIDFILE"
    /usr/local/bin/scanmailboxes.py
fi
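The same guard could also live inside the Python program itself. The following is a sketch of that idea, not part of the distributed code: acquire and pid_is_running are hypothetical names, and like the shell wrapper it is not safe against two copies starting at exactly the same moment.

```python
import os

def pid_is_running(pid):
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without signalling
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True

def acquire(pidfile):
    """Record our PID in `pidfile`; return False if another copy is alive."""
    try:
        with open(pidfile) as f:
            old_pid = int(f.read().strip())
        if pid_is_running(old_pid):
            return False
    except (FileNotFoundError, ValueError):
        pass  # no PID file, or unreadable contents: treat as not running
    with open(pidfile, "w") as f:
        f.write(str(os.getpid()))
    return True
```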
Mailbox Cleanup
This is not really related, but move_old_boxes.rb is a nice little script that removes mailboxes that are no longer used, along with all their spam. By default it moves everything into a folder called /archive. I'm not 100% comfortable with a script that quietly removes files and maildir trees.