OS/2 eZine - http://www.os2ezine.com
March 16, 2004
 
Robert Basler is the president of Aurora Systems, Inc. and has been a dedicated OS/2 user since he tired of rebooting Windows 3.1 twenty times a day. He spends what free time he can manage travelling the world. Photo was taken at Franz Josef glacier, New Zealand.






Bayesian Email Filtering with Bogofilter and Weasel

A couple of years ago, I wrote about JunkSpy 2.0, which I found to be a great help with the deluge of spam I get. Recently, however, I found that more and more spam was making it past JunkSpy to my inbox. JunkSpy still caught a lot of it, but I get about 500 spam messages a week, and enough were getting through to be a real nuisance.

I had read a lot in magazines about Bayesian filtering, which learns to identify spam by analyzing the messages you receive. As a developer of learning software, I thought this sounded like a really common-sense approach to the problem of stopping spam. After trying a couple of client-side versions, I settled on bogofilter, which is available from Hobbes.

I am running Weasel 1.50 as my mail server. Newer versions use a different style of REXX filter, so you'll need to adapt the concepts used in the SPAMPROCESS.CMD script if you are using one of those versions. When you specify the SPAMPROCESS.CMD filter in Weasel, make sure to have Weasel serialize the use of the filter: the script uses a fixed file for its processing and won't work reliably otherwise. You'll also have to adjust the script to use the folders where your mail accounts are set up.

How the Filter Script Works

The first section of the spamprocess.cmd script checks for outgoing mail by looking to see if the mail is destined for the forward directory and passes it through unfiltered.

The second section takes any mail to the postmaster account, a must for mail servers, and routes it all to a folder for spam so I can feed it to bogofilter later. I have a nice cron job which deletes all postmaster mail every night. I know that's some sort of computer geek sin, but that account gets twice as much junk as any other account and I really just don't have time to deal with it. By the way, don't just delete your postmaster account; blackhole lists really don't like servers without postmaster accounts.

The next section actually runs bogofilter.exe and makes a copy of the source message in the file temp.msg, with the bogofilter header tags added that mark the note as spam or not. You'll note the -3 command line parameter, which is supposed to make bogofilter classify each message as spam, not spam, or unsure. Unfortunately, I've never seen a note marked as unsure, so I don't think this actually works.

The next three sections put a message on the Weasel console indicating what bogofilter thought of the message, move the marked-up copy to the proper destination folder, and make a copy of the original note in one of three folders I set up for training: SPAM, NOSPAM or UNSURE. That's about it for the filter script.
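
As a rough illustration, the decision flow described above could be sketched in Python as follows. This is not the actual SPAMPROCESS.CMD, which is a REXX script. The directory names "forward" and "postmaster" and the verdict words "No" and "Unsure" are assumptions made for the example (only "Yes" appears in the header shown later in this article), and the function only decides what to do; it does not run bogofilter or move any files.

```python
from typing import NamedTuple, Optional


class Decision(NamedTuple):
    run_bogofilter: bool            # should the message be classified at all?
    training_folder: Optional[str]  # SPAM, NOSPAM, UNSURE, or None
    console_note: str               # text to show on the mail server console


def decide(recipient_dir: str, verdict: Optional[str] = None) -> Decision:
    """Pick an action from the destination directory and bogofilter's verdict.

    The directory names and the "No"/"Unsure" verdict words are assumptions.
    """
    # Take the last path component, accepting either kind of slash.
    name = recipient_dir.replace("\\", "/").rstrip("/").split("/")[-1].lower()

    # Section 1: outgoing mail is passed through unfiltered.
    if name == "forward":
        return Decision(False, None, "outgoing mail, passed through unfiltered")

    # Section 2: everything for postmaster is kept as spam training material.
    if name == "postmaster":
        return Decision(False, "SPAM", "postmaster mail, saved for spam training")

    # Remaining sections: route by the verdict bogofilter wrote in its header.
    folders = {"Yes": "SPAM", "No": "NOSPAM", "Unsure": "UNSURE"}
    folder = folders.get(verdict, "UNSURE")  # unknown or missing verdict
    return Decision(True, folder, f"bogofilter verdict: {verdict}")
```

Keeping the decision separate from the file handling makes it easy to check the routing rules without touching a live mail spool.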

Training

The biggest drawback to using bogofilter is that you have to train it before it is going to be any good at all. For some reason, it doesn't come pretrained. To get bogofilter going, I fed it about 5000 non-spam email notes extracted from my email application. Initially I didn't have any spam messages to feed it, and as a result, bogofilter marked everything as good.
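
To see why a filter trained only on good mail marks everything as good, here is a deliberately simplified toy word-frequency filter in Python. It is not bogofilter's actual algorithm, and the class name, the scoring rule and the sample messages are all made up for illustration.

```python
from collections import Counter


class ToyFilter:
    """A toy word-frequency spam scorer (illustrative only, not bogofilter)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "good": Counter()}
        self.messages = {"spam": 0, "good": 0}

    def train(self, text, label):
        # Count each distinct word once per message.
        self.messages[label] += 1
        self.counts[label].update(set(text.lower().split()))

    def word_score(self, word):
        # Fraction of messages of each type that contain the word.
        rates = {}
        for label in ("spam", "good"):
            n = self.messages[label]
            rates[label] = self.counts[label][word] / n if n else 0.0
        total = rates["spam"] + rates["good"]
        # A word never seen in training is treated as neutral (0.5).
        return rates["spam"] / total if total else 0.5

    def spamicity(self, text):
        # Simple average of the word scores; real filters combine them
        # in more sophisticated ways.
        words = set(text.lower().split())
        scores = [self.word_score(w) for w in words]
        return sum(scores) / len(scores) if scores else 0.5


f = ToyFilter()
f.train("meeting about the os2 driver", "good")
no_spam_yet = f.spamicity("buy cheap pills")      # cannot exceed 0.5
f.train("buy cheap pills now", "spam")
after_spam = f.spamicity("buy cheap pills")       # every word is now spammy
```

With no spam in the training data, every known word scores 0.0 and every unknown word scores 0.5, so no message can ever score high enough to be flagged. Once spam samples are added, the same message scores far higher.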

bogofilter comes with two REXX scripts: train-no-spam.cmd and train-spam.cmd. To train bogofilter, you just run the scripts against the directories that hold your spam and non-spam .MSG files, like this:

train-spam e:\spammsgs
train-no-spam e:\notspammsgs
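
The contents of the two training scripts are not shown here. As a sketch only, and assuming they boil down to feeding each message file to bogofilter with its -s (register as spam) or -n (register as non-spam) option, the same loop could look like this in Python. The function names are hypothetical, and the example assumes bogofilter is on the PATH and reads the message from standard input.

```python
import subprocess
from pathlib import Path


def training_jobs(folder, spam):
    """List (command, message file) pairs for every .msg file in a folder."""
    flag = "-s" if spam else "-n"   # -s: train as spam, -n: train as non-spam
    # Note: on a case-sensitive file system this pattern will not match .MSG.
    return [(["bogofilter", flag], path)
            for path in sorted(Path(folder).glob("*.msg"))]


def run_training(folder, spam):
    """Feed each message to bogofilter on standard input."""
    for argv, path in training_jobs(folder, spam):
        with open(path, "rb") as message:
            subprocess.run(argv, stdin=message, check=True)
```

Calling run_training with the spam folder and spam=True, then with the non-spam folder and spam=False, would mirror the two commands above.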

The big trick of course is separating spam from non-spam so that you have nice clean samples for training. According to the documentation it is really important not to misidentify messages while training.

The spamprocess.cmd script puts messages into what it thinks are the right folders. Before training, it is important to pick any known-good messages out of the spam and any known-bad messages out of the good. To do this, I wrote an additional script called presort.cmd. It has a bunch of grep lines that search all the .MSG files for terms I know mean a message is good and put the names of those files into a file called good.cmd. It does the same for junk and puts those names into bad.cmd. You can build a similar list of terms for the sort of mail you get the first time you go through the .MSG files to sort them.

grep -i -l openal *.msg >good.cmd
grep -i -l dxr3 *.msg >>good.cmd
grep -i -l os2ddprog *.msg >>good.cmd
grep -i -l "postmaster@aurora-systems.com" *.msg >bad.cmd

Once these .CMD files are created, they are just a list of filenames, so I use EPM's search and replace to insert MOVE commands to move the good and bad files to the proper folders ready for training. The rest of the messages I then scan through quickly with EPM (I love the way it loads multiple files) to make sure everything is where it should be. This is still time-consuming, but at least most of the work is done for you. I can go through 500 messages in about 10 minutes after they are presorted.
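
The presort and the search-and-replace step could also be combined in one small program. The following Python sketch is an illustration, not the presort.cmd script itself. The search terms come from the grep lines above, while the function names and the target folder passed to move_commands are hypothetical.

```python
from pathlib import Path

# Search terms taken from the grep lines above.
GOOD_TERMS = ["openal", "dxr3", "os2ddprog"]
BAD_TERMS = ["postmaster@aurora-systems.com"]


def matching_files(folder, terms):
    """Names of .msg files containing any term, ignoring case (like grep -i -l)."""
    lowered = [term.lower() for term in terms]
    hits = []
    for path in sorted(Path(folder).glob("*.msg")):
        text = path.read_text(errors="replace").lower()
        if any(term in text for term in lowered):
            hits.append(path.name)   # listed once, even if several terms match
    return hits


def move_commands(names, target):
    """Turn a list of file names into MOVE command lines."""
    return [f"MOVE {name} {target}" for name in names]
```

Unlike the appended grep output, this lists a file only once when it matches more than one term, so no MOVE line is repeated. Writing the result of move_commands to good.cmd or bad.cmd would give the same kind of file that the editing step produces by hand.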

Setting up your Email Client

bogofilter works by adding headers to email messages that your email program's filters can use to tell whether bogofilter thinks a message is spam. Each message gets a line like this one:

  X-Bogosity: Yes, tests=bogofilter, spamicity=1.000000, version=0.14.5.4

I just have the email client look for the X-Bogosity: Yes part. The rest of the line is interesting, but not of much practical use. I have the filter move the messages into a spam folder, which I look at once a week or so just to make sure no mistakes have been made.
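
For a mail client or script that can run code rather than a simple text match, the header can be picked apart with Python's standard email module. This is a minimal sketch; the function name bogosity is made up, and the field layout is assumed to match the sample line above.

```python
from email import message_from_string


def bogosity(raw_message):
    """Return (verdict, spamicity) from a message's X-Bogosity header.

    Both values are None when the header or the field is missing.
    """
    # Header lookup in the email module is case-insensitive.
    header = message_from_string(raw_message).get("X-Bogosity", "")
    fields = [field.strip() for field in header.split(",")]
    verdict = fields[0] if fields[0] else None
    spamicity = None
    for field in fields[1:]:
        key, _, value = field.partition("=")
        if key == "spamicity":
            spamicity = float(value)
    return verdict, spamicity


sample = (
    "X-Bogosity: Yes, tests=bogofilter, spamicity=1.000000, version=0.14.5.4\n"
    "Subject: test\n"
    "\n"
    "body\n"
)
verdict, score = bogosity(sample)
is_spam = verdict == "Yes"
```

Testing only the first field mirrors the simple rule described above, while the spamicity value is available if a stricter threshold is ever wanted.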

Wrapping Up

Plan to have a lot of messages to train bogofilter. It took a couple thousand messages of each type before bogofilter started to work for me. I'm really concerned about getting real messages marked as spam, so I let bogofilter run for a couple of weeks, collecting messages while I looked at what it thought of my mail before I started having my email application use it as a basis for filtering. The payoff of waiting was that so far, bogofilter has not classified one real message as spam.

I still run JunkSpy 2.0 on the client because it catches notes that bogofilter doesn't. After I fed it about 2000 spam messages for training, bogofilter now catches about 85% of the spam I get. JunkSpy catches nearly all of what bogofilter misses, and typically about 10 messages out of the 500 or so spam I get each week make it through to my inbox.
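
Taking those approximate figures at face value, the arithmetic works out as in this short Python calculation. The variable names are arbitrary, and the results are only as precise as the rounded numbers above.

```python
spam_per_week = 500          # approximate weekly spam volume
bogofilter_rate_pct = 85     # share caught by bogofilter, in percent
reach_inbox = 10             # spam that gets past both filters

# Spam that bogofilter lets through to the client: 15% of 500 = 75.
missed_by_bogofilter = spam_per_week * (100 - bogofilter_rate_pct) // 100

# Of those, JunkSpy stops all but the ones that reach the inbox: 75 - 10 = 65.
caught_by_junkspy = missed_by_bogofilter - reach_inbox

# JunkSpy's share of what bogofilter missed, and the combined catch rate.
junkspy_share = caught_by_junkspy / missed_by_bogofilter
combined_rate = (spam_per_week - reach_inbox) / spam_per_week

print(f"JunkSpy catches {junkspy_share:.0%} of the leftovers")
print(f"Combined catch rate: {combined_rate:.0%}")
```

In other words, the two filters together stop roughly 98% of the weekly spam.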

Shortly after I implemented this solution, the folks at JunkSpy came out with a new version, 3.0. I haven't tried it yet, but they claim it does a much better job than previous versions.


Copyright (C) 2004. All Rights Reserved.