23. Practical: A Spam Filter - Testing the Filter - 《Practical Common Lisp》

Testing the Filter

To test the filter, you need a corpus of messages of known types. You can use messages lying around in your inbox, or you can grab one of the corpora available on the Web. For instance, the SpamAssassin corpus[12] contains several thousand messages hand-classified as spam, easy ham, and hard ham. To make it easy to use whatever files you have, you can define a test rig that’s driven off an array of file/type pairs. You can define a function that takes a filename, a type, and a corpus and adds the file/type pair to the corpus like this:

(defun add-file-to-corpus (filename type corpus)
  (vector-push-extend (list filename type) corpus))

The value of corpus should be an adjustable vector with a fill pointer. For instance, you can make a new corpus like this:

(defparameter *corpus* (make-array 1000 :adjustable t :fill-pointer 0))
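To see why the fill pointer and adjustability matter, here is a minimal sketch using a deliberately tiny vector. The names *demo-corpus* and demo-add are hypothetical and exist only for this illustration; demo-add repeats the body of add-file-to-corpus so the sketch stands alone.

```lisp
;; Hypothetical demo vector, deliberately tiny so that it has to grow.
(defparameter *demo-corpus* (make-array 2 :adjustable t :fill-pointer 0))

(length *demo-corpus*)                 ; 0, because LENGTH reports the fill pointer
(array-dimension *demo-corpus* 0)      ; 2, the allocated capacity

;; Same body as add-file-to-corpus, repeated so this sketch is self-contained.
(defun demo-add (filename type corpus)
  (vector-push-extend (list filename type) corpus))

(demo-add "a.txt" 'ham *demo-corpus*)
(demo-add "b.txt" 'spam *demo-corpus*)
(demo-add "c.txt" 'spam *demo-corpus*) ; exceeds the capacity, so the vector is extended

(length *demo-corpus*)                 ; 3
(aref *demo-corpus* 2)                 ; ("c.txt" SPAM)
```

Because the vector grows on demand, the initial size of 1000 in *corpus* is only a starting allocation, not a limit on how many messages you can add.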

If you have the hams and spams already segregated into separate directories, you might want to add all the files in a directory as the same type. This function, which uses the list-directory function from Chapter 15, will do the trick:

(defun add-directory-to-corpus (dir type corpus)
  (dolist (filename (list-directory dir))
    (add-file-to-corpus filename type corpus)))

For instance, suppose you have a directory mail containing two subdirectories, spam and ham, each containing messages of the indicated type; you can add all the files in those two directories to *corpus* like this:

SPAM> (add-directory-to-corpus "mail/spam/" 'spam *corpus*)
NIL
SPAM> (add-directory-to-corpus "mail/ham/" 'ham *corpus*)
NIL
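Since add-directory-to-corpus returns NIL, it can be worth checking what actually landed in the corpus. The following is a sketch only: *sample-corpus*, count-of-type, and the filenames are made up for illustration and are not part of the filter.

```lisp
;; Hypothetical stand-in for a loaded corpus; the filenames are invented.
(defparameter *sample-corpus*
  (make-array 4 :adjustable t :fill-pointer 0))

(dolist (entry '(("mail/spam/1" spam) ("mail/spam/2" spam) ("mail/ham/1" ham)))
  (vector-push-extend entry *sample-corpus*))

;; Each entry is a (filename type) list, so SECOND extracts the type.
(defun count-of-type (type corpus)
  (count type corpus :key #'second))

(count-of-type 'spam *sample-corpus*) ; 2
(count-of-type 'ham *sample-corpus*)  ; 1
```

A badly lopsided ratio of spam to ham is worth noticing before you start interpreting accuracy figures.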

Now you need a function to test the classifier. The basic strategy will be to select a random chunk of the corpus to train on and then test the classifier on the remainder of the corpus, comparing the classification returned by the classify function to the known classification. The main thing you want to know is how accurate the classifier is: what percentage of the messages are classified correctly? But you’ll probably also be interested in which messages were misclassified and in what direction: were there more false positives or more false negatives? To make it easy to perform different analyses of the classifier’s behavior, you should define the testing functions to build a list of raw results, which you can then analyze however you like.

The main testing function might look like this:

(defun test-classifier (corpus testing-fraction)
  (clear-database)
  (let* ((shuffled (shuffle-vector corpus))
         (size (length corpus))
         (train-on (floor (* size (- 1 testing-fraction)))))
    (train-from-corpus shuffled :start 0 :end train-on)
    (test-from-corpus shuffled :start train-on)))

This function starts by clearing out the feature database.[13] Then it shuffles the corpus, using a function you’ll implement in a moment, and figures out, based on the testing-fraction parameter, how many messages it’ll train on and how many it’ll reserve for testing. The two helper functions train-from-corpus and test-from-corpus will both take :start and :end keyword parameters, allowing them to operate on a subsequence of the given corpus.
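The split arithmetic is easy to check in isolation. This sketch pulls the train-on computation out into a hypothetical helper named training-count, which is not part of the filter, and uses rational fractions so the results are exact.

```lisp
;; Hypothetical helper that isolates the arithmetic used in test-classifier.
(defun training-count (size testing-fraction)
  (floor (* size (- 1 testing-fraction))))

(training-count 1000 1/5) ; 800: train on indices 0-799, test on 800-999
(training-count 10 1/3)   ; 6: 10 * 2/3 = 20/3, rounded down
(training-count 10 0)     ; 10: nothing is left for testing
(training-count 10 1)     ; 0: nothing is used for training
```

Because **FLOOR** rounds down, any fractional message goes to the testing portion rather than the training portion.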

The train-from-corpus function is quite simple: loop over the appropriate part of the corpus, use **DESTRUCTURING-BIND** to extract the filename and type from the list found in each element, and then pass the text of the named file and the type to train. Since some mail messages, such as those with attachments, are quite large, you should limit the number of characters it takes from each message. It obtains the text with a function start-of-file, which you’ll implement in a moment, that takes a filename and a maximum number of characters to return. train-from-corpus looks like this:

(defparameter *max-chars* (* 10 1024))
(defun train-from-corpus (corpus &key (start 0) end)
  (loop for idx from start below (or end (length corpus)) do
        (destructuring-bind (file type) (aref corpus idx)
          (train (start-of-file file *max-chars*) type))))

The test-from-corpus function is similar, except you want to return a list containing the results of each classification so you can analyze them after the fact. Thus, you should capture both the classification and score returned by classify and then collect a list of the filename, the actual type, the type returned by classify, and the score. To make the results more human-readable, you can include keywords in the list to indicate which values are which.

(defun test-from-corpus (corpus &key (start 0) end)
  (loop for idx from start below (or end (length corpus)) collect
        (destructuring-bind (file type) (aref corpus idx)
          (multiple-value-bind (classification score)
              (classify (start-of-file file *max-chars*))
            (list 
             :file file
             :type type
             :classification classification
             :score score)))))
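Because each result is a property list, **GETF** is enough to start analyzing the output. The following is one possible sketch, not a definitive analysis: *sample-results* is hand-written data in the shape test-from-corpus returns, with invented filenames and scores, and correct-p, accuracy, and false-positives are hypothetical helpers. It assumes classify returns a symbol such as spam or ham that can be compared directly with the known type.

```lisp
;; Hand-written sample in the shape returned by test-from-corpus.
;; The filenames and scores are invented for illustration.
(defparameter *sample-results*
  (list (list :file "a" :type 'spam :classification 'spam :score 0.98)
        (list :file "b" :type 'ham  :classification 'ham  :score 0.01)
        (list :file "c" :type 'ham  :classification 'spam :score 0.85)
        (list :file "d" :type 'spam :classification 'ham  :score 0.10)))

;; Hypothetical helper: true when the classification matches the known type.
(defun correct-p (result)
  (eql (getf result :type) (getf result :classification)))

;; Hypothetical helper: fraction classified correctly, as an exact rational.
;; Assumes RESULTS is non-empty; an empty list would divide by zero.
(defun accuracy (results)
  (/ (count-if #'correct-p results) (length results)))

;; Hypothetical helper: hams that were classified as spam.
(defun false-positives (results)
  (remove-if-not (lambda (result)
                   (and (eql (getf result :type) 'ham)
                        (eql (getf result :classification) 'spam)))
                 results))

(accuracy *sample-results*)                               ; 1/2
(mapcar (lambda (result) (getf result :file))
        (false-positives *sample-results*))               ; ("c")
```

Keeping the raw results as a list of plists means that further questions, such as counting false negatives, need only another small filter over the same data.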