23. Practical: A Spam Filter - What’s Next - 《Practical Common Lisp》

What’s Next

What’s Next

Obviously, you could do a lot more with this code. To turn it into a real spam-filtering application, you’d need to find a way to integrate it into your normal e-mail infrastructure. One approach that would make it easy to integrate with almost any e-mail client is to write a bit of code to act as a POP3 proxy—that’s the protocol most e-mail clients use to fetch mail from mail servers. Such a proxy would fetch mail from your real POP3 server and serve it to your mail client after either tagging spam with a header that your e-mail client’s filters can easily recognize or simply putting it aside. Of course, you’d also need a way to communicate with the filter about misclassifications—as long as you’re setting it up as a server, you could also provide a Web interface. I’ll talk about how to write Web interfaces in Chapter 26, and you’ll build one, for a different application, in Chapter 29.

Or you might want to work on improving the basic classification—a likely place to start is to make extract-features more sophisticated. In particular, you could make the tokenizer smarter about the internal structure of e-mail—you could extract different kinds of features for words appearing in the body versus the message headers. And you could decode various kinds of message encoding such as base 64 and quoted printable since spammers often try to obfuscate their message with those encodings.

But I’ll leave those improvements to you. Now you’re ready to head down the path of building a streaming MP3 server, starting by writing a general-purpose library for parsing binary files.

1Available at http://www.paulgraham.com/spam.html and also in Hackers & Painters: Big Ideas from the Computer Age (O’Reilly, 2004)

2There has since been some disagreement over whether the technique Graham described was actually “Bayesian.” However, the name has stuck and is well on its way to becoming a synonym for “statistical” when talking about spam filters.

3It would, however, be poor form to distribute a version of this application using a package starting with com.gigamonkeys since you don’t control that domain.

4A version of CL-PPCRE is included with the book’s source code available from the book’s Web site. Or you can download it from Weitz’s site at http://www.weitz.de/cl-ppcre/.

5The main reason to use **PRINT-UNREADABLE-OBJECT** is that it takes care of signaling the appropriate error if someone tries to print your object readably, such as with the ~S **FORMAT** directive.

6**PRINT-UNREADABLE-OBJECT** also signals an error if it’s used when the printer control variable ***PRINT-READABLY*** is true. Thus, a **PRINT-OBJECT** method consisting solely of a **PRINT-UNREADABLE-OBJECT** form will correctly implement the **PRINT-OBJECT** contract with regard to ***PRINT-READABLY***.

7If you decide later that you do need to have different versions of increment-feature for different classes, you can redefine increment-count as a generic function and this function as a method specialized on word-feature.

8Technically, the key in each clause of a **CASE** or **ECASE** is interpreted as a list designator, an object that designates a list of objects. A single nonlist object, treated as a list designator, designates a list containing just that one object, while a list designates itself. Thus, each clause can have multiple keys; **CASE** and **ECASE** will select the clause whose list of keys contains the value of the key form. For example, if you wanted to make good a synonym for ham and bad a synonym for spam, you could write increment-count like this:

(defun increment-count (feature type)
  (ecase type
    ((ham good) (incf (ham-count feature)))
    ((spam bad) (incf (spam-count feature)))))

9Speaking of mathematical nuances, hard-core statisticians may be offended by the sometimes loose use of the word probability in this chapter. However, since even the pros, who are divided between the Bayesians and the frequentists, can’t agree on what a probability is, I’m not going to worry about it. This is a book about programming, not statistics.

10Robinson’s articles that directly informed this chapter are “A Statistical Approach to the Spam Problem” (published in the Linux Journal and available at http://www.linuxjournal.com/ article.php?sid=6467 and in a shorter form on Robinson’s blog at http://radio.weblogs.com/ 0101454/stories/2002/09/16/spamDetection.html) and “Why Chi? Motivations for the Use of Fisher’s Inverse Chi-Square Procedure in Spam Classification” (available at http://garyrob.blogs.com/ whychi93.pdf). Another article that may be useful is “Handling Redundancy in Email Token Probabilities” (available at http://garyrob.blogs.com//handlingtokenredundancy94.pdf). The archived mailing lists of the SpamBayes project (http://spambayes.sourceforge.net/) also contain a lot of useful information about different algorithms and approaches to testing spam filters.

11Techniques that combine nonindependent probabilities as though they were, in fact, independent, are called naive Bayesian. Graham’s original proposal was essentially a naive Bayesian classifier with some “empirically derived” constant factors thrown in.

12Several spam corpora including the SpamAssassin corpus are linked to from http://nexp.cs.pdx.edu/~psam/cgi-bin/view/PSAM/CorpusSets.

13If you wanted to conduct a test without disturbing the existing database, you could bind *feature-database*, *total-spams*, and *total-hams* with a **LET**, but then you’d have no way of looking at the database after the fact—unless you returned the values you used within the function.

14This algorithm is named for the same Fisher who invented the method used for combining probabilities and for Frank Yates, his coauthor of the book Statistical Tables for Biological, Agricultural and Medical Research (Oliver & Boyd, 1938) in which, according to Knuth, they provided the first published description of the algorithm.