Example: Hitchhiker's Guide

This is an example of using the threading macros and a REPL to give fast feedback as you are developing code.

Note Write functions that give a list of the most used words in a book, excluding common English words like “the, and, it, I”. Join those functions with a threading macro.

We suggest you read the (assumed perfectly legal) copy of the Hitchhiker's Guide book text using the slurp function.
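For readers new to the threading macros, here is a minimal sketch of how they rewrite nested calls. The sample strings are made up for illustration and are not taken from the book. The thread-first macro `->` passes each result as the first argument of the next form, while thread-last `->>` passes it as the last argument.

```clojure
(require '[clojure.string :as string])

;; Nested calls have to be read from the inside out
(def nested
  (string/split (string/lower-case "Dont Panic") #" "))

;; Thread-first (->) passes each result as the FIRST argument of the next form
(def thread-first
  (-> "Dont Panic"
      string/lower-case
      (string/split #" ")))

;; Thread-last (->>) passes each result as the LAST argument of the next form,
;; which suits sequence functions such as map, remove and filter
(def thread-last
  (->> ["Dont" "Panic"]
       (map string/lower-case)
       (remove #{"dont"})))
```

Evaluating `nested` and `thread-first` in the REPL should give the same vector of lower-case words, which is a quick way to confirm that the threaded form is only a rearrangement of the nested one.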

Approximate algorithm

  • Use a regular expression to create a collection of individual words - e.g. #"[a-zA-Z0-9|']+"
  • Convert all the words to lower case so they match the common words source - clojure.string/lower-case
  • Remove the common English words used in the book, leaving more context-specific words
  • Calculate the frequencies of the remaining words, returning a map of word & word count pairs
  • Sort the pairs in the map by their word count values - sort-by
  • Reverse the sorted collection so the most commonly used word is the first element in the resulting sequence

    ;; make sure clojure.string is loaded before its functions are called
    (require '[clojure.string])

    (def book (slurp "http://clearwhitelight.org/hitch/hhgttg.txt"))

    (def common-english-words
      (-> (slurp "http://www.textfixer.com/resources/common-english-words.txt")
          (clojure.string/split #",")
          set))

    ;; using a function to pull in any book
    (defn get-book [book-url]
      (slurp book-url))

    (defn -main [book-url]
      (->> (get-book book-url)
           (re-seq #"[a-zA-Z0-9|']+")
           (map #(clojure.string/lower-case %))
           (remove common-english-words)
           frequencies
           (sort-by val)
           reverse))

    ;; Call the program
    (-main "http://clearwhitelight.org/hitch/hhgttg.txt")
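
To see the intermediate value produced at each step without downloading anything, the same pipeline can be sketched against a short in-memory string. The sample text, the stop-word set and the names below are made up for illustration. They stand in for the book and the common English words file.

```clojure
(require '[clojure.string :as string])

;; Made-up sample text and stop-word set, standing in for the book and the
;; common English words file
(def sample-text "The ship hung in the sky. The ship was yellow.")
(def sample-common-words #{"the" "in" "was"})

;; Step 1: split the text into individual words
(def words (re-seq #"[a-zA-Z0-9|']+" sample-text))

;; Step 2: lower-case every word so it can match the stop-word set
(def lower-words (map string/lower-case words))

;; Step 3: a set works as a predicate, so remove drops every word it contains
(def interesting-words (remove sample-common-words lower-words))

;; Step 4: count occurrences, giving a map of word to count
(def word-counts (frequencies interesting-words))

;; Steps 5 and 6: sort the map entries by count, then put the highest first
(def most-used-first (reverse (sort-by val word-counts)))
```

Evaluating each name in turn in the REPL shows how the data changes shape: a sequence of strings, then a map, then a sequence of word and count pairs.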

Deconstructing the code in the REPL

To understand what each function in the -main function does, comment out one or more expressions by placing the #_ reader macro in front of them. The reader then discards the expression that follows.

    (defn -main [book-url]
      (->> (get-book book-url)
           #_(re-seq #"[a-zA-Z0-9|']+")
           #_(map #(clojure.string/lower-case %))
           #_(remove common-english-words)
           #_frequencies
           #_(sort-by val)
           #_reverse))

Now the -main function returns only the result of the (get-book book-url) function. To see what each of the other lines does, remove the #_ from the front of the next expression and re-evaluate the -main function in the REPL. Work from the top down, since each step consumes the output of the one before it.
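
The effect of #_ can be tried on a small scale first. This is a minimal sketch using a made-up sample string and hypothetical names (`sample`, `step-1`, `step-2`) rather than the book text.

```clojure
;; Made-up sample text for illustration
(def sample "so long and so long")

;; Only the first step is active: the reader discards each form after #_
(def step-1
  (->> sample
       (re-seq #"[a-zA-Z0-9|']+")
       #_frequencies
       #_(sort-by val)
       #_reverse))

;; Delete the #_ in front of frequencies to re-enable the next step
(def step-2
  (->> sample
       (re-seq #"[a-zA-Z0-9|']+")
       frequencies
       #_(sort-by val)
       #_reverse))
```

Here `step-1` evaluates to the sequence of words and `step-2` to the map of word counts, so comparing the two shows exactly what frequencies contributes.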

Hint In Spacemacs / Emacs, the keybinding C-c C-p shows the output in a separate buffer. This is very useful when the function returns a large result set.

Offline sources of the Hitchhiker's Guide book and common English words

    (def book (slurp "./hhgttg.txt"))

    (def common-english-words
      (-> (slurp "common-english-words.txt")
          (clojure.string/split #",")
          set))

Original concept from Misophistful: Understanding thread macros in Clojure

Hint The slurp function reads the whole file into memory as a single string, so it may not be appropriate for very large files. If you are dealing with a large file, consider reading it lazily, a line at a time, or using Java IO directly (e.g. java.io.BufferedReader, java.io.FileReader). See the Clojure I/O cookbook and The Ins & Outs of Clojure for examples.
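
As one possible approach for large files, the word count can be built from a lazy sequence of lines instead of one large string. This is a minimal sketch, not the original author's method. The function name `word-frequencies-from-file` and its parameters are hypothetical. It assumes `clojure.java.io/reader` and `line-seq` from the standard library.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as string])

(defn word-frequencies-from-file
  "Hypothetical helper: counts the words in the file at `path`, skipping any
  word found in the `common-words` set, without reading the whole file into
  memory as a single string."
  [path common-words]
  ;; with-open closes the reader once the body has finished
  (with-open [reader (io/reader path)]
    (->> (line-seq reader)                          ; lazy sequence of lines
         (mapcat #(re-seq #"[a-zA-Z0-9|']+" %))     ; words from each line
         (map string/lower-case)
         (remove common-words)
         ;; frequencies is eager, so every line is consumed while the
         ;; reader is still open
         frequencies)))
```

A call such as `(reverse (sort-by val (word-frequencies-from-file "./hhgttg.txt" common-english-words)))` would then give the same ordering as the -main function. The map of counts still lives in memory, but it is normally far smaller than the text itself.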