Chapter 4 Creating Reusable Command-line Tools - 4.3 Creating Command-line Tools with Python and R - 《Data Science at the Command Line》

4.3 Creating Command-line Tools with Python and R

The command-line tool that we created in the previous section was written in Bash. (Sure, not every feature of the Bash language was employed, but the interpreter still was bash.) As you may know by now, the command line is language agnostic, so we do not necessarily have to use Bash for creating command-line tools.

In this section we are going to demonstrate that command-line tools can be created in other programming languages as well. We will focus on Python and R because these are currently the two most popular programming languages within the data science community. We cannot offer a complete introduction to either language, so we assume that you have some familiarity with Python and/or R. Programming languages such as Java, Go, and Julia follow a similar pattern when it comes to creating command-line tools.

There are three main reasons for creating command-line tools in a programming language instead of Bash. First, you may have existing code that you wish to be able to use from the command line. Second, the command-line tool would end up encompassing more than a hundred lines of code. Third, the command-line tool needs to be very fast.

The six steps that we discussed in the previous section roughly apply to creating command-line tools in other programming languages as well. The first step, however, would not be copying and pasting from the command line, but rather copying and pasting the relevant code into a new file. Command-line tools in Python and R need to specify python (Python Software Foundation 2014) and Rscript (R Foundation for Statistical Computing 2014), respectively, as the interpreter after the shebang.

When it comes to creating command-line tools using Python and R, there are two more aspects that deserve special attention, which we discuss below. First, processing standard input, which comes naturally to shell scripts, has to be taken care of explicitly in Python and R. Second, as command-line tools written in Python and R tend to be more complex, we may also want to offer the user the ability to specify more complex command-line arguments.
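
To make the second aspect concrete, here is a minimal sketch of how richer command-line arguments could be declared with argparse from the Python standard library. The function name parse_args, the default of 10 words, and the --min-length option are assumptions made for illustration; they are not part of the tools in this chapter.

```python
import argparse


def parse_args(argv=None):
    """Parse arguments for a hypothetical top-words style tool."""
    parser = argparse.ArgumentParser(
        description="Print the most frequent words read from standard input.")
    # Optional positional argument, so the tool also works without it.
    parser.add_argument("num_words", nargs="?", type=int, default=10,
                        help="how many words to print (default: 10)")
    # Hypothetical option, added here only to show a named flag.
    parser.add_argument("-m", "--min-length", type=int, default=1,
                        help="ignore words shorter than this (default: 1)")
    # argv=None makes argparse fall back to sys.argv[1:].
    return parser.parse_args(argv)


# Simulate the invocation: tool 5 --min-length 3
args = parse_args(["5", "--min-length", "3"])
print(args.num_words, args.min_length)
```

In a real tool we would call parse_args() without arguments so that the values come from the command line. A side benefit of this approach is that argparse generates a -h/--help message from the declarations.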

4.3.1 Porting The Shell Script

As a starting point, let’s see how we would port the prior shell script to both Python and R. In other words, what Python and R code gives us the most frequently used words from standard input? It is not important whether implementing this task in anything other than a shell programming language is a good idea. What matters is that it gives us a good opportunity to compare Bash with Python and R.

We will first show the two files top-words.py and top-words.R and then discuss the differences with the shell code. In Python, the code would look something like Example 4.5.

Example 4.5 (~/book/ch04/top-words.py)

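#!/usr/bin/env python
import re
import sys
from collections import Counter
# Number of words to print, given as the first command-line argument
num_words = int(sys.argv[1])
# Read all of standard input at once and lowercase it
text = sys.stdin.read().lower()
# Split on runs of non-word characters and drop the empty strings
# that re.split can produce at the start and end of the text
words = [w for w in re.split(r'\W+', text) if w]
cnt = Counter(words)
for word, count in cnt.most_common(num_words):
    # print() with parentheses works in both Python 2 and Python 3
    print("%7d %s" % (count, word))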

Example 4.5 uses pure Python. When you want to do advanced text processing we recommend you check out the NLTK package (Perkins 2010). If you are going to work with a lot of numerical data, then we recommend you use the Pandas package (McKinney 2012).

And in R, the code would look something like Example 4.4 (thanks to Hadley Wickham):

Example 4.4 (~/book/ch04/top-words-1.R)

#!/usr/bin/env Rscript
n <- as.integer(commandArgs(trailingOnly = TRUE))
f <- file("stdin")
lines <- readLines(f)
words <- tolower(unlist(strsplit(lines, "\\W+")))
counts <- sort(table(words), decreasing = TRUE)
counts_n <- counts[1:n]
cat(sprintf("%7d %s\n", counts_n, names(counts_n)), sep = "")
close(f)

Let’s check that all three implementations (i.e., Bash, Python, and R) return the same top 5 words with the same counts:

$ < data/76.txt top-words.sh 5
   6441 and
   5082 the
   3666 i
   3258 a
   3022 to
$ < data/76.txt top-words.py 5
   6441 and
   5082 the
   3666 i
   3258 a
   3022 to
$ < data/76.txt top-words.R 5
   6441 and
   5082 the
   3666 i
   3258 a
   3022 to

Wonderful! Sure, the output itself is not very exciting. What is exciting is the observation that we can accomplish the same task with multiple approaches. Let’s have a look at the differences between the approaches.

First, what’s immediately obvious is the difference in amount of code. For this specific task, both Python and R require much more code than Bash. This illustrates that, for some tasks, it is better to use the command line. For other tasks, you may be better off using a programming language. As you gain more experience on the command line, you will start to recognize when to use which approach. When everything is a command-line tool, you can even split up the task into subtasks, and combine a Bash command-line tool with, say, a Python command-line tool. Use whichever approach works best for the task at hand.
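
As a sketch of such a split, suppose an upstream Bash pipeline already lowercases the text and emits one word per line. The Python part then only has to count. The function names top_counts and format_counts and the sample input are hypothetical; in a real tool the words would come from sys.stdin instead of a list.

```python
from collections import Counter


def top_counts(words, n):
    """Return the n most common non-empty words as (word, count) pairs."""
    cleaned = (w.strip() for w in words)
    return Counter(w for w in cleaned if w).most_common(n)


def format_counts(pairs):
    """Format each pair as a right-aligned count followed by the word."""
    return ["%7d %s" % (count, word) for word, count in pairs]


# Stand-in for sys.stdin: one already-lowercased word per line.
sample = ["the\n", "and\n", "the\n", "a\n", "the\n", "and\n"]
for row in format_counts(top_counts(sample, 2)):
    print(row)
```

Keeping the counting in a function that accepts any iterable of lines also makes it easy to try out in an interactive session before wiring it to standard input.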

4.3.2 Processing Streaming Data from Standard Input

In the previous two code snippets, both Python and R read the complete standard input at once. On the command line, most command-line tools pipe data to the next command-line tool in a streaming fashion. (There are a few command-line tools which require the complete data before they write any data to standard output, like sort, and awk (Brennan 1994) when the program only prints its results in an END block.) This means the pipeline is blocked by such command-line tools. This does not have to be a problem when the input data is finite, like a file. However, when the input data is a non-stop stream, such blocking command-line tools are useless.
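
The difference between streaming and blocking can be illustrated in a few lines of Python. This is only a sketch: noisy_source is a hypothetical stand-in for standard input that records every line it hands out, so that we can see how much input was consumed before the first result became available.

```python
def squares(lines):
    """Streaming: yield one result per input line as soon as it arrives."""
    for line in lines:
        yield int(line) ** 2


def sorted_numbers(lines):
    """Blocking: nothing can be returned until every line has been read."""
    return sorted(int(line) for line in lines)


def noisy_source(values, log):
    """Hypothetical input stream that records each line it hands out."""
    for value in values:
        log.append(value)
        yield value


log = []
stream = squares(noisy_source(["3", "1", "2"], log))
first = next(stream)
print(first, log)      # only one input line has been consumed so far

log = []
result = sorted_numbers(noisy_source(["3", "1", "2"], log))
print(result, log)     # all three lines were consumed before any output
```

Sorting cannot be made streaming, because the smallest value might be on the last line. Squaring, on the other hand, depends only on the current line, which is why it works on a never-ending stream.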

Luckily, Python and R support processing streaming data. You can apply a function on a line-by-line basis, for example. Example 4.6 and Example 4.7 are two minimal examples that demonstrate how this works in Python and R, respectively.

Example 4.6 (~/book/ch04/stream.py)

#!/usr/bin/env python
from sys import stdin, stdout
while True:
    line = stdin.readline()
    if not line:
        break
    stdout.write("%d\n" % int(line)**2)
    stdout.flush()

Example 4.7 (~/book/ch04/stream.R)

#!/usr/bin/env Rscript
f <- file("stdin")
open(f)
while(length(line <- readLines(f, n = 1)) > 0) {
        write(as.integer(line)^2, stdout())
}
close(f)
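
The Python loop can also be written by iterating over the input directly, which, as far as we know, yields lines as they arrive in Python 3. The following sketch takes that approach and additionally skips lines that are not integers instead of stopping with an error. The function name square_stream and the choice to report skipped lines on standard error are assumptions for illustration; in a real tool we would pass sys.stdin and sys.stdout rather than in-memory streams.

```python
import io
import sys


def square_stream(source, sink, errors=sys.stderr):
    """Write the square of each integer line in source to sink, one at a time."""
    for line in source:
        text = line.strip()
        if not text:
            continue                  # skip blank lines
        try:
            value = int(text)
        except ValueError:
            errors.write("skipping non-integer line: %r\n" % text)
            continue
        sink.write("%d\n" % value ** 2)
        sink.flush()                  # hand each result downstream immediately


# Demonstration with in-memory streams standing in for stdin and stdout.
out = io.StringIO()
err = io.StringIO()
square_stream(io.StringIO("1\n2\n\nx\n3\n"), out, err)
print(out.getvalue(), end="")
```

Flushing after every line keeps the tool responsive when its output is piped into another command-line tool, at the cost of some throughput on large finite inputs.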