Chapter 4 Creating Reusable Command-line Tools - 4.2 Converting One-liners into Shell Scripts - 《Data Science at the Command Line》

4.2 Converting One-liners into Shell Scripts

In this section we are going to explain how to turn a one-liner into a reusable command-line tool. Imagine that we have the following one-liner:

$ curl -s http://www.gutenberg.org/files/76/76-0.txt |
> tr '[:upper:]' '[:lower:]' | 
> grep -oE '\w+' |             
> sort |                       
> uniq -c |                    
> sort -nr |                   
> head -n 10                   
   6441 and
   5082 the
   3666 i
   3258 a
   3022 to
   2567 it
   2086 t
   2044 was
   1847 he
   1778 of

In short, as you may have guessed from the output, this one-liner returns the top ten words of the e-book version of Adventures of Huckleberry Finn. It accomplishes this by:

Downloading an ebook using curl.
Converting the entire text to lowercase using tr (Meyering 2012 c).
Extracting all the words using grep (Meyering 2012 a) and putting each word on a separate line.
Sorting these words in alphabetical order using sort (Haertel and Eggert 2012).
Removing all the duplicates and counting how often each word appears in the list using uniq (Stallman and MacKenzie 2012 b).
Sorting this list of unique words by their count in descending order using sort.
Keeping only the top 10 lines (i.e., words) using head.
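
To see what each stage contributes, the same pipeline can be run on a tiny input and inspected step by step. The sketch below assumes a hand-made sample sentence (not taken from the e-book) and stores the intermediate results in illustrative variables named words, counts, and top.

```shell
# A small stand-in for the downloaded text (hypothetical sample, not from the e-book).
text='The cat and the dog. The END'

# Lowercase everything, then put each word on its own line.
words=$(printf '%s\n' "$text" | tr '[:upper:]' '[:lower:]' | grep -oE '\w+')
printf '%s\n' "$words"

# Sort so that duplicates become adjacent, then collapse and count them.
counts=$(printf '%s\n' "$words" | sort | uniq -c)
printf '%s\n' "$counts"

# Order by count, highest first, and keep only the top line.
top=$(printf '%s\n' "$counts" | sort -nr | head -n 1)
printf '%s\n' "$top"
```

The sort before uniq matters: uniq only merges lines that are next to each other, so unsorted input would be counted in separate runs.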

Each command-line tool used in this one-liner offers a man page. So in case you would like to know more about, say, grep, you can run man grep from the command line. The command-line tools tr, grep, uniq, and sort will be discussed in more detail in the next chapter.

There is nothing wrong with running this one-liner just once. However, imagine if we wanted to have the top 10 words of every e-book on Project Gutenberg. Or imagine that we wanted the top 10 words of a news website on an hourly basis. In those cases, it would be best to have this one-liner as a separate building block that can be part of something bigger. Because we want to add some flexibility to this one-liner in terms of parameters, we will turn it into a shell script.

Because we use Bash as our shell, the script will be written in the programming language Bash. This allows us to take the one-liner as the starting point, and gradually improve on it. To turn this one-liner into a reusable command-line tool, we’ll walk you through the following six steps:

Copy and paste the one-liner into a file.
Add execute permissions.
Define a so-called shebang.
Remove the fixed input part.
Add a parameter.
Optionally extend your PATH.

4.2.1 Step 1: Copy and Paste

The first step is to create a new file. Open your favorite text editor and copy and paste our one-liner. We name the file top-words-1.sh (the 1 stands for the first step towards our new command-line tool) and put it in the ~/book/ch04 directory, but you may choose a different name and location. The contents of the file should look something like Example 4.1.

Example 4.1 (~/book/ch04/top-words-1.sh)

curl -s http://www.gutenberg.org/files/76/76-0.txt |
tr '[:upper:]' '[:lower:]' | grep -oE '\w+' | sort |
uniq -c | sort -nr | head -n 10

We are using the file extension .sh to make clear that we are creating a shell script. However, command-line tools do not need to have an extension. In fact, command-line tools rarely have extensions.

Here is a nice little command-line trick. On the command line, !! will be substituted with the previous command. So, if you realize you needed superuser privileges for the previous command, you can run sudo !! (Miller 2013). And if you want to save the previous command into a file without having to copy and paste it, you can run echo "!!" > scriptname. Be sure to check the contents of the file scriptname for correctness before executing it, because this trick may not always work when your command contains quotes.

We can now use bash (Fox and Ramey 2010) to interpret and execute the commands in the file:

$ bash book/ch04/top-words-1.sh
   6441 and
   5082 the
   3666 i
   3258 a
   3022 to
   2567 it
   2086 t
   2044 was
   1847 he
   1778 of

This already saves us from typing the one-liner. Because the file cannot be executed on its own, we cannot really speak of a true command-line tool, yet. Let us change that in the next step.

4.2.2 Step 2: Add Permission to Execute

The reason we cannot execute our file directly is that we do not have the correct access permissions. In particular, you, as a user, need to have the permission to execute the file. In this section we change the access permissions of our file.

In order to compare differences between steps, we copy the file to top-words-2.sh using cp top-words-{1,2}.sh. You can keep working with the same file if you want to.
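
The curly braces in that cp command are expanded by Bash before cp ever runs. A minimal way to check what a brace expression turns into is to put echo in front of it, as sketched here. Bash is invoked explicitly because brace expansion is not available in every sh implementation.

```shell
# echo prints the command line exactly as cp would receive it after expansion.
expanded=$(bash -c 'echo cp top-words-{1,2}.sh')
echo "$expanded"
```

So cp top-words-{1,2}.sh is simply a shorter way of typing both filenames.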

To change the access permissions of a file, we need to use a command-line tool called chmod (MacKenzie and Meyering 2012 a), which stands for change mode. It changes the file mode bits of a specific file. The following command gives the user, you, the permission to execute top-words-2.sh:

$ cd ~/book/ch04/
$ chmod u+x top-words-2.sh

The command-line argument u+x consists of three characters: (1) u indicates that we want to change the permissions for the user who owns the file, which is you, because you created the file; (2) + indicates that we want to add a permission; and (3) x indicates the permission to execute. Let us now have a look at the access permissions of both files:

$ ls -l top-words-{1,2}.sh
-rw-rw-r-- 1 vagrant vagrant 145 Jul 20 23:33 top-words-1.sh
-rwxrw-r-- 1 vagrant vagrant 143 Jul 20 23:34 top-words-2.sh

The first column shows the access permissions for each file. For top-words-2.sh, this is -rwxrw-r--. The first character - indicates the file type. A - means regular file and a d means directory. The next three characters rwx indicate the access permissions for the user who owns the file. The r and w mean read and write respectively. (As you can see, top-words-1.sh has a - instead of an x, which means that we cannot execute that file.) The next three characters rw- indicate the access permissions for all members of the group that owns the file. Finally, the last three characters in the column r-- indicate access permissions for all other users.
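
The effect of chmod u+x on the permission string can be reproduced on a throwaway file. The following sketch assumes a temporary directory and a hypothetical file named demo.sh; it sets the starting mode explicitly with chmod 664 so that the result does not depend on your umask.

```shell
# Work in a throwaway directory so no real files are touched.
dir=$(mktemp -d)
printf 'echo hello\n' > "$dir/demo.sh"

# Start from the same mode as top-words-1.sh in the listing: rw-rw-r--.
chmod 664 "$dir/demo.sh"
before=$(ls -l "$dir/demo.sh" | cut -c1-10)

# Add the execute permission for the owning user only.
chmod u+x "$dir/demo.sh"
after=$(ls -l "$dir/demo.sh" | cut -c1-10)

echo "before: $before"
echo "after:  $after"
rm -r "$dir"
```

Only the fourth character changes, from - to x, which is the execute bit of the owning user.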

Now you can execute the file as follows:

$ book/ch04/top-words-2.sh
   6441 and
   5082 the
   3666 i
   3258 a
   3022 to
   2567 it
   2086 t
   2044 was
   1847 he
   1778 of

Note that if you’re ever in the same directory as the executable, you need to execute it as follows:

$ cd ~/book/ch04
$ ./top-words-2.sh

If you try to execute a file for which you do not have the correct access permissions, as with top-words-1.sh, you will see the following error message:

$ ./top-words-1.sh
bash: ./top-words-1.sh: Permission denied
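
Besides printing the message, the shell also reports the failure through its exit status, which is 126 when a file is found but cannot be executed. The sketch below uses a hypothetical demo.sh in a temporary directory to show the status before and after adding the execute permission.

```shell
dir=$(mktemp -d)
printf '#!/usr/bin/env bash\necho hello\n' > "$dir/demo.sh"
chmod a-x "$dir/demo.sh"   # make sure no execute bit is set

# Attempt to run the file; the "Permission denied" message goes to standard error.
"$dir/demo.sh" 2>/dev/null
status=$?
echo "exit status without execute permission: $status"

# After adding the permission, the same invocation succeeds.
chmod u+x "$dir/demo.sh"
output=$("$dir/demo.sh")
echo "output after chmod u+x: $output"
rm -r "$dir"
```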

4.2.3 Step 3: Define Shebang

Although we can already execute the file on its own, we should add a so-called shebang to the file. The shebang is a special line in the script, which instructs the system which executable should be used to interpret the commands.

In our case we want to use bash to interpret our commands. Example 4.2 shows what the file top-words-3.sh looks like with a shebang.

Example 4.2 (~/book/ch04/top-words-3.sh)

#!/usr/bin/env bash
curl -s http://www.gutenberg.org/files/76/76-0.txt |
tr '[:upper:]' '[:lower:]' | grep -oE '\w+' | sort |
uniq -c | sort -nr | head -n 10

The name shebang comes from the first two characters: a hash (she) and an exclamation mark (bang). It is not a good idea to leave it out, as we have done in the previous step, because then the interpreter depends on how the file is launched. Bash, which is the shell that we are using, runs such a file itself, but other shells and programs may fall back to a different default, such as /bin/sh.

Sometimes you will come across scripts that have a shebang in the form of #!/usr/bin/bash or #!/usr/bin/python (in the case of Python, as we will see in the next section). While this generally works, if the bash or python (Python Software Foundation 2014) executables are installed in a different location than /usr/bin, then the script does not work anymore. It is better to use the form that we present here, namely #!/usr/bin/env bash and #!/usr/bin/env python, because the env (Mlynarik and MacKenzie 2012) executable finds bash and python by searching the directories in your PATH. In short, using env makes your scripts more portable.
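
As a small illustration of the env form, the sketch below writes a two-line script (the name which-shell.sh is made up for this example) into a temporary directory and runs it. The script never mentions where bash is installed; command -v is used only to show the location on the current machine.

```shell
dir=$(mktemp -d)

# Write a script whose shebang asks env to look up bash on the PATH.
cat > "$dir/which-shell.sh" <<'EOF'
#!/usr/bin/env bash
echo "interpreted by bash ${BASH_VERSION}"
EOF
chmod u+x "$dir/which-shell.sh"

# Show where bash lives on this machine; the shebang does not need to know.
command -v bash

result=$("$dir/which-shell.sh")
echo "$result"
rm -r "$dir"
```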

4.2.4 Step 4: Remove Fixed Input

We now have a valid command-line tool that we can execute from the command line. But we can do better than this. We can make our command-line tool more reusable. The first command in our file is curl, which downloads the text from which we wish to obtain the top 10 most-used words. So, the data and operations are combined into one.

What if we wanted to obtain the top 10 most-used words from another e-book, or any other text for that matter? The input data is fixed within the tool itself. It would be better to separate the data from the command-line tool.

If we assume that the user of the command-line tool will provide the text, it will become generally applicable. So, the solution is to simply remove the curl command from the script. See Example 4.3 for the updated script named top-words-4.sh.

Example 4.3 (~/book/ch04/top-words-4.sh)

#!/usr/bin/env bash
tr '[:upper:]' '[:lower:]' | grep -oE '\w+' | sort |
uniq -c | sort -nr | head -n 10

This works because if a script starts with a command that needs data from standard input, like tr, that command will read the input that is given to the command-line tool. For example:

$ cat data/finn.txt | top-words-4.sh

Although we have not done so in our script, the same principle holds for saving data. It is, in general, better to let the user take care of that. Of course, if you intend to use a command-line tool only for your own projects, then there are no limits to how specific you can be.
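
Because the tool now reads standard input, it does not matter where the text comes from. The sketch below wraps the body of top-words-4.sh in a shell function (called top_words here purely for illustration) and feeds it the same made-up sample text once through a pipe and once through a redirect.

```shell
# Same pipeline as top-words-4.sh; tr is given no file, so it reads standard input.
top_words() {
  tr '[:upper:]' '[:lower:]' | grep -oE '\w+' | sort |
  uniq -c | sort -nr | head -n 10
}

# Input arriving through a pipe ...
from_pipe=$(printf 'b a B c b\n' | top_words)

# ... or redirected from a file (a temporary one here).
tmp=$(mktemp)
printf 'b a B c b\n' > "$tmp"
from_file=$(top_words < "$tmp")
rm "$tmp"

printf '%s\n' "$from_pipe"
```

Both invocations produce the same counts, which is what makes the tool usable as a building block in other pipelines.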

4.2.5 Step 5: Parametrize

There is one more step that we can perform in order to make our command-line tool even more reusable: parameters. In our command-line tool there are a number of fixed command-line arguments, for example -nr for sort and -n 10 for head. It is probably best to keep the former argument fixed. However, it would be very useful to allow for different values for the head command. This would allow the end user to set the number of most-often used words to be outputted. Example 4.4 shows what our file top-words-5.sh looks like if we parametrize head.

Example 4.4 (~/book/ch04/top-words-5.sh)

#!/usr/bin/env bash
NUM_WORDS="$1"                                        
tr '[:upper:]' '[:lower:]' | grep -oE '\w+' | sort |
uniq -c | sort -nr | head -n $NUM_WORDS

The variable NUM_WORDS is set to the value of $1, which is a special variable in Bash. It holds the value of the first command-line argument passed to our command-line tool. Bash offers other special variables as well, such as $2 and $3 for the second and third arguments and $# for the number of arguments.

Note that in order to use the value of the NUM_WORDS variable, you need to put a dollar sign in front of it. When you set it, you do not write a dollar sign. We could have also used $1 directly as an argument for head and not bothered creating an extra variable such as NUM_WORDS. However, with larger scripts and a few more command-line arguments such as $2 and $3, the code becomes more readable when you use named variables.
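
A quick way to get a feel for these special variables is to print them from a small script. The following sketch writes such a script to a temporary file and runs it with two made-up arguments, 5 and finn.txt.

```shell
script=$(mktemp)

# The quoted 'EOF' keeps $#, $1, $2 and $* from being expanded while writing the file.
cat > "$script" <<'EOF'
echo "number of arguments: $#"
echo "first argument:      $1"
echo "second argument:     $2"
echo "all arguments:       $*"
EOF

out=$(bash "$script" 5 finn.txt)
printf '%s\n' "$out"
rm "$script"
```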

Now if we wanted to see the top 5 most-used words of our text, we would invoke our command-line tool as follows:

$ cat data/finn.txt | top-words-5.sh 5

If the user does not provide an argument, then head will return an error message, because the value of $1, and therefore of $NUM_WORDS, will be an empty string:

$ cat data/finn.txt | top-words-5.sh
head: option requires an argument -- 'n'
Try 'head --help' for more information.
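
One possible way to avoid this error, which is not part of the script above, is to give the parameter a default value with the ${1:-10} form of parameter expansion. The sketch below shows the idea as a shell function; the name top_words and the default of 10 are assumptions made for this example.

```shell
top_words() {
  # Use the first argument if one is given, otherwise fall back to 10.
  local num_words="${1:-10}"
  tr '[:upper:]' '[:lower:]' | grep -oE '\w+' | sort |
  uniq -c | sort -nr | head -n "$num_words"
}

# Twelve distinct words: without an argument, ten lines come back.
default_lines=$(printf '%s\n' a b c d e f g h i j k l | top_words | wc -l | tr -d ' ')

# With an explicit argument, that number is used instead.
three_lines=$(printf '%s\n' a b c d e f g h i j k l | top_words 3 | wc -l | tr -d ' ')

echo "default: $default_lines lines, with argument 3: $three_lines lines"
```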

4.2.6 Step 6: Extend Your PATH

After the previous five steps we are finally finished building a reusable command-line tool. There is, however, one more step that can be very useful. In this optional step we are going to ensure that you can execute your command-line tools from everywhere.

Currently, when you want to execute your command-line tool, you either have to navigate to the directory it is in or include the full path name as shown in step 2. This is fine if the command-line tool is specifically built for, say, a certain project. However, if your command-line tool could be applied in multiple situations, then it is useful to be able to execute it from everywhere, just like the command-line tools that come with Ubuntu.

To accomplish this, Bash needs to know where to look for your command-line tools. It does this by traversing a list of directories which are stored in an environment variable called PATH. In a fresh Data Science Toolbox, the PATH looks like this:

$ echo $PATH | fold

The directories are delimited by colons. Here is the list of directories:

$ echo $PATH | tr ':' '\n'

To change the PATH permanently, you’ll need to edit the .bashrc or .profile file located in your home directory. If you put all your custom command-line tools into one directory, say, ~/tools, then you only need to change the PATH once. As you can see, the Data Science Toolbox already has /home/vagrant/.bin in its PATH. Once a tool is in a directory on the PATH, you no longer need to add the ./, but can just use the filename. Moreover, you no longer need to remember where the command-line tool is located.
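
The lookup can be tried out without touching any startup file. The sketch below assumes a temporary directory standing in for ~/tools and a made-up tool named hello-tool; it appends the directory to the PATH of the current shell session only and then runs the tool by name.

```shell
tools=$(mktemp -d)     # stands in for a directory such as ~/tools

cat > "$tools/hello-tool" <<'EOF'
#!/usr/bin/env bash
echo "hello from a tool on the PATH"
EOF
chmod u+x "$tools/hello-tool"

# Append the directory to the PATH for this shell session only.
export PATH="$PATH:$tools"

# command -v reports which file the shell would run for this name.
found=$(command -v hello-tool)
out=$(hello-tool)
echo "$found"
echo "$out"
rm -r "$tools"
```

To make such a change permanent, an export line of this form (with the real directory in place of the temporary one) would go into .bashrc or .profile.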