Chapter 8 Parallel Pipelines - 8.2 Serial Processing - 《Data Science at the Command Line》

8.2 Serial Processing

Before we dive into parallelization, we will look at looping in a serial fashion. It’s worthwhile to know how to do this because this functionality is always available, the syntax closely resembles looping in other programming languages, and it will really make you appreciate the tool GNU Parallel.

From the examples provided in the introduction of this chapter, we can distill three types of items to loop over: (1) numbers, (2) lines, and (3) files. These three types of items will be discussed in the next three subsections, respectively.

8.2.1 Looping Over Numbers

Imagine that we need to compute the square of every even integer between 0 and 100. There’s a tool called bc, which is basically a calculator on the command line where you can pipe an equation to. The command to compute the square of 4 looks as follows:

$ echo "4^2" | bc
16

For a one-off calculation, this is perfect. However, as mentioned in the introduction, we would be crazy to press <Up>, change the number, and press <Enter> 51 times! In this case, it is better to let Bash do the hard work for us by using a for loop:

$ for i in {0..100..2}  
> do
> echo "$i^2" | bc      
> done | tail           
6724
7056
7396
7744
8100
8464
8836
9216
9604
10000

There are a number of things going on here:

Bash has a feature called brace expansion, which transforms {0..100..2} into a list separated by spaces: 0 2 4 … 98 100.
The variable i is assigned the value 0 in the first iteration, 2 in the second iteration, and so forth. The value of this variable can be used in commands by prefixing it with a dollar sign ($). The shell replaces $i with its value before echo is executed. Note that there can be more than one command between do and done.
We pipe the output of the for loop to tail so that we see only the last ten values.

Although the syntax may appear a bit odd compared to your favorite programming language, it is worth remembering because it is always available in the Bash shell. We will shortly introduce a better and more flexible way of repeating commands.
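To see what the loop actually iterates over, it can help to print the expanded list first. The following is a minimal sketch, not the chapter's own example: it uses a smaller range ({0..10..2}) to keep the output short, and it swaps bc for Bash's built-in arithmetic expansion $((...)), a substitution made here purely for illustration. The variable names evens, square, and squares are hypothetical. Step increments in brace expansion require Bash 4 or later.

```shell
#!/usr/bin/env bash
# Brace expansion happens before the loop runs; echo shows the resulting list.
evens=$(echo {0..10..2})
echo "Expanded list: $evens"

# A loop body may hold several commands. Here each square is computed with
# Bash arithmetic expansion instead of bc, printed, and collected in a string.
squares=""
for i in {0..10..2}
do
    square=$((i * i))
    squares="$squares $square"
    echo "$i^2 = $square"
done

# Drop the leading space left over from building the string.
squares=${squares# }
```

Arithmetic expansion only handles integers, so bc remains the better choice as soon as decimals are involved.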

8.2.2 Looping Over Lines

The second type of item we can loop over is lines. These lines can come from either a file or standard input. This is a very generic approach because the lines can contain anything, including numbers, dates, and email addresses.

Imagine that we want to send an email to our customers. Let’s generate some fake users using the https://randomuser.me/ API:

$ curl -s "https://randomuser.me/api/1.2/?results=5" > data/users.json
$ < data/users.json jq -r '.results[].email' > data/emails.txt
$ cat data/emails.txt
kaylee.anderson64@example.com
arthur.baker92@example.com
chloe.graham66@example.com
wyatt.nelson80@example.com
peter.coleman75@example.com

We can loop over the lines from data/emails.txt with a while loop:

$ while read line                                       
> do
> echo "Sending invitation to ${line}."                 
> done < data/emails.txt                                
Sending invitation to kaylee.anderson64@example.com.
Sending invitation to arthur.baker92@example.com.
Sending invitation to chloe.graham66@example.com.
Sending invitation to wyatt.nelson80@example.com.
Sending invitation to peter.coleman75@example.com.

In this case we need to use a while loop because Bash does not know beforehand how many lines the input consists of.
Although the curly braces around the line variable are not necessary in this case (since variable names cannot contain periods), it’s still good practice.
This redirection can also be placed before while.

You can also provide input to the while loop interactively by specifying the special file for standard input, /dev/stdin. Press <Ctrl-D> when you are done.

$ while read i; do echo "You typed: $i."; done < /dev/stdin
one
You typed: one.
two
You typed: two.
three
You typed: three.

This method, however, has the disadvantage that, once you press <Enter>, the command(s) between do and done are run immediately for that line of input.
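When the lines may contain backslashes or leading and trailing spaces, the plain read used above will alter them. The following is a hedged sketch of a more defensive variant of the same while loop: IFS= stops read from trimming whitespace, and -r stops it from interpreting backslashes. The temporary file and its contents, as well as the variable names emails_file, count, and last_line, are made up for this illustration.

```shell
#!/usr/bin/env bash
# Hypothetical input file, created only for this sketch.
emails_file=$(mktemp)
printf '%s\n' 'kaylee.anderson64@example.com' '  padded\with\backslash  ' > "$emails_file"

count=0
last_line=""
# IFS= keeps leading and trailing whitespace; -r keeps backslashes literal.
while IFS= read -r line
do
    count=$((count + 1))
    last_line=$line
    echo "Sending invitation to ${line}."
done < "$emails_file"

# Clean up the temporary file.
rm -f "$emails_file"
```

Because the input arrives through a redirection rather than a pipe, the loop runs in the current shell and count and last_line keep their values after done.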

8.2.3 Looping Over Files

In this section we discuss the third type of item that we often need to loop over: files.

To handle special characters in filenames, use globbing (i.e., pathname expansion) instead of parsing the output of ls:

$ for filename in *.csv
> do
> echo "Processing ${filename}."
> done
Processing countries.csv.

Just as with brace expansion for numbers, *.csv is first expanded into a list before the for loop processes it.
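One consequence of this expansion step is worth guarding against: when no file matches, the shell by default leaves the pattern unexpanded, so the loop would run once with the literal string *.csv. The sketch below, which assumes a throwaway directory and hypothetical filenames, skips such a leftover pattern with an existence test and also shows that a filename containing a space stays in one piece.

```shell
#!/usr/bin/env bash
# Throwaway directory with two hypothetical CSV files, one containing a space.
workdir=$(mktemp -d)
touch "$workdir/countries.csv" "$workdir/top 250.csv"

processed=0
for filename in "$workdir"/*.csv
do
    # If nothing matches, the pattern itself is left unexpanded; skip it.
    [ -e "$filename" ] || continue
    processed=$((processed + 1))
    echo "Processing ${filename}."
done

# No .tsv files exist, so the guard skips the single unexpanded pattern.
skipped=0
for filename in "$workdir"/*.tsv
do
    [ -e "$filename" ] || continue
    skipped=$((skipped + 1))
done

# Clean up the throwaway directory.
rm -rf "$workdir"
```

Quoting "$workdir" while leaving *.csv unquoted is deliberate: the quotes protect the directory name, and the unquoted asterisk is what the shell expands.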

A more elaborate tool for finding files is find (Youngman 2008), which:

Allows for elaborate searching on properties such as size, access time, and permissions.
Handles filenames that start with a dash.
Handles special characters such as spaces and newlines.

$ find data -name '*.csv' -exec echo "Processing {}" \;
Processing data/countries.csv
Processing data/movies.csv
Processing data/top250.csv

Here’s the same command, but using parallel:

$ find data -name '*.csv' -print0 | parallel -0 echo "Processing {}"
Processing data/countries.csv
Processing data/movies.csv
Processing data/top250.csv

The -print0 option makes find terminate each filename with a null character instead of a newline, and the -0 option tells parallel to expect that delimiter. Together they allow filenames that contain newlines or other types of white space to be interpreted correctly. If you are absolutely certain that the filenames contain no special characters such as spaces and newlines, then you can omit the -print0 and -0 options.
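The null-delimited output of find can also be consumed serially, without parallel. The following sketch assumes Bash, because it relies on read -d '' (read up to the next null character) and on process substitution; the directory, filenames, and the variable found are hypothetical.

```shell
#!/usr/bin/env bash
# Throwaway directory with hypothetical files; only two of them are CSV files.
workdir=$(mktemp -d)
touch "$workdir/movies.csv" "$workdir/top 250.csv" "$workdir/notes.txt"

found=0
# -d '' makes read split on null characters, matching find's -print0 output.
# Process substitution keeps the loop in the current shell, so found survives.
while IFS= read -r -d '' filename
do
    found=$((found + 1))
    echo "Processing ${filename}"
done < <(find "$workdir" -name '*.csv' -print0)

# Clean up the throwaway directory.
rm -rf "$workdir"
```

Piping find into the while loop would also work for printing, but the loop would then run in a subshell and the counter would be lost afterwards.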

If the list to process becomes too complex, you can always store the result in a temporary file and then loop over its lines, as described in the previous subsection.
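As a rough sketch of that fallback, the example below writes the output of find to a temporary file and then reads it back line by line. It assumes that no filename contains a newline, since each line is treated as one file; the directory, the filenames, and the variables list and n are hypothetical.

```shell
#!/usr/bin/env bash
# Throwaway directory with two hypothetical CSV files.
workdir=$(mktemp -d)
touch "$workdir/countries.csv" "$workdir/movies.csv"

# Store the list of files, one per line, in a temporary file.
list=$(mktemp)
find "$workdir" -name '*.csv' | sort > "$list"

n=0
# Loop over the stored lines, as in the subsection on looping over lines.
while IFS= read -r filename
do
    n=$((n + 1))
    echo "Processing ${filename}"
done < "$list"

# Clean up the directory and the list.
rm -rf "$workdir" "$list"
```

Keeping the list in a file also makes it easy to inspect or edit before any processing starts.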