Chapter 8 Parallel Pipelines

In the previous chapters, we have been dealing with commands and pipelines that take care of an entire task at once. In practice, however, you may find yourself facing a task that requires the same command or pipeline to run multiple times. For example, you may need to:

  • Scrape hundreds of web pages.
  • Make dozens of API calls and transform their output.
  • Train a classifier for a range of parameter values.
  • Generate scatter plots for every pair of features in your dataset.

All of the above examples involve some form of repetition. In your favorite scripting or programming language, you would take care of this with a for loop or a while loop. On the command line, the first thing you might be inclined to do is to press <Up> (which brings back the previous command), modify it if necessary, and press <Enter> (which runs the command again). This is fine for two or three repetitions, but imagine doing this for, say, dozens of files. Such an approach quickly becomes cumbersome and time-consuming. The good news is that we can write such loops on the command line as well. This chapter is all about repetition.
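To give a rough idea of what a loop on the command line looks like, here is a minimal sketch in POSIX shell syntax. The function name square_all and the input values are hypothetical and chosen only for illustration; the body of the loop stands in for whatever command or pipeline you need to repeat.

```shell
# square_all is a hypothetical helper: it repeats one small "task"
# (squaring a number) for every argument it receives.
square_all() {
    for n in "$@"; do
        # Arithmetic expansion stands in for a real command or pipeline.
        echo "$((n * n))"
    done
}

# Run the same task for three values, one after the other (in serial).
square_all 1 2 3
```

The loop runs its body once per value, in order, and each run finishes before the next one starts.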

Sometimes, repeating a fast command one after the other (in serial) is sufficient. When you have multiple cores (and perhaps even multiple machines), it would be nice if you could make use of those, especially when you’re faced with a data-intensive task. When using multiple cores or machines, the total running time may be reduced significantly. In this chapter we will introduce a very powerful tool called GNU Parallel that can take care of exactly this. GNU Parallel allows us to apply a command or pipeline to a range of arguments such as numbers, lines, and files. Plus, it allows us to run our commands in parallel.