Chapter 1 Introduction - 1.3 Intermezzo Chapters - 《Data Science at the Command Line》

1.3 Intermezzo Chapters

1.3 Intermezzo Chapters

In between the chapters that cover the OSEMN steps, there are three intermezzo chapters. Each intermezzo chapter discusses a more general topic concerning data science, and how the command line is employed for that. These topics are applicable to any step in the data science process.

In Chapter 4, we discuss how to create reusable tools for the command line. These personal tools can come from both long commands that you have typed on the command line, or from existing code that you have written in, say, Python or R. Being able to create your own tools allows you to become more efficient and productive.

Because the command line is an interactive environment for doing data science, it can become challenging to keep track of your workflow. In Chapter 6, we demonstrate a command-line tool called Drake (Factual 2014), which allows you to define your data science workflow in terms of tasks and the dependencies between them. This tool increases the reproducibility of your workflow, not only for you but also for your colleagues and peers.

In Chapter 8, we explain how your commands and tools can be sped up by running them in parallel. Using a command-line tool called GNU Parallel (Tange 2014), we can apply command-line tools to very large data sets and run them on multiple cores and remote machines.