Chapter 6 Managing Your Data Workflow

We hope that by now, you have come to appreciate that the command line is a very convenient environment for exploratory data analysis. You may have noticed that, as a consequence of working with the command line, we:

  • Invoke many different commands.
  • Create custom command-line tools.
  • Obtain and generate many (intermediate) files.

As this process is of an exploratory nature, our workflow tends to be rather chaotic, which makes it difficult to keep track of what we’ve done. It is very important that our steps can be reproduced, whether by ourselves or by others. When we, for example, pick up a project again after a few weeks, chances are that we have forgotten which commands we ran, on which files, in which order, and with which parameters. Imagine the difficulty of passing on your analysis to a collaborator.

You may recover some lost commands by digging through your Bash history, but this is, of course, not a reliable approach. A better approach is to save your commands to a shell script, say run.sh, so that you and your collaborators can at least reproduce the analysis. A minimal sketch of such a script follows.
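For illustration, here is what such a script might look like; the URL, filenames, and commands are all made up for the sake of the example:

    #!/usr/bin/env bash
    # run.sh: a hypothetical three-step analysis.
    # Every step is re-executed on every run, and the dependencies
    # between the steps are only implicit in their order.
    curl -s "http://example.com/data.csv" > data.csv
    < data.csv cut -d, -f2 | sort | uniq -c > counts.txt
    sort -rn counts.txt | head > top10.txt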

A shell script is, however, a sub-optimal approach because:

  • It is difficult to read and to maintain.
  • Dependencies between steps are unclear.
  • Every step gets executed every time, which is inefficient and sometimes undesirable.

This is where Drake comes in handy (Factual 2014). Drake is a command-line tool created by Factual that allows you to:

  • Formalize your data workflow steps in terms of input and output dependencies.
  • Run specific steps of your workflow from the command line.
  • Use inline code.
  • Store and retrieve data from external sources.
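To give a flavor of the first two points, below is a minimal sketch of a Drakefile that expresses the same three hypothetical steps as the run.sh script above. It assumes Drake’s output <- input step syntax and its $INPUT and $OUTPUT variables; the URL and filenames remain made up:

    ; Drakefile: each step declares its output and its input dependencies,
    ; so Drake can work out the order and skip steps that are up to date.
    data.csv <-
        curl -s "http://example.com/data.csv" > $OUTPUT

    counts.txt <- data.csv
        < $INPUT cut -d, -f2 | sort | uniq -c > $OUTPUT

    top10.txt <- counts.txt
        sort -rn $INPUT | head > $OUTPUT

Running drake in the directory that contains this file would then execute only the steps whose outputs are missing or out of date, and naming a target, for example drake counts.txt, would restrict the run to the steps that output depends on.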