Chapter 6 Managing Your Data Workflow - 6.5 Every Workflow Starts with a Single Step - 《Data Science at the Command Line》

6.5 Every Workflow Starts with a Single Step

In this section we’ll convert the above command into a Drake workflow. A workflow is just a text file. You’d usually name this file Drakefile because Drake uses that file if no other file is specified at the command line. A workflow with just a single step would look like Example 6.1.

Example 6.1 (A workflow with just a single step)

top-5 <-                                                    
    curl -s 'http://www.gutenberg.org/browse/scores/top' |  
    grep -E '^<li>' |                                       
    head -n 5 |                                             
    sed -E "s/.*ebooks\/([0-9]+).*/\\1/" > top-5

Let’s go through this file. The first line, which contains the arrow pointing to the left, is our step definition. The left side of this arrow, which says top-5, is the name or output of this step. Any inputs to this step would appear on the right side of this arrow, but since this step has no input, it’s empty. Defining inputs and outputs is what allows Drake to recognize the dependencies between steps, and to work out which steps need to be executed, and in what order, to produce a given output. This output is also known as a target. As you can see, the body of this step is literally our command from before, only indented.
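To see what the body of the step does without touching the network, the filters can be run on a few lines of stand-in HTML. This is only a sketch: the function names sample_html and extract_ids are hypothetical, and the markup and titles are made up to mimic the shape of the list items that the step expects.

```shell
#!/usr/bin/env bash

# Stand-in for the page that curl would download. The markup and titles are
# invented for illustration; only the shape of the <li> lines matters.
sample_html() {
    cat <<'EOF'
<h2>Top 100 EBooks yesterday</h2>
<ol>
<li><a href="/ebooks/1342">Title A (100)</a></li>
<li><a href="/ebooks/76">Title B (90)</a></li>
<li><a href="/ebooks/11">Title C (80)</a></li>
</ol>
EOF
}

# Same filters as the step body, minus curl and the redirect to top-5.
extract_ids() {
    grep -E '^<li>' |                        # keep only the list items
    head -n 5 |                              # at most the first five
    sed -E "s/.*ebooks\/([0-9]+).*/\\1/"     # reduce each line to the ebook ID
}

sample_html | extract_ids
```

With this input the pipeline prints the three IDs 1342, 76, and 11, one per line. In the workflow, the same output is redirected to the file top-5.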

top-5 <-: The arrow (<-) denotes the name of the step and its dependencies. More on this later.
curl: The body is indented.
grep: Select only list items.
head: Get the first five items.
sed: Extract the ID and save it to the file top-5. Note that top-5 was already specified in the step definition and that 5 has now been used three times. We are going to address that later.

This workflow is as simple as it gets. It doesn’t offer any advantages over having our command in a Bash script. But don’t worry, we promise you that it will get more exciting. For now, let’s run Drake and see what it does with our first workflow:

$ drake
The following steps will be run, in order:
  1: top-5 <-  [missing output]
Confirm? [y/n] y
Running 1 steps with concurrence of 1...
--- 0. Running (missing output): top-5 <-
--- 0: top-5 <-  -> done in 0.35s
Done (1 steps run).

If we do not specify a workflow file, Drake uses ./Drakefile. Drake first determines which steps need to be run. In our case, the one and only step will be run because its output is missing. This means that there’s no file named data/top-5. Drake asks for confirmation before it executes these steps. We press y, and shortly afterwards we see that Drake is done. Drake did not complain about any errors in our steps. Let’s verify that we have the top five books by looking at the output file data/top-5:

$ cat data/top-5
1342
76
11
1661
1952

Now we do have the output file. Let’s run Drake again:

$ drake
The following steps will be run, in order:
  1: top-5 <-  [no-input step]
Confirm? [y/n] n
Aborted.

As you can see, Drake wants to execute the step again! This time, however, it gives a different reason: the step has no input ([no-input step]). Drake’s default behavior is to check whether the input has changed by looking at its timestamp. However, since we didn’t specify any input, Drake doesn’t know whether or not this step should be run again. We can disable this timestamp check as follows:

top-5 <- [-timecheck]
    curl -s 'http://www.gutenberg.org/browse/scores/top' |
    grep -E '^<li>' |
    head -n 5 |
    sed -E "s/.*ebooks\/([0-9]+)\">([^<]+)<.*/\\1,\\2/" > top-5

The square brackets around [-timecheck] indicate that this is an option to the step. The minus (-) means that we wish to disable checking timestamps. Now, this step is only run when the output is missing.
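The idea behind a timestamp check can be sketched in plain Bash. This is not Drake’s implementation, only a make-style rule written under that assumption, and the function name needs_run is hypothetical: a step is run when its output is missing or when any input is newer than the output. When called without inputs, the function only looks at whether the output exists, which matches what [-timecheck] gives us for our step. Drake’s own default for a no-input step is different, as we saw above.

```shell
#!/usr/bin/env bash

# needs_run OUTPUT [INPUT...]
# Succeeds (exit status 0) when the step should be executed.
needs_run() {
    local output=$1
    shift
    # A missing output always triggers the step.
    [ -e "$output" ] || return 0
    # Any input that is newer than the output makes the output stale.
    local input
    for input in "$@"; do
        [ "$input" -nt "$output" ] && return 0
    done
    return 1
}

# Usage: the output file lives in a scratch directory.
tmp=$(mktemp -d)
needs_run "$tmp/top-5" && echo "run: output is missing"
touch "$tmp/top-5"
needs_run "$tmp/top-5" || echo "skip: output exists and there are no inputs"
rm -r "$tmp"
```

The -nt test compares modification times, so it is only as reliable as the timestamps on the files involved.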

We’re going to use different filenames so that we keep old versions. We can specify a different workflow name (other than Drakefile) with the -w option. Let’s run Drake once more:

$ mv Drakefile 01.drake
$ drake -w 01.drake
Nothing to do.

Our very first workflow is already saving us time because Drake detects that the step does not need to be executed again. However, we can do much better than this. This workflow has three shortcomings that we’re going to address in the next section.