5.2 Common Scrub Operations for Plain Text

In this section we describe common scrubbing operations for plain text. Formally, plain text refers to a sequence of human-readable characters and, optionally, some specific types of control characters such as tabs and newlines (see http://www.linfo.org/plain_text.html). Examples include e-books, emails, log files, and source code.

For the purpose of this book, we assume that the plain text contains some data, and that it has no clear tabular structure (like the CSV format) or nested structure (like the JSON and HTML formats). We discuss those formats later in this chapter. Although the operations in this section can also be applied to the CSV, JSON, and XML/HTML formats, keep in mind that the tools then treat the data as plain text.

5.2.1 Filtering Lines

The first scrubbing operation is filtering lines. This means that each line from the input data is evaluated to determine whether it will be passed on as output.

5.2.1.1 Based on Location

The most straightforward way to filter lines is based on their location. This may be useful when you want to inspect, say, the top 10 lines of a file, or when you extract a specific row from the output of another command-line tool. To illustrate how to filter based on location, let’s create a dummy file that contains 10 lines:

  $ seq -f "Line %g" 10 | tee lines
  Line 1
  Line 2
  Line 3
  Line 4
  Line 5
  Line 6
  Line 7
  Line 8
  Line 9
  Line 10

We can print the first 3 lines using either head, sed, or awk:

  $ < lines head -n 3
  $ < lines sed -n '1,3p'
  $ < lines awk 'NR<=3'
  Line 1
  Line 2
  Line 3

Similarly, we can print the last 3 lines using tail (Rubin, MacKenzie, Taylor, et al. 2012):

  $ < lines tail -n 3
  Line 8
  Line 9
  Line 10

You can also use sed and awk for this, but tail is much faster.
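
For completeness, here’s a sketch of how sed and awk can print the last 3 lines. Both have to buffer lines because, unlike tail, they cannot seek to the end of the input, which is also why they are slower (the awk one-liner assumes the input has at least 3 lines):

  $ < lines sed -e :a -e '$q;N;4,$D;ba'
  $ < lines awk '{a[NR]=$0} END {for (i=NR-2; i<=NR; i++) print a[i]}'
  Line 8
  Line 9
  Line 10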

Removing the first 3 lines goes as follows:

  $ < lines tail -n +4
  $ < lines sed '1,3d'
  $ < lines sed -n '1,3!p'
  Line 4
  Line 5
  Line 6
  Line 7
  Line 8
  Line 9
  Line 10

Note that with tail you have to specify the number of the first line you want to keep, which is one more than the number of lines you want to remove; hence the +4.
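
For symmetry, awk can do this too with a simple condition on the line number, producing the same output:

  $ < lines awk 'NR>3'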

Removing the last 3 lines can be done with head:

  $ < lines head -n -3
  Line 1
  Line 2
  Line 3
  Line 4
  Line 5
  Line 6
  Line 7

You can print (or extract) specific lines (4, 5, and 6 in this case) using either sed, awk, or a combination of head and tail:

  $ < lines sed -n '4,6p'
  $ < lines awk '(NR>=4)&&(NR<=6)'
  $ < lines head -n 6 | tail -n 3
  Line 4
  Line 5
  Line 6

Print odd lines with sed by specifying a start and a step, or with awk by using the modulo operator:

  $ < lines sed -n '1~2p'
  $ < lines awk 'NR%2'
  Line 1
  Line 3
  Line 5
  Line 7
  Line 9

Printing even lines works in a similar manner:

  $ < lines sed -n '0~2p'
  $ < lines awk '(NR+1)%2'
  Line 2
  Line 4
  Line 6
  Line 8
  Line 10
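
The first~step address used by sed above is a GNU sed extension that generalizes beyond odd and even lines: it selects every step-th line, starting at line first. For example, to print every third line:

  $ < lines sed -n '0~3p'
  $ < lines awk 'NR%3==0'
  Line 3
  Line 6
  Line 9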

5.2.1.2 Based on Pattern

Sometimes you want to extract or remove lines based on their contents. Using grep, the canonical command-line tool for filtering lines, we can print every line that matches a certain pattern or regular expression. For example, to extract all the chapter headings from Alice’s Adventures in Wonderland:

  $ grep -i chapter alice.txt
  CHAPTER I. Down the Rabbit-Hole
  CHAPTER II. The Pool of Tears
  CHAPTER III. A Caucus-Race and a Long Tale
  CHAPTER IV. The Rabbit Sends in a Little Bill
  CHAPTER V. Advice from a Caterpillar
  CHAPTER VI. Pig and Pepper
  CHAPTER VII. A Mad Tea-Party
  CHAPTER VIII. The Queen's Croquet-Ground
  CHAPTER IX. The Mock Turtle's Story
  CHAPTER X. The Lobster Quadrille
  CHAPTER XI. Who Stole the Tarts?
  CHAPTER XII. Alice's Evidence

Here, the command-line argument -i means that the matching should be case-insensitive. We can also specify a regular expression. For example, if we only wanted to print the headings that start with The:

  $ grep -E '^CHAPTER (.*)\. The' alice.txt
  CHAPTER II. The Pool of Tears
  CHAPTER IV. The Rabbit Sends in a Little Bill
  CHAPTER VIII. The Queen's Croquet-Ground
  CHAPTER IX. The Mock Turtle's Story
  CHAPTER X. The Lobster Quadrille

Note that you have to specify the -E command-line argument in order to enable extended regular expressions. Without it, grep interprets the pattern as a basic regular expression, in which characters such as ( and ) are matched literally.
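
To see the difference, try the same pattern without -E; the parentheses are then matched literally, so nothing matches:

  $ grep '^CHAPTER (.*)\. The' alice.txt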

5.2.1.3 Based on Randomness

When you’re in the process of formulating your data pipeline and you have a lot of data, debugging your pipeline can be cumbersome. In that case, sampling from the data might be useful. The main purpose of the command-line tool sample (Janssens 2014f) is to get a subset of the data by outputting only a certain percentage of the input on a line-by-line basis.

  $ seq 1000 | sample -r 1% | jq -c '{line: .}'
  {"line":53}
  {"line":119}
  {"line":141}
  {"line":228}
  {"line":464}
  {"line":476}
  {"line":523}
  {"line":657}
  {"line":675}
  {"line":865}
  {"line":948}

Here, every input line has a one percent chance of being forwarded to jq. This percentage could also have been specified as a fraction (1/100) or as a probability (0.01).
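
For example, the following two invocations are equivalent to sampling with -r 1% (because the sampling is random, the selected lines will differ from run to run):

  $ seq 1000 | sample -r 1/100
  $ seq 1000 | sample -r 0.01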

sample has two other purposes which can be useful when you’re debugging. First, it’s possible to add some delay to the output. This comes in handy when the input is a constant stream (for example, the Twitter firehose) and the data comes in too fast to see what’s going on. Second, you can put a timer on sample so that you don’t have to kill the ongoing process manually. To add a delay of 1000 milliseconds (i.e., one second) between each output line and to run for only 5 seconds:

  $ seq 10000 | sample -r 1% -d 1000 -s 5 | jq -c '{line: .}'

In order to prevent unnecessary computation, try to put sample as early as possible in your pipeline (the same argument holds for any command-line tool that reduces the amount of data, like head and tail). Once you’re done debugging, you can simply take it out of the pipeline.
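
As a sketch (where expensive-step is a hypothetical placeholder for any slow command-line tool, not a real command), the first pipeline below feeds the slow step only about 100 lines, whereas the second feeds it all 10,000:

  $ seq 10000 | sample -r 1% | expensive-step    # sees ~100 lines (hypothetical tool)
  $ seq 10000 | expensive-step | sample -r 1%    # sees all 10,000 lines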

5.2.2 Extracting Values

To extract the actual chapter headings from our example earlier, we can take a simple approach by piping the output of grep to cut:

  $ grep -i chapter alice.txt | cut -d' ' -f3-
  Down the Rabbit-Hole
  The Pool of Tears
  A Caucus-Race and a Long Tale
  The Rabbit Sends in a Little Bill
  Advice from a Caterpillar
  Pig and Pepper
  A Mad Tea-Party
  The Queen's Croquet-Ground
  The Mock Turtle's Story
  The Lobster Quadrille
  Who Stole the Tarts?
  Alice's Evidence

Here, each line that’s passed to cut is split on spaces into fields, and then the fields from the third one up to the last one are printed. The total number of fields may differ per input line. With sed we can accomplish the same task in a much more complex manner:

  $ sed -rn 's/^CHAPTER ([IVXLCDM]{1,})\. (.*)$/\2/p' alice.txt > /dev/null

(Since the output is the same as before, it’s omitted here by redirecting it to /dev/null.) This approach uses a regular expression and a back reference. Here, sed also takes over the work done by grep. Such a complex approach is only advisable when a simpler one would not work; for example, if the word chapter were ever part of the text itself and not just used to indicate the start of a new chapter. Of course, there are many intermediate levels of complexity that would also have worked; the point here is to illustrate an extremely strict approach. In practice, the challenge is to find a good balance between complexity and flexibility.
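
To make the difference concrete: a line that merely mentions the word matches the simple grep approach, but not the strict sed expression (which therefore prints nothing here):

  $ echo 'the last chapter was long' | grep -i chapter
  the last chapter was long
  $ echo 'the last chapter was long' | sed -rn 's/^CHAPTER ([IVXLCDM]{1,})\. (.*)$/\2/p'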

It’s worth noting that cut can also split on character positions. This is useful when you want to extract (or remove) the same set of characters per input line:

  $ grep -i chapter alice.txt | cut -c 9-
  I. Down the Rabbit-Hole
  II. The Pool of Tears
  III. A Caucus-Race and a Long Tale
  IV. The Rabbit Sends in a Little Bill
  V. Advice from a Caterpillar
  VI. Pig and Pepper
  VII. A Mad Tea-Party
  VIII. The Queen's Croquet-Ground
  IX. The Mock Turtle's Story
  X. The Lobster Quadrille
  XI. Who Stole the Tarts?
  XII. Alice's Evidence

grep has a great feature, enabled by the -o option, that outputs every match on a separate line:

  $ < alice.txt grep -oE '\w{2,}' | head
  Project
  Gutenberg
  Alice
  Adventures
  in
  Wonderland
  by
  Lewis
  Carroll
  This

But what if we wanted to create a data set of all the words that start with an a and end with an e? Well, of course, there’s a pipeline for that too:

  $ < alice.txt tr '[:upper:]' '[:lower:]' | grep -oE '\w{2,}' |
  > grep -E '^a.*e$' | sort | uniq -c | sort -nr |
  > awk '{print $2","$1}' | header -a word,count | head | csvlook
  |-------------+--------|
  |  word       | count  |
  |-------------+--------|
  |  alice      | 403    |
  |  are        | 73     |
  |  archive    | 13     |
  |  agree      | 11     |
  |  anyone     | 5      |
  |  alone      | 5      |
  |  age        | 4      |
  |  applicable | 3      |
  |  anywhere   | 3      |
  |  alive      | 3      |
  |-------------+--------|

5.2.3 Replacing and Deleting Values

You can use the command-line tool tr, which stands for translate, to replace individual characters. For example, spaces can be replaced by underscores as follows:

  $ echo 'hello world!' | tr ' ' '_'
  hello_world!

If more than one character needs to be replaced, you can combine multiple translations; each character in the first set is replaced by the character at the corresponding position in the second set:

  $ echo 'hello world!' | tr ' !' '_?'
  hello_world?

tr can also be used to delete individual characters by specifying the argument -d:

  $ echo 'hello world!' | tr -d -c 'a-z'
  helloworld

Here, we’ve actually used two more features. First, we’ve specified a set of characters (all lowercase letters). Second, with -c we’ve indicated that the complement of that set should be used. In other words, this command only retains lowercase letters.
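
Without the -c option, the same set would be deleted instead of retained:

  $ echo 'hello world!' | tr -d 'a-z'
   !

We can even use tr to convert our text to uppercase: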

  $ echo 'hello world!' | tr '[a-z]' '[A-Z]'
  HELLO WORLD!
  $ echo 'hello world!' | tr '[:lower:]' '[:upper:]'
  HELLO WORLD!

The latter command is preferable because it also handles non-ASCII characters. If you need to operate on more than individual characters, then you may find sed useful. We’ve already seen an example of sed when we extracted the chapter headings from Alice in Wonderland. Extracting, deleting, and replacing are actually all the same operation in sed; you just specify different regular expressions. For example, to change a word, squeeze repeated whitespace into single spaces, and remove leading whitespace:

  $ echo ' hello     world!' | sed -re 's/hello/bye/;s/\s+/ /g;s/\s+//'
  bye world!

The g at the end of the second substitution stands for global, meaning that the substitution is applied as many times as possible on the same line rather than just once. We do not need it for the third substitution, which removes the leading whitespace and only needs to be applied once. Note that the first and the third substitutions could have been combined into a single one.
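
For example, here’s one way to merge the first and the third substitutions, letting the leading whitespace be consumed as part of the word match:

  $ echo ' hello     world!' | sed -re 's/\s*hello/bye/;s/\s+/ /g'
  bye world!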