5.2 Common Scrub Operations for Plain Text

In this section we describe common scrubbing operations for plain text. Formally, plain text refers to a sequence of human-readable characters and, optionally, some specific types of control characters such as tabs and newlines (see http://www.linfo.org/plain_text.html). Examples include e-books, emails, log files, and source code.

For the purpose of this book, we assume that the plain text contains some data, and that it has no clear tabular structure (like the CSV format) or nested structure (like the JSON and HTML formats). We discuss those formats later in this chapter. Although the operations in this section can also be applied to the CSV, JSON, and XML/HTML formats, keep in mind that the tools then treat the data as plain text.

5.2.1 Filtering Lines

The first scrubbing operation is filtering lines. This means that each line from the input data is evaluated to determine whether it will be passed on as output.

5.2.1.1 Based on Location

The most straightforward way to filter lines is based on their location. This may be useful when you want to inspect, say, the top 10 lines of a file, or when you extract a specific row from the output of another command-line tool. To illustrate how to filter based on location, let’s create a dummy file that contains 10 lines:

  $ seq -f "Line %g" 10 | tee lines
  Line 1
  Line 2
  Line 3
  Line 4
  Line 5
  Line 6
  Line 7
  Line 8
  Line 9
  Line 10

We can print the first 3 lines using either head, sed, or awk:

  $ < lines head -n 3
  $ < lines sed -n '1,3p'
  $ < lines awk 'NR<=3'
  Line 1
  Line 2
  Line 3

Similarly, we can print the last 3 lines using tail (Rubin, MacKenzie, Taylor, et al. 2012):

  $ < lines tail -n 3
  Line 8
  Line 9
  Line 10

You can also use sed and awk for this, but tail is much faster.
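
For completeness, here’s a sketch of how sed and awk can print the last 3 lines. Both have to buffer lines because, unlike tail, they cannot seek to the end of the input, which is also why they are slower (the awk one-liner assumes the input has at least 3 lines):

  $ < lines sed -e :a -e '$q;N;4,$D;ba'
  $ < lines awk '{a[NR]=$0} END {for (i=NR-2; i<=NR; i++) print a[i]}'
  Line 8
  Line 9
  Line 10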

Removing the first 3 lines goes as follows:

  $ < lines tail -n +4
  $ < lines sed '1,3d'
  $ < lines sed -n '1,3!p'
  Line 4
  Line 5
  Line 6
  Line 7
  Line 8
  Line 9
  Line 10

Note that with tail you have to specify the number of the first line you want to keep, which is one more than the number of lines you want to remove; hence the +4.
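
For symmetry, awk can do this too with a simple condition on the line number, producing the same output:

  $ < lines awk 'NR>3'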

Removing the last 3 lines can be done with head:

  $ < lines head -n -3
  Line 1
  Line 2
  Line 3
  Line 4
  Line 5
  Line 6
  Line 7

You can print (or extract) specific lines (4, 5, and 6 in this case) using either sed, awk, or a combination of head and tail:

  $ < lines sed -n '4,6p'
  $ < lines awk '(NR>=4)&&(NR<=6)'
  $ < lines head -n 6 | tail -n 3
  Line 4
  Line 5
  Line 6

Print odd lines with sed by specifying a start and a step, or with awk by using the modulo operator:

  $ < lines sed -n '1~2p'
  $ < lines awk 'NR%2'
  Line 1
  Line 3
  Line 5
  Line 7
  Line 9

Printing even lines works in a similar manner:

  $ < lines sed -n '0~2p'
  $ < lines awk '(NR+1)%2'
  Line 2
  Line 4
  Line 6
  Line 8
  Line 10
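
The first~step address used by sed above is a GNU sed extension that generalizes beyond odd and even lines: it selects every step-th line, starting at line first. For example, to print every third line:

  $ < lines sed -n '0~3p'
  $ < lines awk 'NR%3==0'
  Line 3
  Line 6
  Line 9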

5.2.1.2 Based on Pattern

Sometimes you want to extract or remove lines based on their contents. Using grep, the canonical command-line tool for filtering lines, we can print every line that matches a certain pattern or regular expression. For example, to extract all the chapter headings from Alice’s Adventures in Wonderland:

  $ grep -i chapter alice.txt
  CHAPTER I. Down the Rabbit-Hole
  CHAPTER II. The Pool of Tears
  CHAPTER III. A Caucus-Race and a Long Tale
  CHAPTER IV. The Rabbit Sends in a Little Bill
  CHAPTER V. Advice from a Caterpillar
  CHAPTER VI. Pig and Pepper
  CHAPTER VII. A Mad Tea-Party
  CHAPTER VIII. The Queen's Croquet-Ground
  CHAPTER IX. The Mock Turtle's Story
  CHAPTER X. The Lobster Quadrille
  CHAPTER XI. Who Stole the Tarts?
  CHAPTER XII. Alice's Evidence

Here, the command-line argument -i means that the matching should be case-insensitive. We can also specify a regular expression. For example, if we only wanted to print the headings that start with The:

  $ grep -E '^CHAPTER (.*)\. The' alice.txt
  CHAPTER II. The Pool of Tears
  CHAPTER IV. The Rabbit Sends in a Little Bill
  CHAPTER VIII. The Queen's Croquet-Ground
  CHAPTER IX. The Mock Turtle's Story
  CHAPTER X. The Lobster Quadrille

Note that you have to specify the -E command-line argument in order to enable extended regular expressions. Without it, grep interprets the pattern as a basic regular expression, in which characters such as ( and ) are matched literally.
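
To see the difference, try the same pattern without -E; the parentheses are then matched literally, so nothing matches:

  $ grep '^CHAPTER (.*)\. The' alice.txt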

5.2.1.3 Based on Randomness

When you’re in the process of formulating your data pipeline and you have a lot of data, debugging your pipeline can be cumbersome. In that case, sampling from the data might be useful. The main purpose of the command-line tool sample (Janssens 2014f) is to get a subset of the data by outputting only a certain percentage of the input on a line-by-line basis.

  $ seq 1000 | sample -r 1% | jq -c '{line: .}'
  {"line":53}
  {"line":119}
  {"line":141}
  {"line":228}
  {"line":464}
  {"line":476}
  {"line":523}
  {"line":657}
  {"line":675}
  {"line":865}
  {"line":948}

Here, every input line has a one percent chance of being forwarded to jq. This percentage could also have been specified as a fraction (1/100) or as a probability (0.01).
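
For example, the following two invocations are equivalent to sampling with -r 1% (because the sampling is random, the selected lines will differ from run to run):

  $ seq 1000 | sample -r 1/100
  $ seq 1000 | sample -r 0.01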

sample has two other purposes which can be useful when you’re debugging. First, it’s possible to add some delay to the output. This comes in handy when the input is a constant stream (for example, the Twitter firehose) and the data comes in too fast to see what’s going on. Second, you can put a timer on sample so that you don’t have to kill the ongoing process manually. To add a delay of 1000 milliseconds (i.e., one second) between each output line and to run for only 5 seconds:

  $ seq 10000 | sample -r 1% -d 1000 -s 5 | jq -c '{line: .}'

In order to prevent unnecessary computation, try to put sample as early as possible in your pipeline (the same argument holds for any command-line tool that reduces the amount of data, like head and tail). Once you’re done debugging, you can simply take it out of the pipeline.
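
As a sketch (where expensive-step is a hypothetical placeholder for any slow command-line tool, not a real command), the first pipeline below feeds the slow step only about 100 lines, whereas the second feeds it all 10,000:

  $ seq 10000 | sample -r 1% | expensive-step    # sees ~100 lines (hypothetical tool)
  $ seq 10000 | expensive-step | sample -r 1%    # sees all 10,000 lines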

5.2.2 Extracting Values

To extract the actual chapter headings from our example earlier, we can take a simple approach by piping the output of grep to cut:

  $ grep -i chapter alice.txt | cut -d' ' -f3-
  Down the Rabbit-Hole
  The Pool of Tears
  A Caucus-Race and a Long Tale
  The Rabbit Sends in a Little Bill
  Advice from a Caterpillar
  Pig and Pepper
  A Mad Tea-Party
  The Queen's Croquet-Ground
  The Mock Turtle's Story
  The Lobster Quadrille
  Who Stole the Tarts?
  Alice's Evidence

Here, each line that’s passed to cut is split on spaces into fields, and then the fields from the third one up to the last one are printed. The total number of fields may differ per input line. With sed we can accomplish the same task in a much more complex manner:

  $ sed -rn 's/^CHAPTER ([IVXLCDM]{1,})\. (.*)$/\2/p' alice.txt > /dev/null

(Since the output is the same as before, it’s omitted here by redirecting it to /dev/null.) This approach uses a regular expression and a back reference. Here, sed also takes over the work done by grep. Such a complex approach is only advisable when a simpler one would not work; for example, if the word chapter were ever part of the text itself and not just used to indicate the start of a new chapter. Of course, there are many intermediate levels of complexity that would also have worked; the point here is to illustrate an extremely strict approach. In practice, the challenge is to find a good balance between complexity and flexibility.
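
To make the difference concrete: a line that merely mentions the word matches the simple grep approach, but not the strict sed expression (which therefore prints nothing here):

  $ echo 'the last chapter was long' | grep -i chapter
  the last chapter was long
  $ echo 'the last chapter was long' | sed -rn 's/^CHAPTER ([IVXLCDM]{1,})\. (.*)$/\2/p'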

It’s worth noting that cut can also split on character positions. This is useful when you want to extract (or remove) the same set of characters per input line:

  $ grep -i chapter alice.txt | cut -c 9-
  I. Down the Rabbit-Hole
  II. The Pool of Tears
  III. A Caucus-Race and a Long Tale
  IV. The Rabbit Sends in a Little Bill
  V. Advice from a Caterpillar
  VI. Pig and Pepper
  VII. A Mad Tea-Party
  VIII. The Queen's Croquet-Ground
  IX. The Mock Turtle's Story
  X. The Lobster Quadrille
  XI. Who Stole the Tarts?
  XII. Alice's Evidence

grep has a great feature, enabled by the -o option, that outputs every match on a separate line:

  $ < alice.txt grep -oE '\w{2,}' | head
  Project
  Gutenberg
  Alice
  Adventures
  in
  Wonderland
  by
  Lewis
  Carroll
  This

But what if we wanted to create a data set of all the words that start with an a and end with an e? Well, of course, there’s a pipeline for that too:

  $ < alice.txt tr '[:upper:]' '[:lower:]' | grep -oE '\w{2,}' |
  > grep -E '^a.*e$' | sort | uniq -c | sort -nr |
  > awk '{print $2","$1}' | header -a word,count | head | csvlook
  |-------------+--------|
  |  word       | count  |
  |-------------+--------|
  |  alice      | 403    |
  |  are        | 73     |
  |  archive    | 13     |
  |  agree      | 11     |
  |  anyone     | 5      |
  |  alone      | 5      |
  |  age        | 4      |
  |  applicable | 3      |
  |  anywhere   | 3      |
  |  alive      | 3      |
  |-------------+--------|

5.2.3 Replacing and Deleting Values

You can use the command-line tool tr, which stands for translate, to replace individual characters. For example, spaces can be replaced by underscores as follows:

  $ echo 'hello world!' | tr ' ' '_'
  hello_world!

If more than one character needs to be replaced, you can combine multiple translations; each character in the first set is replaced by the character at the corresponding position in the second set:

  $ echo 'hello world!' | tr ' !' '_?'
  hello_world?

tr can also be used to delete individual characters by specifying the argument -d:

  $ echo 'hello world!' | tr -d -c 'a-z'
  helloworld

Here, we’ve actually used two more features. First, we’ve specified a set of characters (all lowercase letters). Second, with -c we’ve indicated that the complement of that set should be used. In other words, this command only retains lowercase letters.
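
Without the -c option, the same set would be deleted instead of retained:

  $ echo 'hello world!' | tr -d 'a-z'
   !

We can even use tr to convert our text to uppercase: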

  $ echo 'hello world!' | tr '[a-z]' '[A-Z]'
  HELLO WORLD!
  $ echo 'hello world!' | tr '[:lower:]' '[:upper:]'
  HELLO WORLD!

The latter command is preferable because it also handles non-ASCII characters. If you need to operate on more than individual characters, then you may find sed useful. We’ve already seen an example of sed when we extracted the chapter headings from Alice in Wonderland. Extracting, deleting, and replacing are actually all the same operation in sed; you just specify different regular expressions. For example, to change a word, squeeze repeated whitespace into single spaces, and remove leading whitespace:

  $ echo ' hello     world!' | sed -re 's/hello/bye/;s/\s+/ /g;s/\s+//'
  bye world!

The g at the end of the second substitution stands for global, meaning that the substitution is applied as many times as possible on the same line rather than just once. We do not need it for the third substitution, which removes the leading whitespace and only needs to be applied once. Note that the first and the third substitutions could have been combined into a single one.
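
For example, here’s one way to merge the first and the third substitutions, letting the leading whitespace be consumed as part of the word match:

  $ echo ' hello     world!' | sed -re 's/\s*hello/bye/;s/\s+/ /g'
  bye world!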