Chapter 5 Scrubbing Data

Two chapters ago, in Step 1 of the OSEMN model for data science, we looked at how to obtain data from a variety of sources. It’s not uncommon for this data to have missing values, inconsistencies, errors, weird characters, or uninteresting columns. Sometimes we only need a specific portion of the data. And sometimes we need the data to be in a different format. In those cases, we have to scrub, or clean, the data before we can move on to Step 3: Exploring Data.

The data we obtained in Chapter 3 can come in a variety of formats. The most common ones are plain text, CSV, JSON, and HTML/XML. Since most command-line tools operate on only one format, it is worthwhile to be able to convert data from one format to another.
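As a small illustration of such a conversion, a tab-separated file whose fields contain no embedded delimiters can be turned into CSV with tr alone (the filename here is made up for the example):

```shell
$ printf 'name\tcount\nfoo\t2\nbar\t1\n' > data.tsv
$ tr '\t' ',' < data.tsv
name,count
foo,2
bar,1
```

This only works when no field itself contains a tab or a comma; for anything messier, a CSV-aware tool is the safer choice.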

CSV, which is the main format we’re working with in this chapter, is actually not the easiest format to work with. Many CSV data sets are broken or incompatible with each other because, unlike XML and JSON, CSV has no single standard syntax.

Once our data is in the format we want it to be, we can apply common scrubbing operations. These include filtering, replacing, and merging data. The command line is especially well-suited for these kinds of operations, as there exist many powerful command-line tools that are optimized for handling large amounts of data. Tools that we’ll discuss in this chapter include classic ones such as cut (Ihnat, MacKenzie, and Meyering 2012) and sed (Fenlason et al. 2012), and newer ones such as jq (Dolan 2014) and csvgrep (Groskopf 2014e).
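As a first taste of these operations using only the classic tools, the following pipeline selects two columns with cut, filters rows with grep, and replaces a value with sed (the data set is made up for illustration):

```shell
$ printf 'id,name,city\n1,alice,nyc\n2,bob,sf\n3,carol,nyc\n' |
> cut -d, -f2,3 | grep ',nyc' | sed 's/nyc/new york/'
alice,new york
carol,new york
```

Note that this treats the CSV as plain text, which is fine here only because no field contains a quoted comma.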

The scrubbing tasks that we discuss in this chapter not only apply to the input data. Sometimes, we also need to reformat the output of some command-line tools. For example, to transform the output of uniq -c to a CSV data set, we could use awk (Brennan 1994) and header:

  $ echo -e 'foo\nbar\nfoo' | sort | uniq -c | sort -nr
  2 foo
  1 bar
  $ echo -e 'foo\nbar\nfoo' | sort | uniq -c | sort -nr |
  > awk '{print $2","$1}' | header -a value,count
  value,count
  foo,2
  bar,1

If your data requires functionality beyond what is offered by (a combination of) these command-line tools, you can use csvsql. This is a new command-line tool that allows you to perform SQL queries directly on CSV files. And remember, if after reading this chapter you still need more flexibility, you’re free to use R, Python, or whatever programming language you prefer.

The command-line tools will be introduced on a need-to-use basis. You will notice that we can sometimes use the same command-line tool to perform multiple operations, or, vice versa, multiple command-line tools to perform the same operation. This chapter is structured more like a cookbook, where the focus is on the problems, or recipes, rather than on the command-line tools themselves.