5.5 Common Scrub Operations for CSV

5.5.1 Extracting and Reordering Columns

Columns can be extracted and reordered using the command-line tool: csvcut (Groskopf 2014g). For example, to keep only the columns in the Iris data set that contain numerical values and reorder the middle two columns:

  1. $ < iris.csv csvcut -c sepal_length,petal_length,sepal_width,petal_width |
  2. > head -n 5 | csvlook
  3. |---------------+--------------+-------------+--------------|
  4. | sepal_length | petal_length | sepal_width | petal_width |
  5. |---------------+--------------+-------------+--------------|
  6. | 5.1 | 1.4 | 3.5 | 0.2 |
  7. | 4.9 | 1.4 | 3.0 | 0.2 |
  8. | 4.7 | 1.3 | 3.2 | 0.2 |
  9. | 4.6 | 1.5 | 3.1 | 0.2 |
  10. |---------------+--------------+-------------+--------------|

Alternatively, we can also specify the columns we want to leave out with -C, which stands for complement:

  1. $ < iris.csv csvcut -C species | head -n 5 | csvlook
  2. |---------------+-------------+--------------+--------------|
  3. | sepal_length | sepal_width | petal_length | petal_width |
  4. |---------------+-------------+--------------+--------------|
  5. | 5.1 | 3.5 | 1.4 | 0.2 |
  6. | 4.9 | 3.0 | 1.4 | 0.2 |
  7. | 4.7 | 3.2 | 1.3 | 0.2 |
  8. | 4.6 | 3.1 | 1.5 | 0.2 |
  9. |---------------+-------------+--------------+--------------|

Here, the included columns are kept in the same order. Instead of the column names, you can also specify the indices of the columns, which start at 1. This allows you to, for example, select only the odd columns (should you ever need it!):

  1. $ echo 'a,b,c,d,e,f,g,h,i\n1,2,3,4,5,6,7,8,9' |
  2. > csvcut -c $(seq 1 2 9 | paste -sd,)
  3. a,c,e,g,i
  4. 1,3,5,7,9

If you’re certain that there are no comma’s in any of the values, then you can also use cut to extract columns. Be aware that cut does not reorder columns, as is demonstrated with the following command:

  1. $ echo 'a,b,c,d,e,f,g,h,i\n1,2,3,4,5,6,7,8,9' | cut -d, -f 5,1,3
  2. a,c,e
  3. 1,3,5

As you can see, it does not matter in which order we specify the columns with -f, with cut they will always appear in the original order. For completeness, let’s also take a look at the SQL approach for extracting and reordering the numerical columns of the Iris data set:

  1. $ < iris.csv csvsql --query "SELECT sepal_length, petal_length, "\
  2. > "sepal_width, petal_width FROM stdin" | head -n 5 | csvlook
  3. |---------------+--------------+-------------+--------------|
  4. | sepal_length | petal_length | sepal_width | petal_width |
  5. |---------------+--------------+-------------+--------------|
  6. | 5.1 | 1.4 | 3.5 | 0.2 |
  7. | 4.9 | 1.4 | 3.0 | 0.2 |
  8. | 4.7 | 1.3 | 3.2 | 0.2 |
  9. | 4.6 | 1.5 | 3.1 | 0.2 |
  10. |---------------+--------------+-------------+--------------|

5.5.2 Filtering Lines

The difference between filtering lines in a CSV file as opposed to a plain text file is that you may want to base this filtering on values in a certain column, only. Filtering on location is essentially the same, but you have to take into account that the first line of a CSV file is usually the header. Remember that you can always use the body command-line tool if you want to keep the header:

  1. $ seq 5 | sed -n '3,5p'
  2. 3
  3. 4
  4. 5
  5. $ seq 5 | header -a count | body sed -n '3,5p'
  6. count
  7. 3
  8. 4
  9. 5

When it comes down to filtering on a certain pattern within a certain column, we can use either csvgrep, awk, or, of course, csvsql. For example, to exclude all the bills of which the party size was 4 or less:

  1. $ csvgrep -c size -i -r "[1-4]" tips.csv | csvlook
  2. |--------+------+--------+--------+------+--------+-------|
  3. | bill | tip | sex | smoker | day | time | size |
  4. |--------+------+--------+--------+------+--------+-------|
  5. | 29.8 | 4.2 | Female | No | Thur | Lunch | 6 |
  6. | 34.3 | 6.7 | Male | No | Thur | Lunch | 6 |
  7. | 41.19 | 5.0 | Male | No | Thur | Lunch | 5 |
  8. | 27.05 | 5.0 | Female | No | Thur | Lunch | 6 |
  9. | 29.85 | 5.14 | Female | No | Sun | Dinner | 5 |
  10. | 48.17 | 5.0 | Male | No | Sun | Dinner | 6 |
  11. | 20.69 | 5.0 | Male | No | Sun | Dinner | 5 |
  12. | 30.46 | 2.0 | Male | Yes | Sun | Dinner | 5 |
  13. | 28.15 | 3.0 | Male | Yes | Sat | Dinner | 5 |
  14. |--------+------+--------+--------+------+--------+-------|

Both awk and csvsql can also do numerical comparisons. For example, to get all the bills above 40 USD on a Saturday or a Sunday:

  1. $ < tips.csv awk -F, '($1 > 40.0) && ($5 ~ /S/)' | csvlook -I
  2. |--------+------+--------+-----+-----+--------+----|
  3. | 48.27 | 6.73 | Male | No | Sat | Dinner | 4 |
  4. |--------+------+--------+-----+-----+--------+----|
  5. | 44.3 | 2.5 | Female | Yes | Sat | Dinner | 3 |
  6. | 48.17 | 5.0 | Male | No | Sun | Dinner | 6 |
  7. | 50.81 | 10.0 | Male | Yes | Sat | Dinner | 3 |
  8. | 45.35 | 3.5 | Male | Yes | Sun | Dinner | 3 |
  9. | 40.55 | 3.0 | Male | Yes | Sun | Dinner | 2 |
  10. | 48.33 | 9.0 | Male | No | Sat | Dinner | 4 |
  11. |--------+------+--------+-----+-----+--------+----|

The csvsql solution is more verbose but is also more robust as it uses the names of the columns instead of their indexes:

  1. $ < tips.csv csvsql --query "SELECT * FROM stdin "\
  2. > "WHERE bill > 40 AND day LIKE '%S%'" | csvlook -I
  3. |--------+------+--------+--------+-----+--------+-------|
  4. | bill | tip | sex | smoker | day | time | size |
  5. |--------+------+--------+--------+-----+--------+-------|
  6. | 48.27 | 6.73 | Male | 0 | Sat | Dinner | 4 |
  7. | 44.3 | 2.5 | Female | 1 | Sat | Dinner | 3 |
  8. | 48.17 | 5.0 | Male | 0 | Sun | Dinner | 6 |
  9. | 50.81 | 10.0 | Male | 1 | Sat | Dinner | 3 |
  10. | 45.35 | 3.5 | Male | 1 | Sun | Dinner | 3 |
  11. | 40.55 | 3.0 | Male | 1 | Sun | Dinner | 2 |
  12. | 48.33 | 9.0 | Male | 0 | Sat | Dinner | 4 |
  13. |--------+------+--------+--------+-----+--------+-------|

It should be noted that the flexibility of the WHERE clause in an SQL query cannot be easily matched with other command-line tools, as SQL can operate on dates and sets, and form complex combinations of clauses.

5.5.3 Merging Columns

Merging columns is useful for when the values of interest are spread over multiple columns. This may happen with dates (where year, month, and day could be separate columns) or names (where the first name and last name are separate columns). Let’s consider the second situation.

The input CSV is a list of contemporary composers. Imagine our task is to combine the first name and the last name into a full name. We’ll present four different approaches for this task: sed, awk, cols + tr, and csvsql. Let’s have a look at the input CSV:

  1. $ < names.csv csvlook -I
  2. |-----+-----------+------------+-------|
  3. | id | last_name | first_name | born |
  4. |-----+-----------+------------+-------|
  5. | 1 | Williams | John | 1932 |
  6. | 2 | Elfman | Danny | 1953 |
  7. | 3 | Horner | James | 1953 |
  8. | 4 | Shore | Howard | 1946 |
  9. | 5 | Zimmer | Hans | 1957 |
  10. |-----+-----------+------------+-------|

The first approach, sed, uses two statements. The first is to replace the header and the second is a regular expression with back references applied to the second row onwards:

  1. $ < names.csv sed -re '1s/.*/id,full_name,born/g;'\
  2. > '2,$s/(.*),(.*),(.*),(.*)/\1,\3 \2,\4/g' | csvlook -I
  3. |-----+---------------+-------|
  4. | id | full_name | born |
  5. |-----+---------------+-------|
  6. | 1 | John Williams | 1932 |
  7. | 2 | Danny Elfman | 1953 |
  8. | 3 | James Horner | 1953 |
  9. | 4 | Howard Shore | 1946 |
  10. | 5 | Hans Zimmer | 1957 |
  11. |-----+---------------+-------|

The awk approach looks as follows:

  1. $ < names.csv awk -F, 'BEGIN{OFS=","; print "id,full_name,born"}'\
  2. > '{if(NR > 1) {print $1,$3" "$2,$4}}' | csvlook -I
  3. |-----+---------------+-------|
  4. | id | full_name | born |
  5. |-----+---------------+-------|
  6. | 1 | John Williams | 1932 |
  7. | 2 | Danny Elfman | 1953 |
  8. | 3 | James Horner | 1953 |
  9. | 4 | Howard Shore | 1946 |
  10. | 5 | Hans Zimmer | 1957 |
  11. |-----+---------------+-------|

The cols approach in combination with tr:

  1. $ < names.csv | cols -c first_name,last_name tr \",\" \" \" |
  2. > header -r full_name,id,born | csvcut -c id,full_name,born | csvlook -I
  3. |-----+---------------+-------|
  4. | id | full_name | born |
  5. |-----+---------------+-------|
  6. | 1 | John Williams | 1932 |
  7. | 2 | Danny Elfman | 1953 |
  8. | 3 | James Horner | 1953 |
  9. | 4 | Howard Shore | 1946 |
  10. | 5 | Hans Zimmer | 1957 |
  11. |-----+---------------+-------|

Please note that csvsql employ SQLite as the database to execute the query and that || stands for concatenation:

  1. $ < names.csv csvsql --query "SELECT id, first_name || ' ' || last_name "\
  2. > "AS full_name, born FROM stdin" | csvlook -I
  3. |-----+-----------------------+-------|
  4. | id | full_name | born |
  5. |-----+-----------------------+-------|
  6. | 1 | John Williams | 1932 |
  7. | 2 | Danny Elfman | 1953 |
  8. | 3 | James Horner | 1953 |
  9. | 4 | Howard Shore | 1946 |
  10. | 5 | Hans Zimmer | 1957 |
  11. |-----+-----------------------+-------|

What if last_name would contain a comma? Let’s have a look at the raw input CSV for clarity sake:

  1. $ cat names-comma.csv
  2. id,last_name,first_name,born
  3. 1,Williams,John,1932
  4. 2,Elfman,Danny,1953
  5. 3,Horner,James,1953
  6. 4,Shore,Howard,1946
  7. 5,Zimmer,Hans,1957
  8. 6,"Beethoven, van",Ludwig,1770

Well, it appears that the first three approaches fail; all in different ways. Only csvsql is able to combine first_name and full_name:

  1. $ < names-comma.csv sed -re '1s/.*/id,full_name,born/g;'\
  2. > '2,$s/(.*),(.*),(.*),(.*)/\1,\3 \2,\4/g' | tail -n 1
  3. 6,"Beethoven,Ludwig van",1770
  1. $ < names-comma.csv awk -F, 'BEGIN{OFS=","; print "id,full_name,born"}'\
  2. > '{if(NR > 1) {print $1,$3" "$2,$4}}' | tail -n 1
  3. 6, van" "Beethoven,Ludwig
  1. $ < names-comma.csv | cols -c first_name,last_name tr \",\" \" \" |
  2. > header -r full_name,id,born | csvcut -c id,full_name,born | tail -n 1
  3. 6,"Ludwig ""Beethoven van""",1770
  1. $ < names-comma.csv csvsql --query "SELECT id, first_name || ' ' || last_name"\
  2. > " AS full_name, born FROM stdin" | tail -n 1
  3. 6,"Ludwig Beethoven, van",1770
  1. $ < names-comma.csv Rio -e 'df$full_name <- paste(df$first_name, df$last_name);'\
  2. > 'df[c("id","full_name","born")]' | tail -n 1
  3. 6,"Ludwig Beethoven, van",1770

Wait a minute! What’s that last command? Is that R? Well, as a matter of fact, it is. It’s R code evaluated through a command-line tool called Rio (Janssens 2014e). All that we can say at this moment, is that also this approach succeeds at merging the two columns. We’ll discuss this nifty command-line tool later.

5.5.4 Combining Multiple CSV Files

5.5.4.1 Concatenate Vertically

Vertical concatenation may be necessary in cases where you have, for example, a data set which is generated on a daily basis, or where each data set represents a different, say, market or product. Let’s simulate the latter by splitting up our beloved Iris data set into three CSV files, so that we have something to combine again. We’ll use fieldsplit (Hinds et al. 2010), which is part of the CRUSH suite of command-line tools:

  1. $ < iris.csv fieldsplit -d, -k -F species -p . -s .csv

Here, the command-line arguments specify: the delimiter (-d), that we want to keep the header in each file (-k), the column whose values dictate the possible output files (-F), the relative output path (-p), and the filename suffix (-s), respectively. Because the species column in the Iris data set contains three different values, we end up with three CSV files, each with 50 data points and a header:

  1. $ wc -l Iris-*.csv
  2. 51 Iris-setosa.csv
  3. 51 Iris-versicolor.csv
  4. 51 Iris-virginica.csv
  5. 153 total

You could just concatenate the files back using cat and removing the headers of all but the first file using header -d as follows:

  1. $ cat Iris-setosa.csv <(< Iris-versicolor.csv header -d) \
  2. > <(< Iris-virginica.csv header -d) | sed -n '1p;49,54p' | csvlook
  3. |---------------+-------------+--------------+-------------+------------------|
  4. | sepal_length | sepal_width | petal_length | petal_width | species |
  5. |---------------+-------------+--------------+-------------+------------------|
  6. | 4.6 | 3.2 | 1.4 | 0.2 | Iris-setosa |
  7. | 5.3 | 3.7 | 1.5 | 0.2 | Iris-setosa |
  8. | 5.0 | 3.3 | 1.4 | 0.2 | Iris-setosa |
  9. | 7.0 | 3.2 | 4.7 | 1.4 | Iris-versicolor |
  10. | 6.4 | 3.2 | 4.5 | 1.5 | Iris-versicolor |
  11. | 6.9 | 3.1 | 4.9 | 1.5 | Iris-versicolor |
  12. |---------------+-------------+--------------+-------------+------------------|

Note that we’re merely using sed to only print the header and the first three body rows that belonged to the second file in order to illustrate success. While this method works, it’s easier (and less prone to errors) to use csvstack (Groskopf 2014h):

  1. $ csvstack Iris-*.csv | sed -n '1p;49,54p' | csvlook
  2. |---------------+-------------+--------------+-------------+------------------|
  3. | sepal_length | sepal_width | petal_length | petal_width | species |
  4. |---------------+-------------+--------------+-------------+------------------|
  5. | 4.6 | 3.2 | 1.4 | 0.2 | Iris-setosa |
  6. | 5.3 | 3.7 | 1.5 | 0.2 | Iris-setosa |
  7. | 5.0 | 3.3 | 1.4 | 0.2 | Iris-setosa |
  8. | 7.0 | 3.2 | 4.7 | 1.4 | Iris-versicolor |
  9. | 6.4 | 3.2 | 4.5 | 1.5 | Iris-versicolor |
  10. | 6.9 | 3.1 | 4.9 | 1.5 | Iris-versicolor |
  11. |---------------+-------------+--------------+-------------+------------------|

If the species column did not exist, you can create a new column based on the filename using csvstack:

  1. $ csvstack Iris-*.csv -n species --filenames

Alternatively, you could specify the group names using -g:

  1. $ csvstack Iris-*.csv -n class -g a,b,c | csvcut -C species |
  2. > sed -n '1p;49,54p' | csvlook
  3. |--------+--------------+-------------+--------------+--------------|
  4. | class | sepal_length | sepal_width | petal_length | petal_width |
  5. |--------+--------------+-------------+--------------+--------------|
  6. | a | 4.6 | 3.2 | 1.4 | 0.2 |
  7. | a | 5.3 | 3.7 | 1.5 | 0.2 |
  8. | a | 5.0 | 3.3 | 1.4 | 0.2 |
  9. | b | 7.0 | 3.2 | 4.7 | 1.4 |
  10. | b | 6.4 | 3.2 | 4.5 | 1.5 |
  11. | b | 6.9 | 3.1 | 4.9 | 1.5 |
  12. |--------+--------------+-------------+--------------+--------------|

The new column class is added at the front. If you’d like to change the order you can use csvcut as discussed earlier in this section.

5.5.4.2 Concatenate Horizontally

Let’s say you have three CSV files that want to put side by side. We use tee (Parker, Stallman, and MacKenzie 2012) to save the result of csvcut in the middle of the pipeline:

  1. $ < data/tips.csv csvcut -c bill,tip | tee data/bills.csv | head -n 3 | csvlook
  2. |--------+-------|
  3. | bill | tip |
  4. |--------+-------|
  5. | 16.99 | 1.01 |
  6. | 10.34 | 1.66 |
  7. |--------+-------|
  8. $ < data/tips.csv csvcut -c day,time | tee data/datetime.csv |
  9. > head -n 3 | csvlook -I
  10. |------+---------|
  11. | day | time |
  12. |------+---------|
  13. | Sun | Dinner |
  14. | Sun | Dinner |
  15. |------+---------|
  16. $ < data/tips.csv csvcut -c sex,smoker,size | tee data/customers.csv |
  17. > head -n 3 | csvlook
  18. |---------+--------+-------|
  19. | sex | smoker | size |
  20. |---------+--------+-------|
  21. | Female | No | 2 |
  22. | Male | No | 3 |
  23. |---------+--------+-------|

Assuming that the rows line up, you can simply paste (Ihnat and MacKenzie 2012) the files together:

  1. $ paste -d, data/{bills,customers,datetime}.csv | head -n 3 | csvlook -I
  2. |--------+------+--------+--------+------+-----+---------|
  3. | bill | tip | sex | smoker | size | day | time |
  4. |--------+------+--------+--------+------+-----+---------|
  5. | 16.99 | 1.01 | Female | No | 2 | Sun | Dinner |
  6. | 10.34 | 1.66 | Male | No | 3 | Sun | Dinner |
  7. |--------+------+--------+--------+------+-----+---------|

Here, the command-line argument -d instructs paste to use a comma as the delimiter.

5.5.4.3 Joining

Sometimes data cannot simply by combined by vertical or horizontal concatenation. In some cases, especially in relational databases, the data is spread over multiple tables (or files) in order to minimize redundancy. Imagine we wanted to extend the Iris data set with more information about the three types of Iris flowers, namely the USDA identifier. It so happens that we have separate CSV file with these identifiers:

  1. $ csvlook irismeta.csv
  2. |------------------+----------------------------------------------+----------|
  3. | species | wikipedia_url | usda_id |
  4. |------------------+----------------------------------------------+----------|
  5. | Iris-versicolor | http://en.wikipedia.org/wiki/Iris_versicolor | IRVE2 |
  6. | Iris-virginica | http://en.wikipedia.org/wiki/Iris_virginica | IRVI |
  7. | Iris-setosa | | IRSE |
  8. |------------------+----------------------------------------------+----------|

What this data set and the Iris data set have in common is the species column. We can use csvjoin (Groskopf 2014i) to join the two data sets:

  1. $ csvjoin -c species iris.csv irismeta.csv | csvcut -c sepal_length,\
  2. > sepal_width,species,usda_id | sed -n '1p;49,54p' | csvlook
  3. |---------------+-------------+-----------------+----------|
  4. | sepal_length | sepal_width | species | usda_id |
  5. |---------------+-------------+-----------------+----------|
  6. | 4.6 | 3.2 | Iris-setosa | IRSE |
  7. | 5.3 | 3.7 | Iris-setosa | IRSE |
  8. | 5.0 | 3.3 | Iris-setosa | IRSE |
  9. | 7.0 | 3.2 | Iris-versicolor | IRVE2 |
  10. | 6.4 | 3.2 | Iris-versicolor | IRVE2 |
  11. | 6.9 | 3.1 | Iris-versicolor | IRVE2 |
  12. |---------------+-------------+-----------------+----------|

Of course we can also use the SQL approach using csvsql, which is, as per usual, a bit longer (but potentially much more flexible):

  1. $ csvsql --query 'SELECT i.sepal_length, i.sepal_width, i.species, m.usda_id '\
  2. > 'FROM iris i JOIN irismeta m ON (i.species = m.species)' \
  3. > iris.csv irismeta.csv | sed -n '1p;49,54p' | csvlook
  4. |---------------+-------------+-----------------+----------|
  5. | sepal_length | sepal_width | species | usda_id |
  6. |---------------+-------------+-----------------+----------|
  7. | 4.6 | 3.2 | Iris-setosa | IRSE |
  8. | 5.3 | 3.7 | Iris-setosa | IRSE |
  9. | 5.0 | 3.3 | Iris-setosa | IRSE |
  10. | 7.0 | 3.2 | Iris-versicolor | IRVE2 |
  11. | 6.4 | 3.2 | Iris-versicolor | IRVE2 |
  12. | 6.9 | 3.1 | Iris-versicolor | IRVE2 |
  13. |---------------+-------------+-----------------+----------|