5.3 Working with CSV

5.3.1 Bodies and Headers and Columns, Oh My!

The command-line tools that we’ve used to scrub plain text, such as tr and grep, cannot always be applied to CSV. The reason is that these command-line tools have no notion of headers, bodies, and columns. What if we wanted to filter lines using grep but always include the header in the output? Or what if we only wanted to uppercase the values of a specific column using tr and leave the other columns untouched? There are multi-step workarounds for this, but they are very cumbersome. We have something better. In order to leverage ordinary command-line tools for CSV, we’d like to introduce you to three command-line tools, aptly named: body (Janssens 2014a), header (Janssens 2014c), and cols (Janssens 2014b).

Let’s start with the first command-line tool, body. With body you can apply any command-line tool to the body of a CSV file, that is, everything excluding the header. For example:

  $ echo -e "value\n7\n2\n5\n3" | body sort -n
  value
  2
  3
  5
  7
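The first question posed above, filtering with grep while always keeping the header, now has a one-line answer. Since body may not be on your path, the sketch below defines a stand-in shell function with the same three-line logic (a hypothetical function name, assuming a one-row header):

```shell
# Stand-in for the body tool (hypothetical function; the real tool is a
# standalone script): pass the header through, apply the command to the rest.
body() {
  IFS= read -r header
  printf '%s\n' "$header"
  "$@"
}

# Filter rows with grep while always keeping the header:
printf 'value\n7\n2\n5\n3\n' | body grep -v '^2$'
```

The header line value is printed unconditionally; grep only ever sees the four data rows.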

It assumes that the header of the CSV file only spans one row. Here’s the source code for completeness:

  #!/usr/bin/env bash
  IFS= read -r header
  printf '%s\n' "$header"
  "$@"

It works like this:

  • Take one line from standard input and store it in a variable named $header.
  • Print out the header.
  • Execute all the command-line arguments passed to body on the remaining data on standard input.

Here’s another example. Imagine that we count the lines of the following CSV file:
  $ seq 5 | header -a count
  count
  1
  2
  3
  4
  5

With wc -l, we can count the number of lines, including the header:

  $ seq 5 | header -a count | wc -l
  6

If we only want to consider the lines in the body (so everything except the header), we simply add body:

  $ seq 5 | header -a count | body wc -l
  count
  5

Note that the header is not passed to wc -l; it is printed again in the output by body itself.

The second command-line tool, header, allows us, as the name implies, to manipulate the header of a CSV file. Its complete source code is as follows:

  #!/usr/bin/env bash
  get_header () {
      for i in $(seq $NUMROWS); do
          IFS= read -r LINE
          OLDHEADER="${OLDHEADER}${LINE}\n"
      done
  }

  print_header () {
      echo -ne "$1"
  }

  print_body () {
      cat
  }

  OLDHEADER=
  NUMROWS=1

  while getopts "dn:ha:r:e:" OPTION
  do
      case $OPTION in
          n)
              NUMROWS=$OPTARG
              ;;
          a)
              print_header "$OPTARG\n"
              print_body
              exit 1
              ;;
          d)
              get_header
              print_body
              exit 1
              ;;
          r)
              get_header
              print_header "$OPTARG\n"
              print_body
              exit 1
              ;;
          e)
              get_header
              print_header "$(echo -ne $OLDHEADER | eval $OPTARG)\n"
              print_body
              exit 1
              ;;
          h)
              usage
              exit 1
              ;;
      esac
  done

  get_header
  print_header "$OLDHEADER"

If no arguments are provided, the header of the CSV file is printed:

  $ < tips.csv header
  bill,tip,sex,smoker,day,time,size

This is the same as head -n 1. If the header spans more than one row, which is not recommended, you can specify -n 2. We can also add a header to a CSV file:

  $ seq 5 | header -a count
  count
  1
  2
  3
  4
  5

This is equivalent to echo "count" | cat - <(seq 5). Deleting a header is done with the -d command-line argument:

  $ < iris.csv header -d | head
  5.1,3.5,1.4,0.2,Iris-setosa
  4.9,3.0,1.4,0.2,Iris-setosa
  4.7,3.2,1.3,0.2,Iris-setosa
  4.6,3.1,1.5,0.2,Iris-setosa
  5.0,3.6,1.4,0.2,Iris-setosa
  5.4,3.9,1.7,0.4,Iris-setosa
  4.6,3.4,1.4,0.3,Iris-setosa
  5.0,3.4,1.5,0.2,Iris-setosa
  4.4,2.9,1.4,0.2,Iris-setosa
  4.9,3.1,1.5,0.1,Iris-setosa

This is similar to tail -n +2, but it’s a bit easier to remember. Replacing a header, which, as the source code above shows, amounts to first deleting the old header and then adding a new one, is accomplished by specifying -r. Here, we combine it with body:

  $ seq 5 | header -a line | body wc -l | header -r count
  count
  5
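As the case statement in the source code shows, -r is nothing more than -d followed by -a. A minimal sketch with stand-in shell functions (hypothetical names, assuming a one-row header) makes the equivalence concrete:

```shell
# Stand-ins for header -d and header -a (hypothetical functions):
header_d() { IFS= read -r _; cat; }        # drop the first line
header_a() { printf '%s\n' "$1"; cat; }    # prepend a new header line

# Replacing the header "line" with "count", as header -r count would:
printf 'line\n5\n' | header_d | header_a count
```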

And last but not least, we can apply a command to just the header, similar to what the body command-line tool does to the body:

  $ seq 5 | header -a line | header -e "tr '[a-z]' '[A-Z]'"
  LINE
  1
  2
  3
  4
  5
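Under the hood, -e reads the header, pipes it through the given command, and then passes the body through untouched. A stand-in sketch (hypothetical function name, assuming a one-row header):

```shell
# Stand-in for header -e (hypothetical function): apply the given command to
# the header line only; the body passes through untouched.
header_e() {
  IFS= read -r h
  printf '%s\n' "$h" | eval "$@"
  cat
}

printf 'line\n1\n2\n3\n4\n5\n' | header_e "tr '[a-z]' '[A-Z]'"
```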

The third command-line tool is called cols, which is similar to header and body in that it allows you to apply a certain command to only a subset of the columns. The code is as follows:

  #!/usr/bin/env bash
  ARG="$1"        # -c or -C, as understood by csvcut
  shift
  COLUMNS="$1"    # the columns to apply the command to
  shift
  EXPR="$@"       # the command itself
  DIRTMP=$(mktemp -d)
  mkfifo $DIRTMP/other_columns
  # Send a copy of the input into the named pipe, apply EXPR to the selected
  # columns, and paste the result next to the remaining columns. ${ARG~~}
  # toggles the case of the flag (-c becomes -C and vice versa), so that the
  # second csvcut selects exactly the other columns.
  tee $DIRTMP/other_columns | csvcut $ARG $COLUMNS | ${EXPR} |
      paste -d, - <(csvcut ${ARG~~} $COLUMNS $DIRTMP/other_columns)
  rm -rf $DIRTMP

For example, if we wanted to uppercase the values in the day column in the tips data set (without affecting the other columns and the header), we would use cols in combination with body, as follows:

  $ < tips.csv cols -c day body "tr '[a-z]' '[A-Z]'" | head -n 5 | csvlook -I
  |------+-------+------+--------+--------+--------+-------|
  |  day |  bill |  tip | sex    | smoker | time   | size  |
  |------+-------+------+--------+--------+--------+-------|
  |  SUN | 16.99 |  1.01| Female | No     | Dinner | 2     |
  |  SUN | 10.34 |  1.66| Male   | No     | Dinner | 3     |
  |  SUN | 21.01 |  3.5 | Male   | No     | Dinner | 3     |
  |  SUN | 23.68 |  3.31| Male   | No     | Dinner | 2     |
  |------+-------+------+--------+--------+--------+-------|
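The named-pipe trick in the cols source can be studied in isolation with plain cut standing in for csvcut (an assumption; cut selects by position and does not handle CSV quoting). Here column 1 is uppercased while column 2 passes through untouched:

```shell
# Uppercase the first column only, using the same tee/mkfifo/paste pattern as
# cols. cut stands in for csvcut, so this sketch ignores CSV quoting.
uppercase_first_column() {
  dir=$(mktemp -d)
  mkfifo "$dir/rest"
  tee "$dir/rest" |
    cut -d, -f1 | tr '[:lower:]' '[:upper:]' |
    paste -d, - <(cut -d, -f2- "$dir/rest")
  rm -rf "$dir"
}

printf 'a,x\nb,y\n' | uppercase_first_column
```

The named pipe lets tee feed the untouched columns to the second cut while the first cut and tr process the selected column concurrently, so the input only has to be read once.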

Please note that passing multiple command-line tools and arguments as the command to header -e, body, and cols can lead to tricky quoting situations. If you ever run into such problems, it is best to create a separate command-line tool for this and pass that as the command.

In conclusion, while it is generally preferable to use command-line tools which are specifically made for CSV data, body, header, and cols also allow you to apply the classic command-line tools to CSV files if needed.

5.3.2 Performing SQL Queries on CSV

In case the command-line tools mentioned in this chapter do not provide enough flexibility, there is another approach to scrubbing your data from the command line. The command-line tool csvsql (Groskopf 2014f) allows you to execute SQL queries directly on CSV files. As you may know, SQL is a very powerful language for defining data-scrubbing operations; it is a very different approach from chaining individual command-line tools.

If your data originally comes from a relational database, then, if possible, execute SQL queries on that database and extract the data as CSV afterwards. As discussed in Chapter 3, you can use the command-line tool sql2csv for this. First exporting the data to a CSV file and only then applying SQL is not only slower, but also risks the column types not being correctly inferred from the CSV data.

In the scrubbing tasks below, we’ll include several solutions that involve csvsql. The basic command is this:

  $ seq 5 | header -a value | csvsql --query "SELECT SUM(value) AS sum FROM stdin"
  sum
  15

If you pass standard input to csvsql, the table is named stdin. The column types are automatically inferred from the data. As you will see later, in the section on combining CSV files, you can also specify multiple CSV files. Please keep in mind that csvsql employs the SQLite dialect of SQL. While SQL is generally more verbose than the other solutions, it is also much more flexible. If you already know how to tackle a scrubbing problem with SQL, then there’s no shame in using it from the command line!
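To see roughly what csvsql does under the hood, you can reproduce the query above with the sqlite3 command-line shell directly (an assumption: the sqlite3 CLI with .mode csv and .import is installed; csvsql itself loads the CSV into a SQLite table in much the same way):

```shell
# Reproduce the csvsql example with plain sqlite3: import the CSV into an
# in-memory table named stdin and run the same query.
tmpcsv=$(mktemp)
{ echo value; seq 5; } > "$tmpcsv"

printf '.mode csv\n.import %s stdin\nSELECT SUM(value) AS sum FROM stdin;\n' "$tmpcsv" |
  sqlite3 :memory:

rm -f "$tmpcsv"
```

In csv mode, .import into a non-existing table uses the first row as column names, which is exactly how the header row becomes the column value here.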