7.3 Computing Descriptive Statistics

7.3.1 csvstat

The command-line tool csvstat gives a lot of information. For each feature (column), it shows:

  • The data type in Python terminology (see Table 7-1 for a comparison between Python and SQL data types).
  • Whether it has any missing values (nulls).
  • The number of unique values.
  • Various descriptive statistics (maximum, minimum, sum, mean, standard deviation, and median) for those features for which it is appropriate.We invoke csvstat as follows:
  1. $ csvstat data/datatypes.csv
  2. 1. a
  3. <type 'int'>
  4. Nulls: False
  5. Values: 2, 66, 42
  6. 2. b
  7. <type 'float'>
  8. Nulls: True
  9. Values: 0.0, 3.1415
  10. 3. c
  11. <type 'bool'>
  12. Nulls: False
  13. Unique values: 2
  14. 5 most frequent values:
  15. False: 2
  16. True: 1
  17. 4. d
  18. <type 'unicode'>
  19. Nulls: False
  20. Values: 2198, "Yes!", Oh, good
  21. 5. e
  22. <type 'datetime.datetime'>
  23. Nulls: True
  24. Values: 2011-11-11 11:00:00, 2014-09-15 00:00:00
  25. 6. f
  26. <type 'datetime.date'>
  27. Nulls: True
  28. Values: 2012-09-08, 1970-12-06
  29. 7. g
  30. <type 'datetime.time'>
  31. Nulls: True
  32. Values: 12:34:00, 12:07:00
  33. Row count: 3

This gives a very verbose output. For a more concise output specify one of the statistics arguments:

  • —max (maximum)
  • —min (minimum)
  • —sum (sum)
  • —mean (mean)
  • —median (median)
  • —stdev (standard deviation)

  • —nulls (whether column contains nulls)

  • —unique (unique values)
  • —freq (frequent values)
  • —len (max value length)

For example:

  1. $ csvstat data/datatypes.csv --null
  2. 1. a: False
  3. 2. b: True
  4. 3. c: False
  5. 4. d: False
  6. 5. e: True
  7. 6. f: True
  8. 7. g: True

You can select a subset of features with the -c command-line argument. This accepts both integers and column names:

  1. $ csvstat data/investments2.csv -c 2,13,19,24
  2. 2. company_name
  3. <type 'unicode'>
  4. Nulls: True
  5. Unique values: 27324
  6. 5 most frequent values:
  7. Aviir: 13
  8. Galectin Therapeutics: 12
  9. Rostima: 12
  10. Facebook: 11
  11. Lending Club: 11
  12. Max length: 66
  13. 13. investor_country_code
  14. <type 'unicode'>
  15. Nulls: True
  16. Unique values: 111
  17. 5 most frequent values:
  18. USA: 20806
  19. GBR: 2357
  20. DEU: 946
  21. CAN: 893
  22. FRA: 737
  23. Max length: 15
  24. 19. funding_round_code
  25. <type 'unicode'>
  26. Nulls: True
  27. Unique values: 15
  28. 5 most frequent values:
  29. a: 7529
  30. b: 4776
  31. c: 2452
  32. d: 1042
  33. e: 384
  34. Max length: 10
  35. 24. raised_amount_usd
  36. <type 'int'>
  37. Nulls: True
  38. Min: 0
  39. Max: 3200000000
  40. Sum: 359891203117
  41. Mean: 10370010.1748
  42. Median: 3250000
  43. Standard Deviation: 38513119.1802
  44. Unique values: 6143
  45. 5 most frequent values:
  46. 10000000: 1159
  47. 1000000: 1074
  48. 5000000: 1066
  49. 2000000: 875
  50. 3000000: 820
  51. Row count: 41799

Please note that csvstat, just like csvsql, employs heuristics to determine the data type, and therefore may not always get it right. We encourage you to always do a manual inspection as discussed in the previous subsection. Moreover, the type may be a character string or integer that doesn’t say anything about how it should be used.

As a nice extra, csvstat outputs, at the very end, the number of data points (rows). Newlines and commas inside values are handles correctly. To only see the relevant line, we can use tail:

  1. $ csvstat data/iris.csv | tail -n 1

If you only want to see the actual number number of data points, you can use, for example, the following sed expression to extract the number:

  1. $ csvstat data/iris.csv | sed -rne '${s/^([^:]+): ([0-9]+)$/\2/;p}'

7.3.2 Using R from the Command Line using Rio

In this section we would like to introduce you to a command-line tool called Rio, which is essentially a small, nifty wrapper around the statistical programming environment R. Before we explain what Rio does and why it exists, lets talk a bit about R itself.

R is a very powerful statistical software package to analyze data and create visualizations. It’s an interpreted programming language, has an extensive collection of packages, and offers its own REPL (Read-Eval-Print-Loop), which allows you, similar to the command line, to play with your data. Unfortunately, R is quite separated from the command line. Once you start it, you’re in a separate environment. R doesn’t really play well with the command line because you cannot pipe any data into it and it also doesn’t support any one-liners that you can specify.

For example, imagine that you have a CSV file called tips.csv, and you would like compute the tip percentage, and save the result. To accomplish this in R you would first startup R:

  1. $ #? [echo]
  2. $ R

And then run the following commands:

  1. > tips <- read.csv('tips.csv', header = T, sep = ',', stringsAsFactors = F)
  2. > tips.percent <- tips$tip / tips$bill * 100
  3. > cat(tips.percent, sep = '\n', file = 'percent.csv')
  4. > q("no")

Afterwards, you can continue with the saved file percent.csv on the command line. Note that there is only one command that is associated with what we want to accomplish specifically. The other commands are necessary boilerplate. Typing in this boilerplate in order to accomplish something simple is cumbersome and breaks your workflow. Sometimes, you only want to do one or two things at a time to your data. Wouldn’t it be great if we could harness the power of R and be able to use it from the command line?

This is where Rio comes in. The name Rio stands for R input/output, because it enables you to use R as a filter on the command line. You simply pipe CSV data into Rio and you specify the R commands that you want to run on it. Let’s perform the same task as before, but now using Rio:

  1. $ < data/tips.csv Rio -e 'df$tip / df$bill * 100' | head -n 10

Rio can execute multiple R command that are separated by semicolons. So, if you wanted to add a column called percent to the input data, you could do the following:

  1. $ < data/tips.csv Rio -e 'df$percent <- df$tip / df$bill * 100; df' | head

These small one-liners are possible because Rio takes care of all the boilerplate. Being able to use the command line for this and capture the power of R into a one-liner is fantastic, especially if you want to keep on working on the command line. Rio assumes that the input data is in CSV format with a header. (By specifying the -n command-line argument Rio does not consider the first row to be the header and creates default column names.) Behind the scenes, Rio writes the piped data to a temporary CSV file and creates a script that:

  • Import required libraries.
  • Loads the CSV file as a data frame.
  • Generates a ggplot2 object if needed (more on this in the next section).
  • Runs the specified commands.

  • Prints the result of the last command to standard output.

So now, if you wanted to do one or two things to your data set with R, you can specify it as a one-liner, and keep on working on the command line. All the knowledge that you already have about R can now be used from the command line. With Rio, you can even create sophisticated visualizations, as you will see later in this chapter.

Rio doesn’t have to be used as a filter, meaning the output doesn’t have to be a in CSV format per se. You can compute

  1. $ < data/iris.csv Rio -e 'mean(df$sepal_length)'
  2. $ < data/iris.csv Rio -e 'sd(df$sepal_length)'
  3. $ < data/iris.csv Rio -e 'sum(df$sepal_length)'

If we wanted to compute the five summary statistics, we would do:

  1. $ < iris.csv Rio -e 'summary(df$sepal_length)'
  2. Min. 1st Qu. Median Mean 3rd Qu. Max.
  3. 4.300 5.100 5.800 5.843 6.400 7.900

You can also compute the skewness (symmetry of the distribution) and kurtosis (peakedness of the distribution), but then you need to have the moments package installed:

  1. $ #? [echo]
  2. $ < data/iris.csv Rio -e 'skewness(df$sepal_length)'
  3. $ < data/iris.csv Rio -e 'kurtosis(df$petal_width)'

Correlation between two features:

  1. $ < tips.csv Rio -e 'cor(df$bill, df$tip)'
  2. 0.6757341

Or a correlation matrix:

  1. $ < data/tips.csv csvcut -c bill,tip | Rio -f cor | csvlook
  2. |--------------------+--------------------|
  3. | bill | tip |
  4. |--------------------+--------------------|
  5. | 1 | 0.675734109211365 |
  6. | 0.675734109211365 | 1 |
  7. |--------------------+--------------------|

Note that with the command-line argument -f, we can specify the function to apply to the data frame df. In this case, it is the same as -e cor(df).

You can even create a stem plot (Tukey 1977) using Rio:

  1. $ < data/iris.csv Rio -e 'stem(df$sepal_length)'