7.2 Inspecting Data and its Properties

In this section we’ll demonstrate how to inspect your data set and its properties. Because the upcoming visualization and modeling techniques expect the data to be in tabular format, we’ll assume that the data is in CSV format. You can use the techniques described in Chapter 5 to convert your data to CSV if necessary.

For simplicity’s sake, we’ll also assume that your data has a header. In the first subsection we’ll determine whether that is the case. Once we know we have a header, we can continue answering the following questions:

  • How many data points and features does the data set have?
  • What does the raw data look like?
  • What kind of features does the data set have?
  • Can some of these features be treated as categorical or as factors?

7.2.1 Header Or Not, Here I Come

You can check whether your file has a header by printing the first few lines:

  $ head file.csv | csvlook

It is then up to you to decide whether the first line is indeed a header or already the first data point. When the data set contains no header or when its header contains newlines, you’re best off going back to Chapter 5 and correcting that.
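When the header is merely missing (and the columns are otherwise fine), you can also prepend one yourself. Here is a minimal sketch; the file and column names are made up for illustration:

```shell
# Create a headerless CSV (hypothetical example data)
printf '5.1,3.5,setosa\n4.9,3.0,setosa\n' > noheader.csv

# Prepend a header line; cat's "-" reads the header from stdin
echo 'sepal_length,sepal_width,species' | cat - noheader.csv > withheader.csv

head -n 1 withheader.csv
```

Because the header travels through stdin, the original file is left untouched.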

7.2.2 Inspect All The Data

If you want to inspect the raw data, then it’s best not to use the cat command-line tool, since cat prints all the data to the screen in one go. In order to inspect the raw data at your own pace, we recommend using less (Nudelman 2013) with the -S command-line argument:

  $ less -S file.csv

The -S command-line argument ensures that long lines are not wrapped when they don’t fit in the terminal. Instead, less allows you to scroll horizontally to see the rest of the lines. The advantage of less is that it does not load the entire file into memory, which is good for viewing large files. Once you’re in less, you can scroll down a full screen by pressing <Space>. Scrolling horizontally is done by pressing <Left> and <Right>. Press g and G to go to the start and the end of the file, respectively. Quitting less is done by pressing q. Read the man page for more key bindings.

If you want the data set to be nicely formatted, you can add in csvlook:

  $ < file.csv csvlook | less -S

Unfortunately, csvlook needs to read the entire file into memory in order to determine the width of the columns. So, when you want to inspect a very large file, you may either want to take a subset (using sample, for example) or be patient.
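The subsetting approach can be sketched with head, which keeps the header plus the first data rows. The file below is a made-up stand-in for a large CSV:

```shell
# Stand-in for a large CSV file (hypothetical data)
printf 'id,value\n' > big.csv
seq 1 1000 | awk '{ print $1 "," $1 * 2 }' >> big.csv

# Keep the header plus the first 99 data rows
head -n 100 big.csv > subset.csv
wc -l < subset.csv
```

The resulting subset.csv can then be piped through csvlook | less -S without the long wait.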

7.2.3 Feature Names and Data Types

In order to gain insight into the data set, it is useful to print the feature names and study them. After all, the feature names may indicate the meaning of the feature. You can use the following sed expression for this:

  $ < data/iris.csv sed -e 's/,/\n/g;q'

Note that this basic command assumes that the file is delimited by commas. Just as a reminder: if you intend to use this command often, you could define a function in your .bashrc file called, say, names:

Example 7.1

  names () { sed -e 's/,/\n/g;q'; }

You can then use it like this:

  $ < data/investments.csv names
  company_permalink
  company_name
  company_category_list
  company_market
  company_country_code
  company_state_code
  company_region
  company_city
  investor_permalink
  investor_name
  investor_category_list
  investor_market
  investor_country_code
  investor_state_code
  investor_region
  investor_city
  funding_round_permalink
  funding_round_type
  funding_round_code
  funded_at
  funded_month
  funded_quarter
  funded_year
  raised_amount_usd

We can go a step further than just printing the column names. Besides the names of the columns, it would be very useful to know what type of values each column contains. Examples of data types are a string of characters, a numerical value, or a date. Assume that we have the following toy data set:

  $ < data/datatypes.csv csvlook
  |-----+--------+-------+----------+------------------+------------+----------|
  |  a  | b      | c     | d        | e                | f          | g        |
  |-----+--------+-------+----------+------------------+------------+----------|
  |  2  | 0.0    | FALSE | "Yes!"   | 2011-11-11 11:00 | 2012-09-08 | 12:34    |
  |  42 | 3.1415 | TRUE  | Oh, good | 2014-09-15       | 12/6/70    | 0:07 PM  |
  |  66 |        | False | 2198     |                  |            |          |
  |-----+--------+-------+----------+------------------+------------+----------|

We’ve already used csvsql in Chapter 5 to execute SQL queries directly on CSV data. When no command-line arguments are passed, it generates the SQL statement that would be needed if we were to insert this data into an actual database. We can also use this output ourselves to inspect the inferred column types:

  $ csvsql data/datatypes.csv
  CREATE TABLE datatypes (
          a INTEGER NOT NULL,
          b FLOAT,
          c BOOLEAN NOT NULL,
          d VARCHAR(8) NOT NULL,
          e DATETIME,
          f DATE,
          g TIME,
          CHECK (c IN (0, 1))
  );

The table below provides an overview of what the various SQL data types mean. If a column has the NOT NULL string printed after its data type, then that column contains no missing values.

Python versus SQL data types

  |------------------+-------------------+----------|
  | Type             | Python            | SQL      |
  |------------------+-------------------+----------|
  | Character string | unicode           | VARCHAR  |
  | Boolean          | bool              | BOOLEAN  |
  | Integer          | int               | INTEGER  |
  | Real number      | float             | FLOAT    |
  | Date             | datetime.date     | DATE     |
  | Time             | datetime.time     | TIME     |
  | Date and time    | datetime.datetime | DATETIME |
  |------------------+-------------------+----------|

7.2.4 Unique Identifiers, Continuous Variables, and Factors

Knowing the data type of each feature is not enough. It is also essential to know what each feature represents. Domain knowledge is very useful here; however, we may also get some ideas from the data itself.

Both a string and an integer could be a unique identifier or could represent a category. In the latter case, it could be used to assign colors in a visualization. But if an integer denotes, say, a ZIP code, then it doesn’t make sense to compute its average.
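For instance, here is a small sketch (with made-up ZIP codes) that treats an integer column as categorical by building a frequency table instead of computing numerical statistics:

```shell
# Hypothetical data: ZIP codes are integers, but categories in disguise
printf 'zip\n90210\n10001\n90210\n' > zips.csv

# Skip the header, then count occurrences of each value
tail -n +2 zips.csv | sort | uniq -c | sort -rn
```

Counting occurrences per value is meaningful here, whereas a mean of ZIP codes would not be.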

To determine whether a feature should be treated as a unique identifier or categorical variable (or factor in R terms), you could count the number of unique values for a specific column:

  $ cat data/iris.csv | csvcut -c species | body "sort | uniq | wc -l"
  species
  3
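If you don’t have the body helper available, a plain-shell sketch (on a made-up sample file) achieves the same count by skipping the header with tail:

```shell
# Hypothetical miniature version of the iris data set
printf 'sepal_length,species\n5.1,setosa\n4.9,setosa\n7.0,versicolor\n6.3,virginica\n' > iris_sample.csv

# Skip the header, keep the species column, count distinct values
n=$(tail -n +2 iris_sample.csv | cut -d, -f2 | sort -u | wc -l)
echo "$n"
```

Note that sort -u combines the sort | uniq pair into a single step.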

Or we can use csvstat (Groskopf 2014a), which is part of csvkit, to get the number of unique values for each column:

  $ csvstat data/investments2.csv --unique
    1. company_permalink: 27342
    2. company_name: 27324
    3. company_category_list: 8759
    4. company_market: 443
    5. company_country_code: 150
    6. company_state_code: 147
    7. company_region: 1079
    8. company_city: 3305
    9. investor_permalink: 11176
   10. investor_name: 11135
   11. investor_category_list: 468
   12. investor_market: 134
   13. investor_country_code: 111
   14. investor_state_code: 80
   15. investor_region: 549
   16. investor_city: 1198
   17. funding_round_permalink: 41790
   18. funding_round_type: 13
   19. funding_round_code: 15
   20. funded_at: 3595
   21. funded_month: 295
   22. funded_quarter: 121
   23. funded_year: 34
   24. raised_amount_usd: 6143

If the number of unique values is low compared to the number of rows, then that feature may indeed be treated as a categorical one (such as funding_round_type). If the number is equal to the number of rows, it may be a unique identifier (such as company_permalink).
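The comparison itself is easy to script. Here is a minimal sketch (on a made-up file) that prints the number of distinct values next to the number of rows for one column:

```shell
# Hypothetical data: four rows, two distinct values in the "type" column
printf 'id,type\n1,a\n2,a\n3,b\n4,b\n' > toy.csv

rows=$(tail -n +2 toy.csv | wc -l)
distinct=$(tail -n +2 toy.csv | cut -d, -f2 | sort -u | wc -l)
echo "$distinct distinct values in $rows rows"
```

A ratio close to one suggests a unique identifier; a ratio close to zero suggests a factor.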