4.5 Types and Categorical Data

As discussed in Section 4.1, CSV.jl will do its best to guess the type of each column in your data. However, this won’t always work perfectly. In this section, we show why suitable types are important and we fix wrong data types. To make the types explicit, we show the text output for DataFrames instead of a pretty-formatted table. We work with the following dataset:

    using DataFrames

    function wrong_types()
        id = 1:4
        # Dates and age groups are stored as plain strings on purpose.
        date = ["28-01-2018", "03-04-2019", "01-08-2018", "22-11-2020"]
        age = ["adolescent", "adult", "infant", "adult"]
        DataFrame(; id, date, age)
    end
    wrong_types()

    4×3 DataFrame
     Row │ id     date        age
         │ Int64  String      String
    ─────┼───────────────────────────────
       1 │     1  28-01-2018  adolescent
       2 │     2  03-04-2019  adult
       3 │     3  01-08-2018  infant
       4 │     4  22-11-2020  adult

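Because the date column holds strings rather than dates, sorting compares the text character by character. The rows end up ordered by day of the month instead of chronologically: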

    sort(wrong_types(), :date)

    4×3 DataFrame
     Row │ id     date        age
         │ Int64  String      String
    ─────┼───────────────────────────────
       1 │     3  01-08-2018  infant
       2 │     2  03-04-2019  adult
       3 │     4  22-11-2020  adult
       4 │     1  28-01-2018  adolescent
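
To see where this order comes from, we can compare the strings directly. The sketch below uses only Base Julia; the variable names are ours and serve only as illustration.

```julia
date_strings = ["28-01-2018", "03-04-2019", "01-08-2018", "22-11-2020"]

# Strings are compared character by character, so the leading day digits
# decide the order and the year is only reached in case of a tie.
starts_later = first("28-01-2018") > first("03-04-2019")  # '2' > '0'

# Sorting the raw strings reproduces the order of the table above.
text_order = sort(date_strings)
```

Here `starts_later` is `true`, which is why the row from January 2018 ends up last even though it is the earliest date.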

To fix the sorting, we can use the Dates module from Julia’s standard library as described in Section 3.5.1:

    using Dates

    function fix_date_column(df::DataFrame)
        # Parse day-month-year strings such as "28-01-2018" into Date values.
        strings2dates(dates::Vector) = Date.(dates, dateformat"dd-mm-yyyy")
        dates = strings2dates(df[!, :date])
        df[!, :date] = dates
        df
    end
    fix_date_column(wrong_types())

    4×3 DataFrame
     Row │ id     date        age
         │ Int64  Date        String
    ─────┼───────────────────────────────
       1 │     1  2018-01-28  adolescent
       2 │     2  2019-04-03  adult
       3 │     3  2018-08-01  infant
       4 │     4  2020-11-22  adult
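
The format string is what makes the conversion work, so it is worth trying on a single value first. The following is a small sketch using the Dates standard library; `fmt`, `parts`, `round_trip` and `bad` are illustrative names, and the `tryparse` call is our suggestion for messy input rather than something the dataset above needs.

```julia
using Dates

fmt = dateformat"dd-mm-yyyy"   # day-month-year, as in the example data

d = Date("28-01-2018", fmt)
parts = (year(d), month(d), day(d))

# Formatting with the same pattern gives back the original text.
round_trip = Dates.format(d, fmt)

# tryparse returns nothing instead of throwing when the text does not fit.
bad = tryparse(Date, "not a date", fmt)
```

If a real file contains a few malformed entries, checking for `nothing` in this way lets us find them before converting the whole column.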

Now, sorting will work as intended:

    df = fix_date_column(wrong_types())
    sort(df, :date)

    4×3 DataFrame
     Row │ id     date        age
         │ Int64  Date        String
    ─────┼───────────────────────────────
       1 │     1  2018-01-28  adolescent
       2 │     3  2018-08-01  infant
       3 │     2  2019-04-03  adult
       4 │     4  2020-11-22  adult

For the age column, we have a similar problem:

    sort(wrong_types(), :age)

    4×3 DataFrame
     Row │ id     date        age
         │ Int64  String      String
    ─────┼───────────────────────────────
       1 │     1  28-01-2018  adolescent
       2 │     2  03-04-2019  adult
       3 │     4  22-11-2020  adult
       4 │     3  01-08-2018  infant

This isn’t right: the rows are sorted alphabetically, but an infant is younger than an adolescent or an adult. The solution for this problem, and for categorical data in general, is to use CategoricalArrays.jl:

4.5.1 CategoricalArrays.jl

    using CategoricalArrays

With the CategoricalArrays.jl package, we can add levels that represent the ordering of our categorical variable to our data:

    function fix_age_column(df)
        levels = ["infant", "adolescent", "adult"]
        ages = categorical(df[!, :age]; levels, ordered=true)
        df[!, :age] = ages
        df
    end
    fix_age_column(wrong_types())

    4×3 DataFrame
     Row │ id     date        age
         │ Int64  String      Cat…
    ─────┼───────────────────────────────
       1 │     1  28-01-2018  adolescent
       2 │     2  03-04-2019  adult
       3 │     3  01-08-2018  infant
       4 │     4  22-11-2020  adult

NOTE: We also pass the argument ordered=true, which tells CategoricalArrays.jl’s categorical function that our categorical data is “ordered.” Without it, comparisons with < or > raise an error, although sorting still follows the order of the levels.
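
The effect of the ordered flag can be checked on a plain vector, without a DataFrame. The sketch below reflects our understanding of CategoricalArrays.jl; the variable names are illustrative, and the `try` block catches any exception rather than relying on a specific error type.

```julia
using CategoricalArrays

ages = ["adolescent", "adult", "infant", "adult"]
lvls = ["infant", "adolescent", "adult"]

ordered_ages = categorical(ages; levels=lvls, ordered=true)
unordered_ages = categorical(ages; levels=lvls)   # ordered defaults to false

# Each value is stored as a position in the list of levels.
codes = levelcode.(ordered_ages)

# With ordered=true, < compares those positions: infant < adolescent.
infant_is_younger = ordered_ages[3] < ordered_ages[1]

# Without the flag, the same comparison is rejected.
comparison_failed = try
    unordered_ages[3] < unordered_ages[1]
    false
catch
    true
end
```

Here `codes` is `[2, 3, 1, 3]`, `isordered(ordered_ages)` is `true`, and `isordered(unordered_ages)` is `false`.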

Now, we can sort the data correctly on the age column:

    df = fix_age_column(wrong_types())
    sort(df, :age)

    4×3 DataFrame
     Row │ id     date        age
         │ Int64  String      Cat…
    ─────┼───────────────────────────────
       1 │     3  01-08-2018  infant
       2 │     1  28-01-2018  adolescent
       3 │     2  03-04-2019  adult
       4 │     4  22-11-2020  adult

Because we have defined convenient functions, we can now build the fixed dataset by simply calling them one after the other:

    function correct_types()
        df = wrong_types()
        df = fix_date_column(df)
        df = fix_age_column(df)
    end
    correct_types()

    4×3 DataFrame
     Row │ id     date        age
         │ Int64  Date        Cat…
    ─────┼───────────────────────────────
       1 │     1  2018-01-28  adolescent
       2 │     2  2019-04-03  adult
       3 │     3  2018-08-01  infant
       4 │     4  2020-11-22  adult
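
Besides reading the printed header, we can also inspect the column types programmatically. The sketch below builds the same table directly so that it runs on its own; it assumes DataFrames.jl and CategoricalArrays.jl are installed, and `column_types` is just an illustrative name.

```julia
using CategoricalArrays, DataFrames, Dates

df = DataFrame(;
    id = 1:4,
    date = Date.(["28-01-2018", "03-04-2019", "01-08-2018", "22-11-2020"],
                 dateformat"dd-mm-yyyy"),
    age = categorical(["adolescent", "adult", "infant", "adult"];
                      levels=["infant", "adolescent", "adult"], ordered=true),
)

# One element type per column, in column order.
column_types = eltype.(eachcol(df))
```

The second entry of `column_types` is `Date`, and the third is a `CategoricalValue` type, which the table header abbreviates.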

Since age in our data is ordinal (ordered=true), we can properly compare categories of age:

    df = correct_types()
    a = df[1, :age]
    b = df[2, :age]
    a < b

    true

If the element type were String, such comparisons would be alphabetical and could give the wrong answer:

    "infant" < "adult"

    false
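
As a rough mental model, an ordered categorical column behaves as if every label were looked up in a table of ranks before being compared. The sketch below imitates this in Base Julia with a hypothetical `rank` dictionary; it is only an illustration and not how CategoricalArrays.jl is implemented.

```julia
# Strings compare character by character, so 'i' > 'a' decides this one.
as_text = "infant" < "adult"

# Hypothetical rank table that encodes the intended order of the labels.
rank = Dict("infant" => 1, "adolescent" => 2, "adult" => 3)
by_rank = rank["infant"] < rank["adult"]

# Sorting by rank gives the order we expect for age groups.
sorted_ages = sort(["adolescent", "adult", "infant", "adult"]; by = label -> rank[label])
```

Here `as_text` is `false` while `by_rank` is `true`. A categorical column gives us the same result without having to carry a separate lookup table around.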

CC BY-NC-SA 4.0 Jose Storopoli, Rik Huijzer, Lazaro Alonso