4.5 Types and Categorical Data
As discussed in Section 4.1, CSV.jl will do its best to guess the type of each column in your data. However, this won’t always work perfectly. In this section, we show why suitable types are important and how to fix wrong ones. To make the types explicit, we show the text output for DataFrames instead of a pretty-formatted table. In this section, we work with the following dataset:
function wrong_types()
    id = 1:4
    date = ["28-01-2018", "03-04-2019", "01-08-2018", "22-11-2020"]
    age = ["adolescent", "adult", "infant", "adult"]
    DataFrame(; id, date, age)
end
wrong_types()
4×3 DataFrame
 Row │ id     date        age
     │ Int64  String      String
─────┼───────────────────────────────
   1 │     1  28-01-2018  adolescent
   2 │     2  03-04-2019  adult
   3 │     3  01-08-2018  infant
   4 │     4  22-11-2020  adult
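The column types in the header can also be checked programmatically. The following is a minimal sketch, assuming DataFrames.jl is installed; the variable name `col_types` is only illustrative.

```julia
using DataFrames

# Same data as wrong_types(), built inline so the sketch runs on its own.
df = DataFrame(;
    id = 1:4,
    date = ["28-01-2018", "03-04-2019", "01-08-2018", "22-11-2020"],
    age = ["adolescent", "adult", "infant", "adult"],
)

# Element type of every column, in column order.
col_types = eltype.(eachcol(df))
println(col_types)

# Pair each column name with its type for easier reading.
for (name, T) in zip(names(df), col_types)
    println(name, " => ", T)
end
```

Both date and age show up as String here, which is the problem the rest of this section addresses.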
Because the date column has the wrong type, sorting won’t work correctly:
sort(wrong_types(), :date)
4×3 DataFrame
 Row │ id     date        age
     │ Int64  String      String
─────┼───────────────────────────────
   1 │     3  01-08-2018  infant
   2 │     2  03-04-2019  adult
   3 │     4  22-11-2020  adult
   4 │     1  28-01-2018  adolescent
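The order above follows from how strings compare: character by character, from left to right. A small sketch with plain vectors (no packages needed) shows that the day field ends up deciding the order:

```julia
# Strings compare character by character, so "dd-mm-yyyy" text sorts by day first.
dates = ["28-01-2018", "03-04-2019", "01-08-2018", "22-11-2020"]

sorted_as_text = sort(dates)
println(sorted_as_text)

# A 2019 date counts as "smaller" than a 2018 date because '0' < '2'.
println("03-04-2019" < "28-01-2018")
```

The sorted vector starts with the days 01, 03, 22, and 28, regardless of the year.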
To fix the sorting, we can use the Dates module from Julia’s standard library as described in Section 3.5.1:
function fix_date_column(df::DataFrame)
    strings2dates(dates::Vector) = Date.(dates, dateformat"dd-mm-yyyy")
    dates = strings2dates(df[!, :date])
    df[!, :date] = dates
    df
end
fix_date_column(wrong_types())
4×3 DataFrame
 Row │ id     date        age
     │ Int64  Date        String
─────┼───────────────────────────────
   1 │     1  2018-01-28  adolescent
   2 │     2  2019-04-03  adult
   3 │     3  2018-08-01  infant
   4 │     4  2020-11-22  adult
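The conversion relies on parsing each string with an explicit date format. As a hedged sketch using only the Dates standard library, the snippet below parses one value and then shows `tryparse`, which returns `nothing` instead of throwing when a string does not match. The `raw` vector is hypothetical input, not part of the dataset.

```julia
using Dates

fmt = dateformat"dd-mm-yyyy"

# Parse one string; Date throws an ArgumentError if the text does not match fmt.
d = Date("28-01-2018", fmt)

# tryparse returns nothing instead of throwing, which helps with messy input.
# These strings are hypothetical examples.
raw = ["03-04-2019", "not a date"]
parsed = [tryparse(Date, s, fmt) for s in raw]

println(d)
println(parsed)
```

Whether to fail loudly with `Date` or collect `nothing` values with `tryparse` depends on how much you trust the input.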
Now, sorting will work as intended:
df = fix_date_column(wrong_types())
sort(df, :date)
4×3 DataFrame
 Row │ id     date        age
     │ Int64  Date        String
─────┼───────────────────────────────
   1 │     1  2018-01-28  adolescent
   2 │     3  2018-08-01  infant
   3 │     2  2019-04-03  adult
   4 │     4  2020-11-22  adult
For the age column, we have a similar problem:
sort(wrong_types(), :age)
4×3 DataFrame
 Row │ id     date        age
     │ Int64  String      String
─────┼───────────────────────────────
   1 │     1  28-01-2018  adolescent
   2 │     2  03-04-2019  adult
   3 │     4  22-11-2020  adult
   4 │     3  01-08-2018  infant
This isn’t right, because an infant is younger than adults and adolescents. The solution for this issue and any sort of categorical data is to use CategoricalArrays.jl:
4.5.1 CategoricalArrays.jl
using CategoricalArrays
With the CategoricalArrays.jl package, we can add levels that represent the ordering of our categorical variable to our data:
function fix_age_column(df)
    levels = ["infant", "adolescent", "adult"]
    ages = categorical(df[!, :age]; levels, ordered=true)
    df[!, :age] = ages
    df
end
fix_age_column(wrong_types())
4×3 DataFrame
 Row │ id     date        age
     │ Int64  String      Cat…
─────┼───────────────────────────────
   1 │     1  28-01-2018  adolescent
   2 │     2  03-04-2019  adult
   3 │     3  01-08-2018  infant
   4 │     4  22-11-2020  adult
NOTE: We also pass the argument ordered=true, which tells CategoricalArrays.jl’s categorical function that our categorical data is “ordered.” Without this, comparisons such as < and > between categories would not be possible.
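To make the effect of `ordered=true` concrete, here is a small sketch assuming CategoricalArrays.jl is installed. It builds the same values once without and once with the flag; the names `lvls`, `unordered`, `ordered`, and `threw` are only illustrative.

```julia
using CategoricalArrays

age = ["adolescent", "adult", "infant", "adult"]
lvls = ["infant", "adolescent", "adult"]

unordered = categorical(age; levels=lvls)             # ordered defaults to false
ordered = categorical(age; levels=lvls, ordered=true)

println(isordered(unordered))
println(isordered(ordered))
println(levels(ordered))

# With ordered=true, < follows the level order rather than the alphabet.
println(ordered[3] < ordered[1])   # compares "infant" with "adolescent"

# On the unordered array, < is expected to throw an ArgumentError.
threw = try
    unordered[3] < unordered[1]
    false
catch err
    err isa ArgumentError
end
println(threw)
```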
Now, we can sort the data correctly on the age column:
df = fix_age_column(wrong_types())
sort(df, :age)
4×3 DataFrame
 Row │ id     date        age
     │ Int64  String      Cat…
─────┼───────────────────────────────
   1 │     3  01-08-2018  infant
   2 │     1  28-01-2018  adolescent
   3 │     2  03-04-2019  adult
   4 │     4  22-11-2020  adult
Because we have defined convenient functions, we can now define our fixed data by just performing the function calls:
function correct_types()
    df = wrong_types()
    df = fix_date_column(df)
    df = fix_age_column(df)
end
correct_types()
4×3 DataFrame
 Row │ id     date        age
     │ Int64  Date        Cat…
─────┼───────────────────────────────
   1 │     1  2018-01-28  adolescent
   2 │     2  2019-04-03  adult
   3 │     3  2018-08-01  infant
   4 │     4  2020-11-22  adult
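Because each fix takes a DataFrame and returns it, the same steps can also be chained with Julia’s pipe operator. This is only a sketch of an alternative style, assuming DataFrames.jl and CategoricalArrays.jl are installed; the function bodies are condensed versions of the ones in this section, and `piped` is an illustrative name.

```julia
using CategoricalArrays, DataFrames, Dates

# Definitions mirror the ones in this section so the sketch is self-contained.
function wrong_types()
    id = 1:4
    date = ["28-01-2018", "03-04-2019", "01-08-2018", "22-11-2020"]
    age = ["adolescent", "adult", "infant", "adult"]
    DataFrame(; id, date, age)
end

function fix_date_column(df::DataFrame)
    df[!, :date] = Date.(df[!, :date], dateformat"dd-mm-yyyy")
    df
end

function fix_age_column(df::DataFrame)
    levels = ["infant", "adolescent", "adult"]
    df[!, :age] = categorical(df[!, :age]; levels, ordered=true)
    df
end

# Same steps as correct_types, written as a pipeline.
piped = wrong_types() |> fix_date_column |> fix_age_column

# Check the column types directly rather than relying on the printed header.
println(eltype(piped.date))
println(isordered(piped.age))
```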
Since age in our data is ordinal (ordered=true), we can properly compare categories of age:
df = correct_types()
a = df[1, :age]
b = df[2, :age]
a < b
true
With plain strings as the element type, the comparison would be wrong, because strings are compared alphabetically:
"infant" < "adult"
false
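The difference between the two comparisons can be sketched without any package by ranking each value by its position in an explicit list of levels. This is only an illustration of the idea, not how CategoricalArrays.jl is implemented; `lvls` and `rank` are hypothetical helper names.

```julia
# Levels listed from youngest to oldest.
lvls = ["infant", "adolescent", "adult"]

# Position of a value in the level list (1 for "infant", 3 for "adult").
rank(x) = findfirst(==(x), lvls)

# Alphabetical comparison: 'i' comes after 'a', so this is false.
println("infant" < "adult")

# Comparison by level position gives the intended answer.
println(rank("infant") < rank("adult"))

# Sorting by level position instead of by text.
ages = ["adolescent", "adult", "infant", "adult"]
println(sort(ages; by=rank))
```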
CC BY-NC-SA 4.0 Jose Storopoli, Rik Huijzer, Lazaro Alonso