4.3 Filter and Subset

There are two ways to remove rows from a DataFrame, one is filter (Section 4.3.1) and the other is subset (Section 4.3.2). filter was added earlier to DataFrames.jl, is more powerful and more consistent with syntax from Julia base, so that is why we start discussing filter first. subset is newer and often more convenient.

4.3.1 Filter

From this point on, we start to get into the more powerful features of DataFrames.jl. To do this, we need to learn some functions, such as select and filter. But don’t worry! It might be a relief to know that the general design goal of DataFrames.jl is to keep the number of functions that a user has to learn to a minimum15.

Like before, we resume from the grades_2020:

  1. grades_2020()
namegrade_2020
Sally1.0
Bob5.0
Alice8.5
Hank4.0

We can filter rows by using filter(source => f::Function, df). Note how this function is very similar to the function filter(f::Function, V::Vector) from Julia Base module. This is because DataFrames.jl uses multiple dispatch (see Section 2.3.3) to define a new method of filter that accepts a DataFrame as argument.

At first sight, defining and working with a function f for filtering can be a bit difficult to use in practice. Hold tight, that effort is well-paid, since it is a very powerful way of filtering data. As a simple example, we can create a function equals_alice that checks whether its input equals “Alice”:

  1. equals_alice(name::String) = name == "Alice"
  2. JDS.equals_alice("Bob")
  1. false
  1. equals_alice("Alice")
  1. true

Equipped with such a function, we can use it as our function f to filter out all the rows for which name equals “Alice”:

  1. filter(:name => equals_alice, grades_2020())
namegrade_2020
Alice8.5

Note that this doesn’t only work for DataFrames, but also for vectors:

  1. filter(equals_alice, ["Alice", "Bob", "Dave"])
  1. ["Alice"]

We can make it a bit less verbose by using an anonymous function (see Section 3.2.4.4):

  1. filter(n -> n == "Alice", ["Alice", "Bob", "Dave"])
  1. ["Alice"]

which we can also use on grades_2020:

  1. filter(:name => n -> n == "Alice", grades_2020())
namegrade_2020
Alice8.5

To recap, this function call can be read as “for each element in the column :name, let’s call the element n, check whether n equals Alice.” For some people, this is still too verbose. Luckily, Julia has added a partial function application of ==. The details are not important – just know that you can use it just like any other function:

  1. filter(:name => ==("Alice"), grades_2020())
namegrade_2020
Alice8.5

To get all the rows which are not Alice, == (equality) can be replaced by != (inequality) in all previous examples:

  1. filter(:name => !=("Alice"), grades_2020())
namegrade_2020
Sally1.0
Bob5.0
Hank4.0

Now, to show why functions are so powerful, we can come up with a slightly more complex filter. In this filter, we want to have the people whose names start with A or B and have a grade above 6:

  1. function complex_filter(name, grade)::Bool
  2. interesting_name = startswith(name, 'A') || startswith(name, 'B')
  3. interesting_grade = 6 < grade
  4. interesting_name && interesting_grade
  5. end
  1. filter([:name, :grade_2020] => complex_filter, grades_2020())
namegrade_2020
Alice8.5

4.3.2 Subset

The subset function was added to make it easier to work with missing values (Section 4.9). In contrast to filter, subset works on complete columns instead of rows or single values. If we want to use our earlier defined functions, we should wrap it inside ByRow:

  1. subset(grades_2020(), :name => ByRow(equals_alice))
namegrade_2020
Alice8.5

Also note that the DataFrame is now the first argument subset(df, args...), whereas in filter it was the second one filter(f, df). The reason for this is that Julia defines filter as filter(f, V::Vector) and DataFrames.jl chose to maintain consistency with existing Julia functions that were extended to DataFrames types by multiple dispatch.

NOTE: Most of native DataFrames.jl functions, which subset belongs to, have a consistent function signature that always takes a DataFrame as first argument.

Just like with filter, we can also use anonymous functions inside subset:

  1. subset(grades_2020(), :name => ByRow(name -> name == "Alice"))
namegrade_2020
Alice8.5

Or, the partial function application for ==:

  1. subset(grades_2020(), :name => ByRow(==("Alice")))
namegrade_2020
Alice8.5

Ultimately, let’s show the real power of subset. First, we create a dataset with some missing values:

  1. function salaries()
  2. names = ["John", "Hank", "Karen", "Zed"]
  3. salary = [1_900, 2_800, 2_800, missing]
  4. DataFrame(; names, salary)
  5. end
  6. salaries()
Table 6: Salaries.
namessalary
John1900
Hank2800
Karen2800
Zedmissing

This data is about a plausible situation where you want to figure out your colleagues’ salaries, and haven’t figured it out for Zed yet. Even though we don’t want to encourage these practices, we suspect it is an interesting example. Suppose we want to know who earns more than 2000. If we use filter, without taking the missing values into account, it will fail:

  1. filter(:salary => >(2_000), salaries())
  1. TypeError: non-boolean (Missing) used in boolean context
  2. Stacktrace:
  3. [1] (::DataFrames.var"#99#100"{Base.Fix2{typeof(>), Int64}})(x::Missing)
  4. @ DataFrames ~/.julia/packages/DataFrames/dgZn3/src/abstractdataframe/abstractdataframe.jl:1178
  5. ...

subset will also fail, but it will fortunately point us towards an easy solution:

  1. subset(salaries(), :salary => ByRow(>(2_000)))
  1. ArgumentError: missing was returned in condition number 1 but only true or false are allowed; pass skipmissing=true to skip missing values
  2. Stacktrace:
  3. [1] _and(x::Missing)
  4. @ DataFrames ~/.julia/packages/DataFrames/dgZn3/src/abstractdataframe/subset.jl:11
  5. ...

So, we just need to pass the keyword argument skipmissing=true:

  1. subset(salaries(), :salary => ByRow(>(2_000)); skipmissing=true)
namessalary
Hank2800
Karen2800

4.3 Filter and Subset - 图1 Support this project
CC BY-NC-SA 4.0 Jose Storopoli, Rik Huijzer, Lazaro Alonso