4.2 Index and Summarize

Let’s go back to the example grades_2020() data defined before:

    grades_2020()

| name  | grade_2020 |
| ----- | ---------- |
| Sally | 1.0        |
| Bob   | 5.0        |
| Alice | 8.5        |
| Hank  | 4.0        |

To retrieve the name column as a vector, we can access the DataFrame with the dot (.) syntax, as we did previously with structs in Section 3:

    function names_grades1()
        df = grades_2020()
        df.name
    end
    JDS.names_grades1()

    ["Sally", "Bob", "Alice", "Hank"]

or we can index a DataFrame much like an Array, using symbols and special characters such as ! and :. The first index selects rows and the second index selects columns:

    function names_grades2()
        df = grades_2020()
        df[!, :name]
    end
    JDS.names_grades2()

    ["Sally", "Bob", "Alice", "Hank"]

Note that df.name is exactly the same as df[!, :name], which you can verify yourself by doing:

    julia> df = DataFrame(id=[1]);
    julia> @edit df.name

In both cases, you get the column :name. There is also df[:, :name], which returns a copy of the column :name. In most cases, df[!, :name] is the best bet: it returns the stored column itself without copying, so it avoids an allocation and lets you modify the column in place.
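The difference between ! and : can be checked directly. The following is a minimal sketch, assuming DataFrames.jl is installed; the table is rebuilt inline from the values shown above rather than taken from grades_2020(), and the variable names stored and copied are only illustrative.

```julia
using DataFrames

# Inline stand-in for grades_2020(), using the values from the table above.
df = DataFrame(name=["Sally", "Bob", "Alice", "Hank"],
               grade_2020=[1.0, 5.0, 8.5, 4.0])

stored = df[!, :name]   # the column vector held by the DataFrame (no copy)
copied = df[:, :name]   # a freshly allocated copy of that column

println(stored === df.name)   # true: same object as df.name
println(copied === df.name)   # false: a different object

copied[1] = "Sam"             # changes only the copy
println(df.name[1])           # still "Sally"

stored[1] = "Sara"            # changes the DataFrame itself
println(df.name[1])           # now "Sara"
```

If you want to change the table, mutate through ! (or df.name); use : when the result should be independent of the DataFrame.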

To select a row, say the second one, we use the first index as the row index:

    df = grades_2020()
    df[2, :]

| name | grade_2020 |
| ---- | ---------- |
| Bob  | 5.0        |

or create a function to give us any row i we want:

    function grade_2020(i::Int)
        df = grades_2020()
        df[i, :]
    end
    JDS.grade_2020(2)

| name | grade_2020 |
| ---- | ---------- |
| Bob  | 5.0        |

We can also get only names for the first 2 rows using slicing (again similar to an Array):

    grades_indexing(df) = df[1:2, :name]
    JDS.grades_indexing(grades_2020())

    ["Sally", "Bob"]
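The indexing forms above return different kinds of objects depending on how the rows and columns are selected. The sketch below is illustrative only: it assumes DataFrames.jl is installed and rebuilds the table inline instead of calling grades_2020(); the form df[1:2, [:name]] does not appear in the text and is included for comparison.

```julia
using DataFrames

# Inline stand-in for grades_2020(), using the values from the table above.
df = DataFrame(name=["Sally", "Bob", "Alice", "Hank"],
               grade_2020=[1.0, 5.0, 8.5, 4.0])

row   = df[2, :]            # one row, all columns: a DataFrameRow
cell  = df[2, :name]        # one row, one column: a single value
slice = df[1:2, :name]      # several rows, one column: a Vector
sub   = df[1:2, [:name]]    # several rows, a list of columns: a DataFrame

println(row isa DataFrameRow)   # true
println(cell)                   # Bob
println(slice)                  # ["Sally", "Bob"]
println(size(sub))              # (2, 1)
```

Selecting a single column with a symbol gives a vector, while wrapping the column in a vector, as in [:name], keeps the result as a DataFrame.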

If we assume that all names in the table are unique, we can also write a function to obtain the grade for a person via their name. To do so, we convert the table back to one of Julia’s basic data structures (see Section 3.3) which is capable of creating mappings, namely Dicts:

    function grade_2020(name::String)
        df = grades_2020()
        dic = Dict(zip(df.name, df.grade_2020))
        dic[name]
    end
    grade_2020("Bob")

    5.0

which works because zip loops through df.name and df.grade_2020 at the same time like a “zipper”:

    df = grades_2020()
    collect(zip(df.name, df.grade_2020))

    ("Sally", 1.0)
    ("Bob", 5.0)
    ("Alice", 8.5)
    ("Hank", 4.0)

However, converting a DataFrame to a Dict is only useful when the keys (here, the names) are unique, because a Dict keeps only one value per key. Generally that is not the case, which is why we need to learn how to filter a DataFrame.
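To see why uniqueness matters, consider what happens when a name occurs twice. The sketch below uses only base Julia; the data is hypothetical (a second "Bob" with grade 9.0 does not appear in the original table) and the variable names are made up for illustration.

```julia
# Hypothetical data with a duplicated name; not part of grades_2020().
student_names  = ["Sally", "Bob", "Bob"]
student_grades = [1.0, 5.0, 9.0]

# Same construction as in grade_2020(name::String) above.
lookup = Dict(zip(student_names, student_grades))

println(length(lookup))   # 2: only one entry per distinct name
println(lookup["Bob"])    # 9.0: the later pair overwrote the earlier one
```

The first grade for "Bob" is lost without any warning, so a Dict lookup is safe only when the key column has no duplicates.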

CC BY-NC-SA 4.0 Jose Storopoli, Rik Huijzer, Lazaro Alonso