10. Categorical

Since version 0.15, pandas can store Categorical data in a DataFrame. For a full description, see the Categorical introduction and the API documentation.

  In [127]: df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']})

1. Convert the raw grades to the Categorical data type:

  In [128]: df["grade"] = df["raw_grade"].astype("category")
  In [129]: df["grade"]
  Out[129]:
  0    a
  1    b
  2    b
  3    a
  4    a
  5    e
  Name: grade, dtype: category
  Categories (3, object): [a, b, e]
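
To see what the conversion actually stores, the sketch below inspects the category labels and the per-row integer codes. It rebuilds the same frame so that it runs on its own; the variable names `categories` and `codes` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                   "raw_grade": ["a", "b", "b", "a", "a", "e"]})
df["grade"] = df["raw_grade"].astype("category")

# The distinct labels are stored once; astype("category") sorts them
# when the values are sortable.
categories = list(df["grade"].cat.categories)

# Each row holds a small integer that indexes into the labels above.
codes = df["grade"].cat.codes.tolist()

print(categories)  # ['a', 'b', 'e']
print(codes)       # [0, 1, 1, 0, 0, 2]
```

Renaming a category later only changes the label table; the codes stay the same.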

2. Rename the categories to more meaningful names:

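  # Assigning to .cat.categories was removed in pandas 2.0; rename_categories returns a renamed copy.
  In [130]: df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])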
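
A positional list of new names depends on the current category order, which is easy to get wrong. As a hedged alternative, `rename_categories` also accepts a mapping from old label to new label; the sketch below uses a standalone Series named `grade` rather than the tutorial's DataFrame.

```python
import pandas as pd

grade = pd.Series(["a", "b", "b", "a", "a", "e"], dtype="category")

# Map each old label to its new name explicitly, so the result does not
# depend on the order in which the categories happen to be stored.
renamed = grade.cat.rename_categories(
    {"a": "very good", "b": "good", "e": "very bad"}
)

print(list(renamed.cat.categories))  # ['very good', 'good', 'very bad']
print(renamed.tolist())
```

Only the labels change; every row keeps the integer code it had before.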

3. Reorder the categories and add the missing ones at the same time:

  In [131]: df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
  In [132]: df["grade"]
  Out[132]:
  0    very good
  1         good
  2         good
  3    very good
  4    very good
  5     very bad
  Name: grade, dtype: category
  Categories (5, object): [very bad, bad, medium, good, very good]
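
One caveat with `set_categories` is worth illustrating: any existing value whose label is missing from the new list becomes NaN. The sketch below shows this on a small standalone Series and contrasts it with `add_categories`, which only appends labels. The names `trimmed` and `extended` are illustrative.

```python
import pandas as pd

grade = pd.Series(["very good", "good", "very bad"], dtype="category")

# "very bad" is not in the new list, so that row is turned into NaN.
trimmed = grade.cat.set_categories(["good", "very good"])
print(trimmed.isna().tolist())  # [False, False, True]

# add_categories appends unused labels and leaves existing values alone.
extended = grade.cat.add_categories(["bad", "medium"])
print(extended.isna().any())    # False
```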

4. Sorting follows the order of the categories, not lexical order:

  In [133]: df.sort_values(by="grade")
  Out[133]:
     id raw_grade      grade
  5   6         e   very bad
  1   2         b       good
  2   3         b       good
  0   1         a  very good
  3   4         a  very good
  4   5         a  very good
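
The difference from lexical sorting is easiest to see side by side. This minimal sketch sorts the same labels once as plain strings and once as a Categorical with an explicit category order; `labels` and `order` simply mirror the values used in this section.

```python
import pandas as pd

labels = ["very good", "good", "good", "very good", "very good", "very bad"]
order = ["very bad", "bad", "medium", "good", "very good"]

as_text = pd.Series(labels)
as_cat = pd.Series(pd.Categorical(labels, categories=order))

# Plain strings sort alphabetically: "good" < "very bad" < "very good".
lexical = as_text.sort_values().tolist()

# Categorical values sort by their position in `order`.
by_category = as_cat.sort_values().tolist()

print(lexical)      # ['good', 'good', 'very bad', 'very good', 'very good', 'very good']
print(by_category)  # ['very bad', 'good', 'good', 'very good', 'very good', 'very good']
```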

5. Grouping by a Categorical column also shows the empty categories:

  # observed=False (available since pandas 0.23) keeps categories that have no rows.
  In [134]: df.groupby("grade", observed=False).size()
  Out[134]:
  grade
  very bad     1
  bad          0
  medium       0
  good         2
  very good    3
  dtype: int64
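
Whether the empty categories appear is controlled by the `observed` argument of `groupby`. The sketch below, which assumes pandas 0.23 or later, builds a small frame with the same grades and compares both settings; `all_levels` and `seen_only` are illustrative names.

```python
import pandas as pd

order = ["very bad", "bad", "medium", "good", "very good"]
df = pd.DataFrame({
    "grade": pd.Categorical(
        ["very good", "good", "good", "very good", "very good", "very bad"],
        categories=order,
    )
})

# observed=False reports every category, including those with no rows.
all_levels = df.groupby("grade", observed=False).size()

# observed=True reports only the categories that occur in the data.
seen_only = df.groupby("grade", observed=True).size()

print(all_levels.tolist())  # [1, 0, 0, 2, 3]
print(seen_only.tolist())   # [1, 2, 3]
```

Passing `observed` explicitly keeps the result stable across pandas versions, since the default has changed over time.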