5. 分组聚合后unstacking

  1. # 读取employee数据集,求出每个种族的平均工资
  2. In[34]: employee = pd.read_csv('data/employee.csv')
  3. In[35]: employee.groupby('RACE')['BASE_SALARY'].mean().astype(int)
  4. out[35]: RACE
  5. American Indian or Alaskan Native 60272
  6. Asian/Pacific Islander 61660
  7. Black or African American 50137
  8. Hispanic/Latino 52345
  9. Others 51278
  10. White 64419
  11. Name: BASE_SALARY, dtype: int64
  1. # 对种族和性别分组,求平均工资
  2. In[36]: agg = employee.groupby(['RACE', 'GENDER'])['BASE_SALARY'].mean().astype(int)
  3. agg
  4. out[36]: RACE GENDER
  5. American Indian or Alaskan Native Female 60238
  6. Male 60305
  7. Asian/Pacific Islander Female 63226
  8. Male 61033
  9. Black or African American Female 48915
  10. Male 51082
  11. Hispanic/Latino Female 46503
  12. Male 54782
  13. Others Female 63785
  14. Male 38771
  15. White Female 66793
  16. Male 63940
  17. Name: BASE_SALARY, dtype: int64
  1. # 对索引层GENDER做unstack
  2. In[37]: agg.unstack('GENDER')
  3. out[37]:

5. 分组聚合后unstacking - 图1

  1. # 对索引层RACE做unstack
  2. In[38]: agg.unstack('RACE')
  3. out[38]:

5. 分组聚合后unstacking - 图2

更多

  1. # 按RACE和GENDER分组,求工资的平均值、最大值和最小值
  2. In[39]: agg2 = employee.groupby(['RACE', 'GENDER'])['BASE_SALARY'].agg(['mean', 'max', 'min']).astype(int)
  3. agg2
  4. out[39]:

5. 分组聚合后unstacking - 图3

  1. # 此时unstack('GENDER')会生成多级列索引,可以用stack和unstack调整结构
  2. agg2.unstack('GENDER')

5. 分组聚合后unstacking - 图4