6. Highlighting the maximum value of each column

    In[61]: pd.options.display.max_rows = 8
    # Read the college dataset, using INSTNM as the index
    In[62]: college = pd.read_csv('data/college.csv', index_col='INSTNM')
            college.dtypes
    Out[62]: CITY                   object
             STABBR                 object
             HBCU                  float64
             MENONLY               float64
                                    ...
             PCTFLOAN              float64
             UG25ABV               float64
             MD_EARN_WNE_P10        object
             GRAD_DEBT_MDN_SUPP     object
             Length: 26, dtype: object
    # MD_EARN_WNE_P10 and GRAD_DEBT_MDN_SUPP have the object dtype; inspecting a value shows that it is stored as a string
    In[63]: college.MD_EARN_WNE_P10.iloc[0]
    Out[63]: '30300'
    # Sort in descending order to see which values come first; a non-numeric string appears
    In[65]: college.MD_EARN_WNE_P10.sort_values(ascending=False).head()
    Out[65]: INSTNM
             Sharon Regional Health System School of Nursing    PrivacySuppressed
             Northcoast Medical Training Academy                PrivacySuppressed
             Success Schools                                    PrivacySuppressed
             Louisiana Culinary Institute                       PrivacySuppressed
             Bais Medrash Toras Chesed                          PrivacySuppressed
             Name: MD_EARN_WNE_P10, dtype: object
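The descending sort puts 'PrivacySuppressed' first because an object column of strings is ordered lexicographically, not numerically. A minimal sketch of that behavior, using a small hypothetical Series (the school labels and values are made up, not taken from college.csv):

```python
import pandas as pd

# Hypothetical values: numeric strings mixed with a text marker, as in an object column
earnings = pd.Series(
    ['30300', '9500', 'PrivacySuppressed', '120400'],
    index=['School A', 'School B', 'School C', 'School D'],
    name='MD_EARN_WNE_P10',
)

# Strings are compared character by character, so 'P' sorts above any digit
# and '9500' sorts above '120400'
top = earnings.sort_values(ascending=False)
print(top)
```

This is why the column has to be converted to numbers before any maximum can be trusted.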
    # Use to_numeric to force a column to numeric; errors='coerce' turns unparseable strings into NaN
    In[66]: cols = ['MD_EARN_WNE_P10', 'GRAD_DEBT_MDN_SUPP']
            for col in cols:
                college[col] = pd.to_numeric(college[col], errors='coerce')
            college.dtypes.loc[cols]
    Out[66]: MD_EARN_WNE_P10       float64
             GRAD_DEBT_MDN_SUPP    float64
             dtype: object
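To see what errors='coerce' does to a marker such as 'PrivacySuppressed', here is a small sketch on a hypothetical DataFrame (the values are illustrative, not the real dataset):

```python
import pandas as pd

# Hypothetical data: numeric strings plus the text marker seen in the dataset
df = pd.DataFrame({
    'MD_EARN_WNE_P10': ['30300', 'PrivacySuppressed', '41000'],
    'GRAD_DEBT_MDN_SUPP': ['33888', '21941.5', 'PrivacySuppressed'],
})

for col in df.columns:
    # Values that cannot be parsed as numbers become missing instead of raising an error
    df[col] = pd.to_numeric(df[col], errors='coerce')

print(df.dtypes)
print(df.isna().sum())
```

Without errors='coerce', to_numeric raises an error on the first value it cannot parse, so the coercion is a deliberate choice to treat suppressed values as missing.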
    # Use select_dtypes to keep only the numeric columns (np is numpy, imported as np)
    In[67]: college_n = college.select_dtypes(include=[np.number])
            college_n.head()
    Out[67]:

6. Highlighting the maximum value of each column - Figure 1
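As a quick illustration of select_dtypes, the sketch below filters a tiny hypothetical DataFrame (column names borrowed from the dataset, values made up). Both integer and float columns count as np.number, while object columns are dropped:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one text column, one float column and one integer column
df = pd.DataFrame({
    'CITY': ['Normal', 'Birmingham'],
    'HBCU': [1.0, 0.0],
    'UGDS': [4206, 11383],
})

# np.number matches every numeric dtype (ints and floats alike)
numeric = df.select_dtypes(include=[np.number])
print(numeric.columns.tolist())
```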

    # Some columns hold only two distinct values; use nunique() to find them
    In[68]: criteria = college_n.nunique() == 2
            criteria.head()
    Out[68]: HBCU          True
             MENONLY       True
             WOMENONLY     True
             RELAFFIL      True
             SATVRMID     False
             dtype: bool
    # Pass the boolean Series to the indexing operator of the columns to build the list of binary columns
    In[69]: binary_cols = college_n.columns[criteria].tolist()
            binary_cols
    Out[69]: ['HBCU', 'MENONLY', 'WOMENONLY', 'RELAFFIL', 'DISTANCEONLY', 'CURROPER']
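The nunique() test works even when a binary column has missing values, because nunique() ignores NaN by default. A small sketch on hypothetical data (values are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: two 0/1 columns (one with a missing value) and one continuous column
df = pd.DataFrame({
    'HBCU': [1.0, 0.0, 0.0, np.nan],
    'CURROPER': [1, 1, 0, 1],
    'UGDS': [4206.0, 11383.0, 291.0, 5451.0],
})

# NaN is not counted as a distinct value unless dropna=False is passed
criteria = df.nunique() == 2
binary_cols = df.columns[criteria].tolist()
print(binary_cols)
```

Passing dropna=False would count NaN as a third value for HBCU here, so the default is the one that matches the intent of finding 0/1 columns.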
    # Use drop to remove these binary columns
    In[70]: college_n2 = college_n.drop(labels=binary_cols, axis='columns')
            college_n2.head()
    Out[70]:

6. Highlighting the maximum value of each column - Figure 2
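The drop call can also be written with the columns keyword, which some readers find easier to scan. A minimal sketch on a hypothetical frame showing that both spellings give the same result and leave the original untouched:

```python
import pandas as pd

# Hypothetical frame; HBCU plays the role of a binary column to remove
df = pd.DataFrame({
    'HBCU': [1.0, 0.0],
    'UGDS': [4206.0, 11383.0],
    'PCTPELL': [0.7356, 0.3460],
})
binary_cols = ['HBCU']

by_axis = df.drop(labels=binary_cols, axis='columns')  # spelling used in the recipe
by_keyword = df.drop(columns=binary_cols)              # equivalent keyword form

print(by_axis.equals(by_keyword))
```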

    # Use idxmax to get, for each column, the index label of the row holding its maximum
    In[71]: max_cols = college_n2.idxmax()
            max_cols
    Out[71]: SATVRMID                        California Institute of Technology
             SATMTMID                        California Institute of Technology
             UGDS                                 University of Phoenix-Arizona
             UGDS_WHITE                  Mr Leon's School of Hair Design-Moscow
                                                        ...
             PCTFLOAN                                    ABC Beauty College Inc
             UG25ABV                            Dongguk University-Los Angeles
             MD_EARN_WNE_P10                       Medical College of Wisconsin
             GRAD_DEBT_MDN_SUPP      Southwest University of Visual Arts-Tucson
             Length: 18, dtype: object
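idxmax returns labels rather than values: one index label per column. The sketch below uses a hypothetical three-row frame (labels and numbers are made up) and also shows that missing values are skipped:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric frame indexed by made-up school names
df = pd.DataFrame(
    {
        'SATVRMID': [765.0, 570.0, np.nan],
        'UGDS': [983.0, 11383.0, 151558.0],
    },
    index=['School A', 'School B', 'School C'],
)

# For each column, the index label of the row that holds the largest value
max_rows = df.idxmax()
print(max_rows)
```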
    # Use unique() to get the distinct index labels (institution names) in max_cols
    In[72]: unique_max_cols = max_cols.unique()
            unique_max_cols[:5]
    Out[72]: array(['California Institute of Technology',
                    'University of Phoenix-Arizona',
                    "Mr Leon's School of Hair Design-Moscow",
                    'Velvatex College of Beauty Culture',
                    'Thunderbird School of Global Management'], dtype=object)
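Deduplicating matters because one institution can hold the maximum of several columns, and selecting it twice with .loc would repeat its row. A sketch with a hypothetical idxmax result (labels are made up):

```python
import pandas as pd

# Hypothetical idxmax output: 'School A' and 'School C' each win two columns
max_cols = pd.Series(
    ['School A', 'School A', 'School C', 'School B', 'School C'],
    index=['SATVRMID', 'SATMTMID', 'UGDS', 'PCTPELL', 'UG25ABV'],
)

# unique() keeps the first occurrence of each label, in order of appearance
unique_rows = max_cols.unique()
print(list(unique_rows))
```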
    # Use unique_max_cols to select only the rows that hold at least one column maximum, then highlight them with the style attribute's highlight_max()
    In[73]: college_n2.loc[unique_max_cols].style.highlight_max()
    Out[73]:

6. Highlighting the maximum value of each column - Figure 3
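The highlighted output only renders in a notebook (and the style attribute needs the optional jinja2 package), so it can help to see which cells would be marked as plain booleans. The sketch below is not how the Styler is implemented; it is an equivalent mask computed on a hypothetical frame with made-up values:

```python
import pandas as pd

# Hypothetical frame standing in for the rows selected with .loc
df = pd.DataFrame(
    {
        'UGDS': [983.0, 151558.0, 291.0],
        'PCTPELL': [0.11, 0.46, 1.0],
    },
    index=['School A', 'School B', 'School C'],
)

# True where a cell equals the maximum of its own column
is_col_max = df.eq(df.max())
print(is_col_max)

# In a notebook, df.style.highlight_max() would color these same cells
```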

There's more

    # The axis parameter lets you highlight the maximum of each row instead
    In[74]: college = pd.read_csv('data/college.csv', index_col='INSTNM')
            college_ugds = college.filter(like='UGDS_').head()
            college_ugds.style.highlight_max(axis='columns')
    Out[74]:

6. Highlighting the maximum value of each column - Figure 4
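The row-wise variant can be checked without rendering anything. This sketch, on a hypothetical frame with made-up proportions, shows how filter(like='UGDS_') picks columns by substring and which column wins in each row:

```python
import pandas as pd

# Hypothetical frame: UGDS is a total, the UGDS_ columns are made-up proportions
df = pd.DataFrame(
    {
        'UGDS': [4206.0, 11383.0, 291.0],
        'UGDS_WHITE': [0.03, 0.59, 0.30],
        'UGDS_BLACK': [0.94, 0.26, 0.42],
        'UGDS_HISP': [0.01, 0.03, 0.01],
    },
    index=['School A', 'School B', 'School C'],
)

# Keep only columns whose name contains 'UGDS_' (the plain 'UGDS' column is excluded)
ugds = df.filter(like='UGDS_')

# Column label of the largest value in each row
row_max_col = ugds.idxmax(axis='columns')

# True where a cell equals the maximum of its own row
is_row_max = ugds.eq(ugds.max(axis='columns'), axis='index')
print(row_max_col)
```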

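    # One year expressed as a Timedelta; newer pandas releases reject unit='Y' because a year is not a fixed duration
    In[75]: pd.Timedelta(1, unit='Y')
    Out[75]: Timedelta('365 days 05:49:12')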
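The value 365 days 05:49:12 corresponds to the mean Gregorian year of 365.2425 days. A small sketch that rebuilds the same duration from explicit components, which avoids the unit='Y' argument entirely:

```python
import pandas as pd

# Build the duration shown in Out[75] from explicit, unambiguous components
year = pd.Timedelta(days=365, hours=5, minutes=49, seconds=12)

# Convert back to days: 86400 seconds per day
days_in_year = year.total_seconds() / 86400
print(year, days_in_year)
```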