1. Planning a data analysis routine

    # Read the data and take a first look
    # (assumes pandas is imported as pd and numpy as np in an earlier cell)
    In[2]: college = pd.read_csv('data/college.csv')
    In[3]: college.head()
    Out[3]:

1. Planning a data analysis routine - Figure 1

    # Number of rows and columns
    In[4]: college.shape
    Out[4]: (7535, 27)

    # Summary statistics for the numeric columns, transposed
    In[5]: with pd.option_context('display.max_rows', 8):
               display(college.describe(include=[np.number]).T)
    Out[5]:

1. Planning a data analysis routine - Figure 2
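The `pd.option_context` call in In[5] changes a display option only for the duration of the `with` block. A minimal sketch of that behavior, independent of the college data, is shown here. Outside Jupyter, `display` is not defined by default, so `print` can be used in its place.

```python
import pandas as pd

# Remember whatever the current setting is
before = pd.get_option('display.max_rows')

# Inside the block the option is temporarily set to 8
with pd.option_context('display.max_rows', 8):
    inside = pd.get_option('display.max_rows')

# After the block the previous setting is restored
after = pd.get_option('display.max_rows')

print(inside)
print(after == before)
```

This is why the later cells that repeat the same `describe` call without the context manager show more rows.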

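    # Summary statistics for the object and categorical columns
    # (np.object was removed in NumPy 1.24; the built-in object and 'category' select the same columns)
    In[6]: college.describe(include=[object, 'category']).T
    Out[6]: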

1. Planning a data analysis routine - Figure 3
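To see what the two `describe` calls return without the college file at hand, here is a small sketch on a made-up four-row DataFrame. The data and the name `toy` are hypothetical; only the column names are borrowed from the dataset. The built-in `object` is used in place of `np.object`, which recent NumPy releases no longer provide.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for college.csv
toy = pd.DataFrame({
    'INSTNM': ['A College', 'B University', 'C Institute', 'D College'],
    'STABBR': ['AL', 'AL', 'TX', 'CA'],
    'UGDS': [4206.0, 11383.0, np.nan, 5451.0],
    'RELAFFIL': [0, 0, 1, 0],
}).astype({'INSTNM': object, 'STABBR': object})  # force object dtype for the text columns

# Numeric columns: count, mean, std, min, quartiles, max (one row per column after .T)
numeric_summary = toy.describe(include=[np.number]).T

# Object columns: count, unique, top (most frequent value), freq
object_summary = toy.describe(include=[object]).T

print(numeric_summary)
print(object_summary)
```

Note that `count` excludes missing values, so `UGDS` reports 3 rather than 4 here.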

    # List each column's data type, number of non-missing values, and memory usage
    In[7]: college.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 7535 entries, 0 to 7534
    Data columns (total 27 columns):
    INSTNM                7535 non-null object
    CITY                  7535 non-null object
    STABBR                7535 non-null object
    HBCU                  7164 non-null float64
    MENONLY               7164 non-null float64
    WOMENONLY             7164 non-null float64
    RELAFFIL              7535 non-null int64
    SATVRMID              1185 non-null float64
    SATMTMID              1196 non-null float64
    DISTANCEONLY          7164 non-null float64
    UGDS                  6874 non-null float64
    UGDS_WHITE            6874 non-null float64
    UGDS_BLACK            6874 non-null float64
    UGDS_HISP             6874 non-null float64
    UGDS_ASIAN            6874 non-null float64
    UGDS_AIAN             6874 non-null float64
    UGDS_NHPI             6874 non-null float64
    UGDS_2MOR             6874 non-null float64
    UGDS_NRA              6874 non-null float64
    UGDS_UNKN             6874 non-null float64
    PPTUG_EF              6853 non-null float64
    CURROPER              7535 non-null int64
    PCTPELL               6849 non-null float64
    PCTFLOAN              6849 non-null float64
    UG25ABV               6718 non-null float64
    MD_EARN_WNE_P10       6413 non-null object
    GRAD_DEBT_MDN_SUPP    7503 non-null object
    dtypes: float64(20), int64(2), object(5)
    memory usage: 1.6+ MB

    # Same as In[5], but without limiting the number of displayed rows
    In[8]: college.describe(include=[np.number]).T
    Out[8]:

1. Planning a data analysis routine - Figure 4

    # Same as In[6]
    # (np.object was removed in NumPy 1.24; the built-in object and 'category' select the same columns)
    In[9]: college.describe(include=[object, 'category']).T
    Out[9]:

1. Planning a data analysis routine - Figure 5
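Much of what `info()` prints can also be collected into a DataFrame, which is easier to sort and filter than printed text. The helper below is a sketch, not part of pandas: `column_overview` and the toy data are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

def column_overview(df):
    """Return one row per column: dtype, non-null count, missing count, bytes used."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'non_null': df.count(),
        'missing': df.isna().sum(),
        # deep=True also measures the Python strings held in object columns
        'bytes': df.memory_usage(index=False, deep=True),
    })

# Hypothetical toy data standing in for college.csv
toy = pd.DataFrame({
    'INSTNM': ['A College', 'B University', 'C Institute', 'D College'],
    'UGDS': [4206.0, 11383.0, np.nan, 5451.0],
    'RELAFFIL': [0, 0, 1, 0],
})

overview = column_overview(toy)

# Columns with the most missing values first
print(overview.sort_values('missing', ascending=False))
```

On the real dataset, sorting by `missing` would bring sparsely populated columns such as `SATVRMID` and `SATMTMID` to the top.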

There's more

    # Request additional percentiles from describe
    In[10]: with pd.option_context('display.max_rows', 5):
                display(college.describe(include=[np.number],
                                         percentiles=[.01, .05, .10, .25, .5, .75, .9, .95, .99]).T)

1. Planning a data analysis routine - Figure 6
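The effect of the `percentiles` argument is easiest to check on data where the answer is known in advance. The sketch below uses a hypothetical Series holding the integers 0 through 100, so each requested percentile should land on the matching value.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 0, 1, ..., 100
values = pd.Series(np.arange(101), dtype='float64')

# Each requested percentile becomes a row labelled '1%', '5%', and so on
summary = values.describe(percentiles=[.01, .05, .25, .5, .75, .95, .99])
print(summary)
```

The extreme percentiles (1% and 99%) are useful for spotting outliers that `min` and `max` alone would hide.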

    # Load a data dictionary; its main purpose is to explain what each column name means
    In[11]: college_dd = pd.read_csv('data/college_data_dictionary.csv')
    In[12]: with pd.option_context('display.max_rows', 8):
                display(college_dd)

1. Planning a data analysis routine - Figure 7
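Once a data dictionary is loaded, it can be turned into a lookup from column name to description. The sketch below assumes the dictionary has two columns named `column_name` and `description`; that layout, the two rows, and their wording are hypothetical and may not match the real `college_data_dictionary.csv`.

```python
import pandas as pd

# Hypothetical two-row data dictionary (the real file's layout may differ)
dd = pd.DataFrame({
    'column_name': ['INSTNM', 'UGDS'],
    'description': ['Institution name', 'Undergraduate enrollment'],
})

# Series indexed by column name, so a description can be looked up directly
lookup = dd.set_index('column_name')['description']
print(lookup['UGDS'])

# Columns that have no dictionary entry come back as NaN
columns = ['INSTNM', 'UGDS', 'CITY']
described = lookup.reindex(columns)
print(described)
```

Running the `reindex` step against `college.columns` would be one way to check whether every column in the dataset is documented.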