8. Uncovering Simpson's paradox in the diamonds dataset with Seaborn

    In[95]: pd.DataFrame(index=['Student A', 'Student B'],
                         data={'Raw Score': ['50/100', '80/100'],
                               'Percent Correct': [50, 80]},
                         columns=['Raw Score', 'Percent Correct'])
    Out[95]:

Figure 1: output of In[95]

    In[96]: pd.DataFrame(index=['Student A', 'Student B'],
                         data={'Difficult': ['45/95', '2/5'],
                               'Easy': ['5/5', '78/95'],
                               'Difficult Percent': [47, 40],
                               'Easy Percent': [100, 82],
                               'Total Percent': [50, 80]},
                         columns=['Difficult', 'Easy', 'Difficult Percent',
                                  'Easy Percent', 'Total Percent'])
    Out[96]:

Figure 2: output of In[96]
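The reversal in the two tables above can be checked with plain arithmetic. The following is a minimal sketch that recomputes the percentages from the raw counts shown in In[96]; the helper names `percent` and `summarize` are illustrative and not part of the original recipe.

```python
from fractions import Fraction

# Raw (correct, attempted) counts copied from the In[96] table
scores = {
    "Student A": {"Difficult": (45, 95), "Easy": (5, 5)},
    "Student B": {"Difficult": (2, 5), "Easy": (78, 95)},
}

def percent(correct, attempted):
    """Exact percentage of correct answers."""
    return 100 * Fraction(correct, attempted)

def summarize(by_level):
    """Per-level percentages plus the pooled total for one student."""
    result = {level: percent(*counts) for level, counts in by_level.items()}
    total_correct = sum(correct for correct, _ in by_level.values())
    total_attempted = sum(attempted for _, attempted in by_level.values())
    result["Total"] = percent(total_correct, total_attempted)
    return result

summary = {student: summarize(levels) for student, levels in scores.items()}

for student, row in summary.items():
    print(student, {name: round(float(value), 1) for name, value in row.items()})
```

Student A scores higher within each difficulty level yet lower overall, because Student A attempted mostly difficult questions while Student B attempted mostly easy ones.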

    # Read the diamonds dataset
    In[97]: diamonds = pd.read_csv('data/diamonds.csv')
            diamonds.head()
    Out[97]:

Figure 3: output of In[97]

    # Convert the cut, color, and clarity columns to ordered categoricals
    In[98]: cut_cats = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
            color_cats = ['J', 'I', 'H', 'G', 'F', 'E', 'D']
            clarity_cats = ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']
            diamonds['cut'] = pd.Categorical(diamonds['cut'],
                                             categories=cut_cats,
                                             ordered=True)
            diamonds['color'] = pd.Categorical(diamonds['color'],
                                               categories=color_cats,
                                               ordered=True)
            diamonds['clarity'] = pd.Categorical(diamonds['clarity'],
                                                 categories=clarity_cats,
                                                 ordered=True)
    In[99]: fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(14,4))
            sns.barplot(x='color', y='price', data=diamonds, ax=ax1)
            sns.barplot(x='cut', y='price', data=diamonds, ax=ax2)
            sns.barplot(x='clarity', y='price', data=diamonds, ax=ax3)
            fig.suptitle('Price Decreasing with Increasing Quality?')
    Out[99]: Text(0.5,0.98,'Price Decreasing with Increasing Quality?')

Figure 4: output of In[99]
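The ordered categoricals built in In[98] are what make the bar plots run from lowest to highest quality instead of alphabetically. The sketch below shows that behaviour on a few hypothetical `cut` values (they are not rows from diamonds.csv): sorting, comparison, and min/max all follow the declared category order.

```python
import pandas as pd

cut_cats = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']

# Hypothetical sample values for illustration, not rows from diamonds.csv
cut = pd.Series(pd.Categorical(['Ideal', 'Fair', 'Premium', 'Good'],
                               categories=cut_cats,
                               ordered=True))

# Sorting follows the declared category order, not alphabetical order
sorted_cut = cut.sort_values().tolist()
print(sorted_cut)

# Ordered categoricals can be compared against one of their category labels
at_least_premium = (cut >= 'Premium').tolist()
print(at_least_premium)

# min and max are defined because the categories are ordered
lowest, highest = cut.min(), cut.max()
print(lowest, highest)
```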

    # Plot price by color, with one panel per clarity level
    # (factorplot was renamed catplot in seaborn 0.9)
    In[100]: sns.catplot(x='color', y='price', col='clarity',
                         col_wrap=4, data=diamonds, kind='bar')
    Out[100]: <seaborn.axisgrid.FacetGrid at 0x11b61d5f8>

Figure 5: output of In[100]

    # Replace price with carat
    # (factorplot was renamed catplot in seaborn 0.9)
    In[101]: sns.catplot(x='color', y='carat', col='clarity',
                         col_wrap=4, data=diamonds, kind='bar')
    Out[101]: <seaborn.axisgrid.FacetGrid at 0x11e42eef0>

Figure 6: output of In[101]

    In[102]: fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(14,4))
             sns.barplot(x='color', y='carat', data=diamonds, ax=ax1)
             sns.barplot(x='cut', y='carat', data=diamonds, ax=ax2)
             sns.barplot(x='clarity', y='carat', data=diamonds, ax=ax3)
             fig.suptitle('Diamond size decreases with quality')
    Out[102]: Text(0.5,0.98,'Diamond size decreases with quality')

Figure 7: output of In[102]

    # The plot below shows that larger diamonds cost more
    # (factorplot was renamed catplot in seaborn 0.9)
    In[103]: diamonds['carat_category'] = pd.qcut(diamonds.carat, 5)
             from matplotlib.cm import Greys
             greys = Greys(np.arange(50,250,40))
             g = sns.catplot(x='clarity', y='price', data=diamonds,
                             hue='carat_category', col='color',
                             col_wrap=4, kind='point') # , palette=greys)
             g.fig.suptitle('Diamond price by size, color and clarity',
                            y=1.02, size=20)
    Out[103]: Text(0.5,1.02,'Diamond price by size, color and clarity')

Figure 8: output of In[103]
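In[103] relies on `pd.qcut` to split `carat` into five size groups before plotting. As a rough illustration of what that call produces, the sketch below bins a small hypothetical list of carat values (not taken from diamonds.csv) and counts the members of each bin.

```python
import pandas as pd

# Hypothetical carat values for illustration, not taken from diamonds.csv
carat = pd.Series([0.23, 0.31, 0.40, 0.52, 0.70,
                   0.90, 1.01, 1.20, 1.51, 2.00])

# qcut picks bin edges at quantiles, so each bin holds about the same number of rows
carat_category = pd.qcut(carat, 5)

# Count rows per bin, keeping the bins in ascending order
counts = carat_category.value_counts(sort=False)
print(counts)

# The result is an ordered categorical, so the hue levels keep their size order
is_ordered = carat_category.cat.ordered
print(is_ordered)
```

Unlike `pd.cut`, which uses equal-width intervals, quantile-based bins keep every size group populated, which helps when a variable such as carat is skewed.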

There's more...

    # Use seaborn's more advanced PairGrid constructor to plot bivariate relationships
    # (the size parameter was renamed height in seaborn 0.9)
    In[104]: g = sns.PairGrid(diamonds, height=5,
                              x_vars=["color", "cut", "clarity"],
                              y_vars=["price"])
             g.map(sns.barplot)
             g.fig.suptitle('Replication of Step 3 with PairGrid', y=1.02)
    Out[104]: Text(0.5,1.02,'Replication of Step 3 with PairGrid')

Figure 9: output of In[104]