Feature Selection / Example 6: Univariate Feature Selection


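This example demonstrates univariate feature selection. Several noise features (features that carry no information about the target) are added to the iris data, and univariate feature selection is then applied. For each feature, the example plots the p-value from the univariate test alongside that feature's weight in a support vector machine (SVM). The plot shows that univariate selection picks out the informative features, and that these same features receive large weights in the SVM.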

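  1. Dataset: iris
  2. Features: sepal length and width, petal length and width
  3. Prediction target: three iris species (setosa, versicolor, virginica)
  4. Machine learning method: linear classification
  5. Focus: use univariate selection (SelectPercentile) to pick the training features, and compare the result with a classifier trained on all features
  6. Key function: sklearn.feature_selection.SelectPercentile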
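
The comparison named in item 5 can also be checked numerically. The following is a minimal sketch, not part of the original example: it assumes 5-fold cross-validation, a fixed random seed, and a pipeline so that the selector is refit on each training fold. The names `clf_all`, `clf_selected`, `scores_all`, and `scores_selected` are illustrative.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Assumption: a fixed seed so the noise columns are reproducible.
rng = np.random.default_rng(0)

iris = datasets.load_iris()
noise = rng.uniform(0, 0.1, size=(len(iris.data), 20))
X = np.hstack((iris.data, noise))
y = iris.target

# Baseline: linear SVM trained on all 24 features (4 real + 20 noise).
clf_all = svm.SVC(kernel='linear')

# Selection followed by the same SVM; the pipeline refits the selector
# on each training fold, so the held-out fold never influences selection.
clf_selected = make_pipeline(
    SelectPercentile(f_classif, percentile=10),
    svm.SVC(kernel='linear'),
)

scores_all = cross_val_score(clf_all, X, y, cv=5)
scores_selected = cross_val_score(clf_selected, X, y, cv=5)

print("all features:      %.3f" % scores_all.mean())
print("selected features: %.3f" % scores_selected.mean())
```

The two mean accuracies depend on the random noise and on the fold split, so they are best read as a rough comparison rather than a fixed result.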



    # import some data to play with
    # The iris dataset
    iris = datasets.load_iris()
    # Some noisy data not correlated
    E = np.random.uniform(0, 0.1, size=(len(iris.data), 20))
    # Add the noisy data to the informative features
    X = np.hstack((iris.data, E))
    y = iris.target



    ###############################################################################
    # Univariate feature selection with F-test for feature scoring
    # We use the default selection function: the 10% most significant features
    selector = SelectPercentile(f_classif, percentile=10)
    selector.fit(X, y)
    # Convert p-values to scores (smaller p-value, larger score), scaled to [0, 1]
    scores = -np.log10(selector.pvalues_)
    scores /= scores.max()
    # X_indices is defined in the full script as np.arange(X.shape[-1])
    plt.bar(X_indices - .45, scores, width=.2,
            label=r'Univariate score ($-Log(p_{value})$)', color='g')
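
To see which columns the selector actually keeps, the fitted object can be inspected directly. This is a small sketch under assumptions that are not in the original: a fixed seed for the noise, and the illustrative name `selected`. It relies only on `get_support()`, `pvalues_`, and `transform()`.

```python
import numpy as np
from sklearn import datasets
from sklearn.feature_selection import SelectPercentile, f_classif

# Assumption: a fixed seed so the noise columns are reproducible.
rng = np.random.default_rng(0)

iris = datasets.load_iris()
noise = rng.uniform(0, 0.1, size=(len(iris.data), 20))
X = np.hstack((iris.data, noise))
y = iris.target

selector = SelectPercentile(f_classif, percentile=10).fit(X, y)

# get_support() returns a boolean mask over the columns of X;
# flatnonzero turns it into the indices of the kept columns.
selected = np.flatnonzero(selector.get_support())

print("selected column indices:", selected)
print("p-values of selected columns:", selector.pvalues_[selected])
print("shape after transform:", selector.transform(X).shape)
```

Columns 0 to 3 are the original iris measurements, so any selected index in that range is a real feature rather than noise.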



    ###############################################################################
    # Compare to the weights of an SVM
    clf = svm.SVC(kernel='linear')
    clf.fit(X, y)
    svm_weights = (clf.coef_ ** 2).sum(axis=0)
    svm_weights /= svm_weights.max()
    plt.bar(X_indices - .25, svm_weights, width=.2, label='SVM weight', color='r')
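
The expression `(clf.coef_ ** 2).sum(axis=0)` may deserve a closer look. With three classes, a linear `SVC` trains one classifier per pair of classes, so `coef_` has one row per pair. Squaring and summing over the rows gives a single non-negative weight per feature. The sketch below only illustrates the shapes involved; the fixed seed is an assumption added for reproducibility.

```python
import numpy as np
from sklearn import datasets, svm

# Assumption: a fixed seed so the noise columns are reproducible.
rng = np.random.default_rng(0)

iris = datasets.load_iris()
noise = rng.uniform(0, 0.1, size=(len(iris.data), 20))
X = np.hstack((iris.data, noise))
y = iris.target

clf = svm.SVC(kernel='linear').fit(X, y)

# One row per pair of classes (3 pairs), one column per feature (24).
print("coef_ shape:", clf.coef_.shape)

# Collapse the pairwise rows into one weight per feature,
# then scale so the largest weight equals 1.
svm_weights = (clf.coef_ ** 2).sum(axis=0)
svm_weights /= svm_weights.max()
print("weights shape:", svm_weights.shape)
```

Scaling by the maximum puts the SVM weights on the same 0 to 1 range as the univariate scores, which is what makes the bars comparable in one plot.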



    # Refit the SVM on the selected features only
    clf_selected = svm.SVC(kernel='linear')
    clf_selected.fit(selector.transform(X), y)
    svm_weights_selected = (clf_selected.coef_ ** 2).sum(axis=0)
    svm_weights_selected /= svm_weights_selected.max()
    # Draw the bars at the original positions of the selected features
    plt.bar(X_indices[selector.get_support()] - .05, svm_weights_selected,
            width=.2, label='SVM weights after selection', color='b')


Python source code: plot_feature_selection.py

    print(__doc__)
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import datasets, svm
    from sklearn.feature_selection import SelectPercentile, f_classif
    ###############################################################################
    # import some data to play with
    # The iris dataset
    iris = datasets.load_iris()
    # Some noisy data not correlated
    E = np.random.uniform(0, 0.1, size=(len(iris.data), 20))
    # Add the noisy data to the informative features
    X = np.hstack((iris.data, E))
    y = iris.target
    ###############################################################################
    plt.figure(1)
    plt.clf()
    X_indices = np.arange(X.shape[-1])
    ###############################################################################
    # Univariate feature selection with F-test for feature scoring
    # We use the default selection function: the 10% most significant features
    selector = SelectPercentile(f_classif, percentile=10)
    selector.fit(X, y)
    scores = -np.log10(selector.pvalues_)
    scores /= scores.max()
    plt.bar(X_indices - .45, scores, width=.2,
            label=r'Univariate score ($-Log(p_{value})$)', color='g')
    ###############################################################################
    # Compare to the weights of an SVM
    clf = svm.SVC(kernel='linear')
    clf.fit(X, y)
    svm_weights = (clf.coef_ ** 2).sum(axis=0)
    svm_weights /= svm_weights.max()
    plt.bar(X_indices - .25, svm_weights, width=.2, label='SVM weight', color='r')
    clf_selected = svm.SVC(kernel='linear')
    clf_selected.fit(selector.transform(X), y)
    svm_weights_selected = (clf_selected.coef_ ** 2).sum(axis=0)
    svm_weights_selected /= svm_weights_selected.max()
    plt.bar(X_indices[selector.get_support()] - .05, svm_weights_selected,
            width=.2, label='SVM weights after selection', color='b')
    plt.title("Comparing feature selection")
    plt.xlabel('Feature number')
    plt.yticks(())
    plt.axis('tight')
    plt.legend(loc='upper right')
    plt.show()