Datasets

Machine learning datasets / Example 3: The iris dataset

http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html

This example introduces the iris dataset, one of the sample datasets for machine learning.

(1) Import the libraries and the built-in iris dataset

    # This line is specific to the IPython notebook interface; remove it in other environments
    %matplotlib inline
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import Axes3D
    from sklearn import datasets
    from sklearn.decomposition import PCA

    # import some data to play with
    iris = datasets.load_iris()
    X = iris.data[:, :2]  # we only take the first two features.
    Y = iris.target

    # pad the axis limits by 0.5 on each side
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5

    plt.figure(2, figsize=(8, 6))
    plt.clf()

    # Plot the training points
    plt.scatter(X[:, 0], X[:, 1], c=Y, cmap=plt.cm.Paired)
    plt.xlabel('Sepal length')
    plt.ylabel('Sepal width')

    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
    plt.xticks(())
    plt.yticks(())

[Figure: scatter plot of sepal length against sepal width, colored by iris class]
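Outside a notebook the `%matplotlib inline` line is not valid Python, so the same scatter plot has to be displayed or saved explicitly. The following is a minimal sketch of a script version, assuming a non-interactive `Agg` backend and a hypothetical output file name (`iris_sepal_scatter.png` in the system temporary directory); neither choice comes from the original example.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: render without a display (no notebook, no GUI)
import matplotlib.pyplot as plt
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data[:, :2]  # first two features: sepal length and sepal width
Y = iris.target

fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(X[:, 0], X[:, 1], c=Y, cmap=plt.cm.Paired)
ax.set_xlabel("Sepal length")
ax.set_ylabel("Sepal width")

# Hypothetical output location; any writable path would do.
out_path = os.path.join(tempfile.gettempdir(), "iris_sepal_scatter.png")
fig.savefig(out_path)
plt.close(fig)
print("saved to", out_path)
```

In an interactive desktop session, calling `plt.show()` instead of `fig.savefig(...)` would open the figure in a window.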

(2) Introduction to the dataset

iris = datasets.load_iris() stores a dict-like object in iris. We can inspect its contents with the following code:

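    for key, value in iris.items():
        try:
            # arrays have a shape; print it next to the key
            print(key, value.shape)
        except AttributeError:
            # entries without a shape: print the key only
            print(key)
    print(iris['feature_names'])

| Output | Description |
| --- | --- |
| ('target_names', (3L,)) | Three iris species: setosa, versicolor, virginica |
| ('data', (150L, 4L)) | 150 samples with four features each |
| ('target', (150L,)) | The species of each of the 150 samples |
| DESCR | Description of the dataset |
| feature_names | Meaning of the four features: length and width of the sepal, and length and width of the petal |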
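To see how these entries relate to one another, the short sketch below counts the samples per species and pairs the first sample's feature values with the feature names. The variable names (`samples_per_class`, `first_features`, `first_species`) are introduced here for illustration only.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Count how many samples carry each class label (0, 1, 2).
counts = np.bincount(iris.target)

# Pair each species name with its sample count.
samples_per_class = {
    str(name): int(n) for name, n in zip(iris.target_names, counts)
}
print(samples_per_class)

# Look at the first sample: feature name -> value, plus its species name.
first_features = dict(zip(iris.feature_names, iris.data[0].tolist()))
first_species = str(iris.target_names[iris.target[0]])
print(first_features, first_species)
```

The integer labels in `target` index directly into `target_names`, which is why `iris.target_names[iris.target[0]]` gives the species of the first sample.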

To visualize this dataset, the code below first uses the PCA algorithm to reduce the data to three dimensions.

    X_reduced = PCA(n_components=3).fit_transform(iris.data)
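It can help to check how much of the original variance the three retained components keep. The sketch below fits the same PCA but holds on to the fitted object so that its `explained_variance_ratio_` attribute can be read; the variable names `pca` and `retained` are introduced here for illustration.

```python
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()

# Keep the fitted PCA object so its attributes can be inspected afterwards.
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(iris.data)

# One row per sample, one column per retained component.
print(X_reduced.shape)

# Fraction of the total variance explained by each component, largest first.
print(pca.explained_variance_ratio_)

# Total fraction of variance kept after dropping the fourth dimension.
retained = float(pca.explained_variance_ratio_.sum())
print(retained)
```

A value of `retained` close to 1 suggests that little information is lost by plotting three components instead of the four original features.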

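Next, mpl_toolkits.mplot3d.Axes3D is used to create a 3D plotting space, and scatter plots each sample using its three component values as coordinates, with the class labels Y determining the color of each point. The plot shows that one of the three iris species is clearly distinguishable from the other two, while those two cannot be clearly told apart from each other.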

    # To get a better understanding of the interaction of the dimensions,
    # plot the first three PCA dimensions
    fig = plt.figure(1, figsize=(8, 6))
    # Recent matplotlib versions no longer attach Axes3D(fig, ...) to the figure,
    # so the 3D axes (an Axes3D instance) are created through the figure itself
    ax = fig.add_subplot(111, projection="3d", elev=-150, azim=110)
    ax.scatter(X_reduced[:, 0], X_reduced[:, 1], X_reduced[:, 2], c=Y,
               cmap=plt.cm.Paired)
    ax.set_title("First three PCA directions")
    ax.set_xlabel("1st eigenvector")
    ax.xaxis.set_ticklabels([])  # replaces the deprecated w_xaxis alias
    ax.set_ylabel("2nd eigenvector")
    ax.yaxis.set_ticklabels([])
    ax.set_zlabel("3rd eigenvector")
    ax.zaxis.set_ticklabels([])
    plt.show()

[Figure: 3D scatter plot titled "First three PCA directions", colored by iris class]
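The visual impression that one species stands apart can also be checked numerically. The sketch below is a deliberately simpler stand-in for the PCA plot: it compares the value ranges of a single raw feature, petal length (column index 2), across the three species. The helper `overlaps` and the dictionary `ranges` are hypothetical names used only for this illustration.

```python
from sklearn import datasets

iris = datasets.load_iris()
petal_length = iris.data[:, 2]  # third column: petal length in cm

# Record the (min, max) petal length for each species.
ranges = {}
for label, name in enumerate(iris.target_names):
    values = petal_length[iris.target == label]
    ranges[str(name)] = (float(values.min()), float(values.max()))


def overlaps(a, b):
    """Return True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


for name, (low, high) in ranges.items():
    print(f"{name:12s} petal length from {low} to {high}")

print("setosa vs versicolor overlap:", overlaps(ranges["setosa"], ranges["versicolor"]))
print("versicolor vs virginica overlap:", overlaps(ranges["versicolor"], ranges["virginica"]))
```

Non-overlapping ranges on one feature are enough to separate a class with a single threshold, whereas overlapping ranges on this feature do not by themselves prove that two classes are inseparable in the full four-dimensional space.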

    # Next, print the description that ships with this dataset
    print(iris['DESCR'])

Output:

    Iris Plants Database

    Notes
    -----
    Data Set Characteristics:
        :Number of Instances: 150 (50 in each of three classes)
        :Number of Attributes: 4 numeric, predictive attributes and the class
        :Attribute Information:
            - sepal length in cm
            - sepal width in cm
            - petal length in cm
            - petal width in cm
            - class:
                    - Iris-Setosa
                    - Iris-Versicolour
                    - Iris-Virginica
        :Summary Statistics:

        ============== ==== ==== ======= ===== ====================
                        Min  Max   Mean    SD   Class Correlation
        ============== ==== ==== ======= ===== ====================
        sepal length:   4.3  7.9   5.84   0.83    0.7826
        sepal width:    2.0  4.4   3.05   0.43   -0.4194
        petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
        petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
        ============== ==== ==== ======= ===== ====================

        :Missing Attribute Values: None
        :Class Distribution: 33.3% for each of 3 classes.
        :Creator: R.A. Fisher
        :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
        :Date: July, 1988

    This is a copy of UCI ML iris datasets.
    http://archive.ics.uci.edu/ml/datasets/Iris

    The famous Iris database, first used by Sir R.A Fisher

    This is perhaps the best known database to be found in the
    pattern recognition literature. Fisher's paper is a classic in the field and
    is referenced frequently to this day. (See Duda & Hart, for example.) The
    data set contains 3 classes of 50 instances each, where each class refers to a
    type of iris plant. One class is linearly separable from the other 2; the
    latter are NOT linearly separable from each other.

    References
    ----------
       - Fisher,R.A. "The use of multiple measurements in taxonomic problems"
         Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
         Mathematical Statistics" (John Wiley, NY, 1950).
       - Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.
         (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
       - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
         Structure and Classification Rule for Recognition in Partially Exposed
         Environments". IEEE Transactions on Pattern Analysis and Machine
         Intelligence, Vol. PAMI-2, No. 1, 67-71.
       - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
         on Information Theory, May 1972, 431-433.
       - See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II
         conceptual clustering system finds 3 classes in the data.
       - Many, many more ...

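The description explains that this dataset was created by Fisher in 1936 and is an important classic example in the field of pattern recognition. It uses four features to classify three species of iris.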
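The summary statistics in the description can be recomputed from the data itself. The sketch below is one way to do so, assuming the population standard deviation (NumPy's default, `ddof=0`) and the Pearson correlation between each feature and the integer class label; the description does not state which conventions it used, so small differences from the printed table are possible, and the exact values may also depend on the installed scikit-learn version. The `stats` dictionary is a name introduced here for illustration.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Collect per-feature statistics keyed by feature name.
stats = {}
for i, name in enumerate(iris.feature_names):
    col = X[:, i]
    stats[name] = {
        "min": float(col.min()),
        "max": float(col.max()),
        "mean": float(col.mean()),
        "sd": float(col.std()),  # assumption: population SD (ddof=0)
        # Pearson correlation between the feature and the class label 0/1/2.
        "class_corr": float(np.corrcoef(col, y)[0, 1]),
    }

for name, s in stats.items():
    print(
        f"{name:18s} min={s['min']:.1f} max={s['max']:.1f} "
        f"mean={s['mean']:.2f} sd={s['sd']:.2f} corr={s['class_corr']:.4f}"
    )
```

The two petal features should show the strongest correlation with the class label, which matches the "(high!)" remarks in the description.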

(3) Application examples

Among the scikit-learn application examples, the following use this iris dataset.