3.1.3.1 "Formulas" to specify statistical models in Python

3.1.3.1.1 A simple linear regression

Given two sets of observations, x and y, we want to test the hypothesis that y is a linear function of x. In other words:

$y = x \cdot \mathit{coef} + \mathit{intercept} + e$

where $e$ is observation noise. We will use the statsmodels module to:

  • Fit a linear model. We will use the simplest strategy, ordinary least squares (OLS).
  • Test that the coefficient is non-zero.


First, we generate simulated data according to the model:

In [9]:

  import numpy as np
  import pandas

  x = np.linspace(-5, 5, 20)
  np.random.seed(1)
  # normally distributed noise
  y = -5 + 3*x + 4 * np.random.normal(size=x.shape)
  # Create a data frame containing all the relevant variables
  data = pandas.DataFrame({'x': x, 'y': y})

Statistical formulas in Python

See the statsmodels documentation.

Then we specify an OLS model and fit it:

In [10]:

  from statsmodels.formula.api import ols
  model = ols("y ~ x", data).fit()

We can inspect the various statistics derived from the fit:

In [26]:

  print(model.summary())

                              OLS Regression Results
  ==============================================================================
  Dep. Variable:                      y   R-squared:                       0.804
  Model:                            OLS   Adj. R-squared:                  0.794
  Method:                 Least Squares   F-statistic:                     74.03
  Date:                Wed, 18 Nov 2015   Prob (F-statistic):           8.56e-08
  Time:                        17:10:03   Log-Likelihood:                -57.988
  No. Observations:                  20   AIC:                             120.0
  Df Residuals:                      18   BIC:                             122.0
  Df Model:                           1
  Covariance Type:            nonrobust
  ==============================================================================
                   coef    std err          t      P>|t|      [95.0% Conf. Int.]
  ------------------------------------------------------------------------------
  Intercept     -5.5335      1.036     -5.342      0.000        -7.710    -3.357
  x              2.9369      0.341      8.604      0.000         2.220     3.654
  ==============================================================================
  Omnibus:                        0.100   Durbin-Watson:                   2.956
  Prob(Omnibus):                  0.951   Jarque-Bera (JB):                0.322
  Skew:                          -0.058   Prob(JB):                        0.851
  Kurtosis:                       2.390   Cond. No.                         3.03
  ==============================================================================

  Warnings:
  [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Terminology:

Statsmodels uses statistical terminology: the y variable in statsmodels is called 'endogenous' while the x variable is called 'exogenous'. This is discussed in more detail in the statsmodels documentation.

To simplify, y (endogenous) is the value you are trying to predict, while x (exogenous) represents the features you are using to make that prediction.
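Where this terminology surfaces: the lower-level, non-formula API takes the endogenous and exogenous data as explicit arrays. A minimal sketch, assuming the x and y arrays generated above (sm.OLS and sm.add_constant are the standard statsmodels entry points):

  import statsmodels.api as sm

  # endog: the variable to predict; exog: the predictors.
  # add_constant() appends the intercept column that the formula
  # interface adds implicitly.
  exog = sm.add_constant(x)
  results = sm.OLS(y, exog).fit()   # signature: OLS(endog, exog)
  print(results.params)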

Exercise

Retrieve the estimated parameters from the model above. Hint: use tab-completion to find the relevant attributes.
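One possible solution, as a sketch: tab-completion leads to attributes such as params on the results object (these are standard statsmodels results attributes):

  # Estimated coefficients, indexed by term name
  print(model.params)      # Intercept and x, matching the summary table
  # A few other commonly used attributes
  print(model.rsquared)    # R-squared of the fit
  print(model.pvalues)     # p-values of the coefficients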

3.1.3.1.2 Categorical variables: comparing groups or multiple categories

Let us go back to the brain size data:

In [27]:

  data = pandas.read_csv('examples/brain_size.csv', sep=';', na_values=".")

We can write a comparison between the IQ of males and females using a linear model:

In [28]:

  model = ols("VIQ ~ Gender + 1", data).fit()
  print(model.summary())

                              OLS Regression Results
  ==============================================================================
  Dep. Variable:                    VIQ   R-squared:                       0.015
  Model:                            OLS   Adj. R-squared:                 -0.010
  Method:                 Least Squares   F-statistic:                    0.5969
  Date:                Wed, 18 Nov 2015   Prob (F-statistic):              0.445
  Time:                        17:34:10   Log-Likelihood:                -182.42
  No. Observations:                  40   AIC:                             368.8
  Df Residuals:                      38   BIC:                             372.2
  Df Model:                           1
  Covariance Type:            nonrobust
  ==================================================================================
                       coef    std err          t      P>|t|      [95.0% Conf. Int.]
  ----------------------------------------------------------------------------------
  Intercept       109.4500      5.308     20.619      0.000        98.704   120.196
  Gender[T.Male]    5.8000      7.507      0.773      0.445        -9.397    20.997
  ==================================================================================
  Omnibus:                       26.188   Durbin-Watson:                   1.709
  Prob(Omnibus):                  0.000   Jarque-Bera (JB):                3.703
  Skew:                           0.010   Prob(JB):                        0.157
  Kurtosis:                       1.510   Cond. No.                         2.62
  ==============================================================================

  Warnings:
  [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Tips on specifying models

Forcing categorical: 'Gender' is automatically detected as a categorical variable, and thus each of its distinct values is treated as a different entity. An integer column can be forced to be treated as categorical using:

In [29]:

  model = ols('VIQ ~ C(Gender)', data).fit()

Intercept: We can remove the intercept using -1 in the formula, or force the use of an intercept using +1.
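For instance, a minimal sketch of removing the intercept for the model above (only the formula string changes; with the intercept gone, the categorical variable is expanded into one indicator per level):

  # '- 1' removes the intercept: each Gender level now gets its own
  # coefficient, i.e. the per-group means instead of mean + offset
  model_no_intercept = ols('VIQ ~ C(Gender) - 1', data).fit()
  print(model_no_intercept.params)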

By default, statsmodels treats a categorical variable with K possible values as K-1 'dummy variables' (the reference level, by default the first, being absorbed into the intercept term). This is almost always a good default choice; however, it is possible to specify different encodings for categorical variables (http://statsmodels.sourceforge.net/devel/contrasts.html).
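As an illustrative sketch of one such alternative, the reference level of the default treatment coding can be chosen inline in the formula (Treatment is the patsy contrast class; picking 'Male' as the reference here is arbitrary):

  # Use 'Male' as the reference level instead of the default first level
  model_male_ref = ols("VIQ ~ C(Gender, Treatment(reference='Male'))", data).fit()
  print(model_male_ref.params)   # the offset is now Female relative to Male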

A t-test on the difference between FSIQ and PIQ

To compare different types of IQ, we need to create a "long-form" table, in which the IQ type is identified by a categorical variable:

In [30]:

  data_fisq = pandas.DataFrame({'iq': data['FSIQ'], 'type': 'fsiq'})
  data_piq = pandas.DataFrame({'iq': data['PIQ'], 'type': 'piq'})
  data_long = pandas.concat((data_fisq, data_piq))
  print(data_long)

       iq  type
   0  133  fsiq
   1  140  fsiq
   2  139  fsiq
   3  133  fsiq
   4  137  fsiq
   5   99  fsiq
   6  138  fsiq
   7   92  fsiq
   8   89  fsiq
   9  133  fsiq
  10  132  fsiq
  11  141  fsiq
  12  135  fsiq
  13  140  fsiq
  14   96  fsiq
  15   83  fsiq
  16  132  fsiq
  17  100  fsiq
  18  101  fsiq
  19   80  fsiq
  20   83  fsiq
  21   97  fsiq
  22  135  fsiq
  23  139  fsiq
  24   91  fsiq
  25  141  fsiq
  26   85  fsiq
  27  103  fsiq
  28   77  fsiq
  29  130  fsiq
  ..  ...   ...
  10  124   piq
  11  128   piq
  12  124   piq
  13  147   piq
  14   90   piq
  15   96   piq
  16  120   piq
  17  102   piq
  18   84   piq
  19   86   piq
  20   86   piq
  21   84   piq
  22  134   piq
  23  128   piq
  24  102   piq
  25  131   piq
  26   84   piq
  27  110   piq
  28   72   piq
  29  124   piq
  30  132   piq
  31  137   piq
  32  110   piq
  33   86   piq
  34   81   piq
  35  128   piq
  36  124   piq
  37   94   piq
  38   74   piq
  39   89   piq

  [80 rows x 2 columns]

In [31]:

  model = ols("iq ~ type", data_long).fit()
  print(model.summary())

                              OLS Regression Results
  ==============================================================================
  Dep. Variable:                     iq   R-squared:                       0.003
  Model:                            OLS   Adj. R-squared:                 -0.010
  Method:                 Least Squares   F-statistic:                    0.2168
  Date:                Wed, 18 Nov 2015   Prob (F-statistic):              0.643
  Time:                        18:16:40   Log-Likelihood:                -364.35
  No. Observations:                  80   AIC:                             732.7
  Df Residuals:                      78   BIC:                             737.5
  Df Model:                           1
  Covariance Type:            nonrobust
  ===============================================================================
                    coef    std err          t      P>|t|      [95.0% Conf. Int.]
  -------------------------------------------------------------------------------
  Intercept     113.4500      3.683     30.807      0.000       106.119   120.781
  type[T.piq]    -2.4250      5.208     -0.466      0.643       -12.793     7.943
  ===============================================================================
  Omnibus:                      164.598   Durbin-Watson:                   1.531
  Prob(Omnibus):                  0.000   Jarque-Bera (JB):                8.062
  Skew:                          -0.110   Prob(JB):                       0.0178
  Kurtosis:                       1.461   Cond. No.                         2.62
  ==============================================================================

  Warnings:
  [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

We can see that we obtain the same t-statistic, and the same corresponding p-value for the effect of the IQ type, as with the previous t-test:

In [32]:

  from scipy import stats
  stats.ttest_ind(data['FSIQ'], data['PIQ'])

Out[32]:

  (0.46563759638096403, 0.64277250094148408)
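As a cross-check sketch, the same t-statistic and p-value can be read directly off the fitted model from In [31]; tvalues and pvalues are standard attributes of statsmodels results (the sign flips because the dummy coding measures piq relative to fsiq):

  print(model.tvalues['type[T.piq]'])   # -0.4656..., sign flipped
  print(model.pvalues['type[T.piq]'])   #  0.6428..., same p-value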