1.5 Scipy：高级科学计算 - 1.5.6. 统计和随机数：scipy.stats - 《SciPy Lecture Notes 中文版（Python科学计算生态的介绍的中文翻译）》

1.5.6. 统计和随机数：scipy.stats

scipy.stats模块包含统计工具和随机过程的概率描述。在numpy.random中可以找到多个随机数生成器。

给定随机过程的观察值，它们的直方图是随机过程的PDF（概率密度函数）的估计值：

In [31]:

a = np.random.normal(size=1000)
bins = np.arange(-4, 5)
bins

Out[31]:

array([-4, -3, -2, -1,  0,  1,  2,  3,  4])

In [32]:

histogram = np.histogram(a, bins=bins, normed=True)[0]
bins = 0.5*(bins[1:] + bins[:-1])
bins

Out[32]:

array([-3.5, -2.5, -1.5, -0.5,  0.5,  1.5,  2.5,  3.5])

In [35]:

from scipy import stats
import pylab as pl
b = stats.norm.pdf(bins)  # norm 是一种分布
pl.plot(bins, histogram)
pl.plot(bins, b)

Out[35]:

[<matplotlib.lines.Line2D at 0x10764cd10>]

如果我们知道随机过程属于特定的随机过程家族，比如正态过程，我们可以做一个观察值的最大可能性拟合，来估计潜在分布的参数。这里我们用随机过程拟合观察数据：

In [5]:

loc, std = stats.norm.fit(a)
loc

Out[5]:

-0.063033073531050018

In [6]:

std

Out[6]:

0.97226620529973573

练习：概率分布

用shape参数为1的gamma分布生成1000个随机数，然后绘制那些样本的直方图。你可以在顶部绘制pdf（应该会匹配）吗？

额外信息：这些分布都有一些有用的方法。读一下文档字符串或者用IPython tab 完成来研究这些方法。你可以用在你的随机变量上使用fit方法来找回shape参数1吗？

中数是有一半值在其上一半值在其下的值：

In [7]:

np.median(a)

Out[7]:

-0.061271835457024623

中数也被称为百分位数50，因为50%的观察值在它之下：

In [8]:

stats.scoreatpercentile(a, 50)

Out[8]:

-0.061271835457024623

同样，我们也能计算百分位数90：

In [10]:

stats.scoreatpercentile(a, 90)

Out[10]:

1.1746952490791494

百分位数是CDF的估计值：累积分布函数。

统计检验是一个决策指示器。例如，如果我们有两组观察值，我们假设他们来自于高斯过程，我们可以用T检验来决定这两组观察值是不是显著不同：

In [11]:

a = np.random.normal(0, 1, size=100)
b = np.random.normal(1, 1, size=10)
stats.ttest_ind(a, b)

Out[11]:

(-2.8365663431591557, 0.0054465620169369703)

生成的结果由以下内容组成：