Calculations with missing data
- Sum/Prod of Empties/Nans
- NA values in GroupBy

Calculations with missing data

Missing values propagate naturally through arithmetic operations between pandas objects.

In [27]: a
Out[27]: 
        one       two
a       NaN  0.501113
c       NaN  0.580967
e  0.057802  0.761948
f -0.443160 -0.974602
h -0.443160 -1.053898
In [28]: b
Out[28]: 
        one       two     three
a       NaN  0.501113 -0.355322
c       NaN  0.580967  0.983801
e  0.057802  0.761948 -0.712964
f -0.443160 -0.974602  1.047704
h       NaN -1.053898 -0.019369
In [29]: a + b
Out[29]: 
        one  three       two
a       NaN    NaN  1.002226
c       NaN    NaN  1.161935
e  0.115604    NaN  1.523896
f -0.886321    NaN -1.949205
h       NaN    NaN -2.107796
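
The propagation shown above can be reproduced with a minimal sketch. The frames and values below are hypothetical, not the ones from the session, and the `add(fill_value=0)` call is included only as one possible way to treat a one-sided missing value as zero:

```python
import numpy as np
import pandas as pd

# Hypothetical small frames; the values are illustrative only.
a = pd.DataFrame({"one": [np.nan, 1.0], "two": [2.0, 3.0]}, index=["a", "b"])
b = pd.DataFrame(
    {"one": [10.0, np.nan], "two": [20.0, 30.0], "three": [5.0, 6.0]},
    index=["a", "b"],
)

# Plain addition: a missing operand on either side yields NaN, and the
# column "three", which `a` does not have, is NaN in every row.
plain = a + b

# add() with fill_value substitutes 0 where exactly one operand is missing;
# a cell missing on both sides would still come out as NaN.
filled = a.add(b, fill_value=0)

print(plain)
print(filled)
```

Whether filling with zero is appropriate depends on what a missing value means in the data; the default propagation is the more conservative choice.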

The descriptive statistics and computational methods discussed in the data structure overview are all written to account for missing data. For example:

- When summing data, NA (missing) values will be treated as zero.
- If the data are all NA, the result will be 0.
- Cumulative methods like cumsum() and cumprod() ignore NA values by default, but preserve them in the resulting arrays. To override this behaviour and include NA values, use skipna=False.

In [30]: df
Out[30]: 
        one       two     three
a       NaN  0.501113 -0.355322
c       NaN  0.580967  0.983801
e  0.057802  0.761948 -0.712964
f -0.443160 -0.974602  1.047704
h       NaN -1.053898 -0.019369
In [31]: df['one'].sum()
Out[31]: -0.38535826528461409
In [32]: df.mean(1)
Out[32]: 
a    0.072895
c    0.782384
e    0.035595
f   -0.123353
h   -0.536633
dtype: float64
In [33]: df.cumsum()
Out[33]: 
        one       two     three
a       NaN  0.501113 -0.355322
c       NaN  1.082080  0.628479
e  0.057802  1.844028 -0.084485
f -0.385358  0.869426  0.963219
h       NaN -0.184472  0.943850
In [34]: df.cumsum(skipna=False)
Out[34]: 
   one       two     three
a  NaN  0.501113 -0.355322
c  NaN  1.082080  0.628479
e  NaN  1.844028 -0.084485
f  NaN  0.869426  0.963219
h  NaN -0.184472  0.943850
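
The same rules can be checked on a small Series. This is a sketch with made-up values, chosen so the results are easy to verify by hand:

```python
import numpy as np
import pandas as pd

# Hypothetical data: two missing entries among three real values.
s = pd.Series([np.nan, 1.0, 2.0, np.nan, 3.0])

total = s.sum()                          # NaN skipped: 1 + 2 + 3
strict_total = s.sum(skipna=False)       # any NaN makes the result NaN
mean = s.mean()                          # average over the 3 non-missing values

running = s.cumsum()                     # NaN positions kept, total carried past them
strict_running = s.cumsum(skipna=False)  # NaN from the first missing value onward

print(total, strict_total, mean)
print(running.tolist())
print(strict_running.tolist())
```

Because the first element is missing, the `skipna=False` cumulative sum is NaN in every position, which matches the behaviour of the `one` column in the session above.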

Sum/Prod of Empties/Nans

Warning

This behavior is now standard as of v0.22.0 and is consistent with the default in numpy; previously sum/prod of all-NA or empty Series/DataFrames would return NaN. See v0.22.0 whatsnew for more.

The sum of an empty or all-NA Series or column of a DataFrame is 0.

In [35]: pd.Series([np.nan]).sum()
Out[35]: 0.0
In [36]: pd.Series([]).sum()
Out[36]: 0.0

The product of an empty or all-NA Series or column of a DataFrame is 1.

In [37]: pd.Series([np.nan]).prod()
Out[37]: 1.0
In [38]: pd.Series([]).prod()
Out[38]: 1.0
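
If a NaN result is preferred for empty or all-NA input, the `min_count` parameter of `sum()` and `prod()` (introduced alongside this change) can be used. The sketch below assumes a reasonably recent pandas and passes an explicit `dtype` for the empty Series, which is an addition not present in the session above:

```python
import numpy as np
import pandas as pd

all_na = pd.Series([np.nan])
# Explicit dtype so the empty Series is float rather than inferred.
empty = pd.Series([], dtype="float64")

default_sum = all_na.sum()               # 0.0, the default described above
default_prod = empty.prod()              # 1.0

# min_count=1 requires at least one non-NA value, otherwise the result is NaN.
guarded_sum = all_na.sum(min_count=1)
guarded_prod = empty.prod(min_count=1)

# With at least one valid value, min_count=1 changes nothing.
partial_sum = pd.Series([np.nan, 2.0]).sum(min_count=1)

print(default_sum, default_prod, guarded_sum, guarded_prod, partial_sum)
```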

NA values in GroupBy

NA groups in GroupBy are automatically excluded, which is consistent with the behavior in R. For example:

In [39]: df
Out[39]: 
        one       two     three
a       NaN  0.501113 -0.355322
c       NaN  0.580967  0.983801
e  0.057802  0.761948 -0.712964
f -0.443160 -0.974602  1.047704
h       NaN -1.053898 -0.019369
In [40]: df.groupby('one').mean()
Out[40]: 
                two     three
one                          
-0.443160 -0.974602  1.047704
 0.057802  0.761948 -0.712964
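
A smaller illustration of the exclusion, using a hypothetical frame with a string key column. The `dropna=False` option shown at the end is an assumption about the installed version: it exists in pandas 1.1 and later, which is newer than the v0.22.0 release referenced in this section:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one row has a missing group key.
df = pd.DataFrame(
    {"key": ["x", np.nan, "x", "y"], "value": [1.0, 2.0, 3.0, 4.0]}
)

# Default: the row whose key is NaN does not form a group.
default_means = df.groupby("key")["value"].mean()

# pandas >= 1.1 only: keep the NaN key as its own group.
with_na = df.groupby("key", dropna=False)["value"].mean()

print(default_means)
print(with_na)
```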

See the groupby section for more information.