Flexible binary operations

With binary operations between pandas data structures, there are two key points of interest:

  • Broadcasting behavior between higher- (e.g. DataFrame) and lower-dimensional (e.g. Series) objects.
  • Missing data in computations.

We will demonstrate how to manage these issues independently, though they can be handled simultaneously.

Matching / broadcasting behavior

DataFrame has the methods add(), sub(), mul(), div() and related functions radd(), rsub(), … for carrying out binary operations. For broadcasting behavior, Series input is of primary interest. Using these functions, you can match on either the index or the columns via the axis keyword:

  In [14]: df = pd.DataFrame({'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
     ....:                    'two': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
     ....:                    'three': pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
     ....:

  In [15]: df
  Out[15]:
          one       two     three
  a -1.101558  1.124472       NaN
  b -0.177289  2.487104 -0.634293
  c  0.462215 -0.486066  1.931194
  d       NaN -0.456288 -1.222918

  In [16]: row = df.iloc[1]

  In [17]: column = df['two']

  In [18]: df.sub(row, axis='columns')
  Out[18]:
          one       two     three
  a -0.924269 -1.362632       NaN
  b  0.000000  0.000000  0.000000
  c  0.639504 -2.973170  2.565487
  d       NaN -2.943392 -0.588625

  In [19]: df.sub(row, axis=1)
  Out[19]:
          one       two     three
  a -0.924269 -1.362632       NaN
  b  0.000000  0.000000  0.000000
  c  0.639504 -2.973170  2.565487
  d       NaN -2.943392 -0.588625

  In [20]: df.sub(column, axis='index')
  Out[20]:
          one  two     three
  a -2.226031  0.0       NaN
  b -2.664393  0.0 -3.121397
  c  0.948280  0.0  2.417260
  d       NaN  0.0 -0.766631

  In [21]: df.sub(column, axis=0)
  Out[21]:
          one  two     three
  a -2.226031  0.0       NaN
  b -2.664393  0.0 -3.121397
  c  0.948280  0.0  2.417260
  d       NaN  0.0 -0.766631
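
As a small deterministic illustration of the axis keyword, the sketch below uses a made-up frame (the values are assumptions chosen so the results can be checked by hand, not data from the session above). Subtracting a row aligns on column labels, while subtracting a column aligns on row labels:

```python
import pandas as pd

# Hypothetical data, chosen so every result is easy to verify by hand.
df = pd.DataFrame({"one": [1.0, 2.0, 3.0], "two": [10.0, 20.0, 30.0]},
                  index=["a", "b", "c"])

row = df.iloc[1]      # Series labeled by column names: one=2.0, two=20.0
column = df["two"]    # Series labeled by row labels: a=10.0, b=20.0, c=30.0

# Match the Series labels against the columns: `row` is subtracted from each row.
by_columns = df.sub(row, axis="columns")

# Match the Series labels against the index: `column` is subtracted from each column.
by_index = df.sub(column, axis="index")

print(by_columns)
print(by_index)
```

With axis="columns" the result is the same as the operator form df - row. The axis="index" form has no direct operator spelling, which is the usual reason to reach for the method.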

Furthermore, you can align a level of a multi-indexed DataFrame with a Series.

  In [22]: dfmi = df.copy()

  In [23]: dfmi.index = pd.MultiIndex.from_tuples([(1, 'a'), (1, 'b'), (1, 'c'), (2, 'a')],
     ....:                                        names=['first', 'second'])
     ....:

  In [24]: dfmi.sub(column, axis=0, level='second')
  Out[24]:
                     one      two     three
  first second
  1     a      -2.226031  0.00000       NaN
        b      -2.664393  0.00000 -3.121397
        c       0.948280  0.00000  2.417260
  2     a            NaN -1.58076 -2.347391
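
The level keyword can be sketched with fixed numbers as well. The frame and the offsets Series below are hypothetical; the point is only that each Series value is matched against the labels of the named index level and reused wherever that label appears:

```python
import pandas as pd

# Hypothetical two-level frame; the inner level "second" repeats the labels a and b.
index = pd.MultiIndex.from_tuples(
    [(1, "a"), (1, "b"), (2, "a"), (2, "b")], names=["first", "second"]
)
dfmi = pd.DataFrame({"x": [10.0, 20.0, 30.0, 40.0]}, index=index)

# One offset per label of the inner level.
offsets = pd.Series({"a": 1.0, "b": 2.0})

# Align `offsets` with the "second" level and subtract row-wise.
result = dfmi.sub(offsets, axis=0, level="second")
print(result)
```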

With Panel, describing the matching behavior is a bit more difficult, so the arithmetic methods instead (and perhaps confusingly) let you specify the broadcast axis. For example, suppose we wished to demean the data over a particular axis. This can be accomplished by taking the mean over an axis and broadcasting over the same axis:

  In [25]: major_mean = wp.mean(axis='major')

  In [26]: major_mean
  Out[26]:
        Item1     Item2
  A -0.878036 -0.092218
  B -0.060128  0.529811
  C  0.099453 -0.715139
  D  0.248599 -0.186535

  In [27]: wp.sub(major_mean, axis='major')
  Out[27]:
  <class 'pandas.core.panel.Panel'>
  Dimensions: 2 (items) x 5 (major_axis) x 4 (minor_axis)
  Items axis: Item1 to Item2
  Major_axis axis: 2000-01-01 00:00:00 to 2000-01-05 00:00:00
  Minor_axis axis: A to D

And similarly for axis="items" and axis="minor".
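
Panel is not available in recent pandas releases, so the same demeaning idea is sketched here on a plain DataFrame instead. This is a substitute for the Panel call above, not a reproduction of it, and the data is made up:

```python
import pandas as pd

# Hypothetical frame; column means are A=2.0 and B=30.0.
df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 60.0]})

# Take the mean down the rows, giving one value per column...
col_means = df.mean(axis=0)

# ...then match those values against the column labels and subtract.
demeaned = df.sub(col_means, axis="columns")
print(demeaned)
```

After the subtraction each column has mean zero, which is a quick sanity check on the chosen axis.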

Note: I could be convinced to make the axis argument in the DataFrame methods match the broadcasting behavior of Panel, though it would require a transition period so users can change their code…

Series and Index also support the divmod() builtin. This function performs floor division and the modulo operation at the same time, returning a two-tuple of the same type as the left-hand side. For example:

  In [28]: s = pd.Series(np.arange(10))

  In [29]: s
  Out[29]:
  0    0
  1    1
  2    2
  3    3
  4    4
  5    5
  6    6
  7    7
  8    8
  9    9
  dtype: int64

  In [30]: div, rem = divmod(s, 3)

  In [31]: div
  Out[31]:
  0    0
  1    0
  2    0
  3    1
  4    1
  5    1
  6    2
  7    2
  8    2
  9    3
  dtype: int64

  In [32]: rem
  Out[32]:
  0    0
  1    1
  2    2
  3    0
  4    1
  5    2
  6    0
  7    1
  8    2
  9    0
  dtype: int64

  In [33]: idx = pd.Index(np.arange(10))

  In [34]: idx
  Out[34]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')

  In [35]: div, rem = divmod(idx, 3)

  In [36]: div
  Out[36]: Int64Index([0, 0, 0, 1, 1, 1, 2, 2, 2, 3], dtype='int64')

  In [37]: rem
  Out[37]: Int64Index([0, 1, 2, 0, 1, 2, 0, 1, 2, 0], dtype='int64')

We can also do elementwise divmod():

  In [38]: div, rem = divmod(s, [2, 2, 3, 3, 4, 4, 5, 5, 6, 6])

  In [39]: div
  Out[39]:
  0    0
  1    0
  2    0
  3    1
  4    1
  5    1
  6    1
  7    1
  8    1
  9    1
  dtype: int64

  In [40]: rem
  Out[40]:
  0    0
  1    1
  2    2
  3    0
  4    0
  5    1
  6    1
  7    2
  8    2
  9    3
  dtype: int64
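
A short check, using a made-up three-element Series, that divmod() agrees with the separate // and % operators and that the two parts reassemble the original values:

```python
import pandas as pd

s = pd.Series([7, 8, 9])  # hypothetical values

quotient, remainder = divmod(s, 3)

# divmod(s, 3) should agree with the two operators applied separately.
same_as_operators = quotient.equals(s // 3) and remainder.equals(s % 3)

# Quotient and remainder together reconstruct the input.
reconstructed = quotient * 3 + remainder

print(quotient.tolist(), remainder.tolist(), same_as_operators)
```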

Missing data / operations with fill values

In Series and DataFrame, the arithmetic functions have the option of inputting a fill_value, namely a value to substitute when at most one of the values at a location is missing. For example, when adding two DataFrame objects, you may wish to treat NaN as 0 unless both DataFrames are missing that value, in which case the result will be NaN (you can later replace NaN with some other value using fillna if you wish).

  In [41]: df
  Out[41]:
          one       two     three
  a -1.101558  1.124472       NaN
  b -0.177289  2.487104 -0.634293
  c  0.462215 -0.486066  1.931194
  d       NaN -0.456288 -1.222918

  In [42]: df2
  Out[42]:
          one       two     three
  a -1.101558  1.124472  1.000000
  b -0.177289  2.487104 -0.634293
  c  0.462215 -0.486066  1.931194
  d       NaN -0.456288 -1.222918

  In [43]: df + df2
  Out[43]:
          one       two     three
  a -2.203116  2.248945       NaN
  b -0.354579  4.974208 -1.268586
  c  0.924429 -0.972131  3.862388
  d       NaN -0.912575 -2.445837

  In [44]: df.add(df2, fill_value=0)
  Out[44]:
          one       two     three
  a -2.203116  2.248945  1.000000
  b -0.354579  4.974208 -1.268586
  c  0.924429 -0.972131  3.862388
  d       NaN -0.912575 -2.445837
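
The three cases (neither value missing, exactly one missing, both missing) can be seen side by side in a minimal sketch. The two single-column frames are assumptions made up for the illustration:

```python
import numpy as np
import pandas as pd

# Row 0: both present. Row 1: only `right` present. Row 2: both missing.
left = pd.DataFrame({"x": [1.0, np.nan, np.nan]})
right = pd.DataFrame({"x": [10.0, 20.0, np.nan]})

plain = left + right                    # NaN wherever either side is missing
filled = left.add(right, fill_value=0)  # NaN only where both sides are missing

print(plain["x"].tolist())
print(filled["x"].tolist())
```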

Flexible Comparisons

Series and DataFrame have the binary comparison methods eq, ne, lt, gt, le, and ge, whose behavior is analogous to the binary arithmetic operations described above:

  In [45]: df.gt(df2)
  Out[45]:
       one    two  three
  a  False  False  False
  b  False  False  False
  c  False  False  False
  d  False  False  False

  In [46]: df2.ne(df)
  Out[46]:
       one    two  three
  a  False  False   True
  b  False  False  False
  c  False  False  False
  d   True  False  False
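
Because the comparison methods mirror the arithmetic ones, they accept the same axis keyword when the other operand is a Series. A minimal sketch with made-up thresholds (the names thresholds and row_limits are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2], "two": [10, 30]}, index=["a", "b"])

# One threshold per column, matched against the column labels.
thresholds = pd.Series({"one": 1, "two": 20})
above_by_column = df.gt(thresholds, axis="columns")

# One threshold per row, matched against the index labels.
row_limits = pd.Series({"a": 5, "b": 5})
above_by_row = df.gt(row_limits, axis="index")

print(above_by_column)
print(above_by_row)
```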

These operations produce a pandas object of the same type as the left-hand-side input, with dtype bool. These boolean objects can be used in indexing operations; see the section on Boolean indexing.

Boolean Reductions

You can apply the reductions empty, any(), all(), and bool() to summarize a boolean result.

  In [47]: (df > 0).all()
  Out[47]:
  one      False
  two      False
  three    False
  dtype: bool

  In [48]: (df > 0).any()
  Out[48]:
  one      True
  two      True
  three    True
  dtype: bool

You can reduce to a final boolean value.

  In [49]: (df > 0).any().any()
  Out[49]: True

You can test whether a pandas object is empty via the empty property.

  In [50]: df.empty
  Out[50]: False

  In [51]: pd.DataFrame(columns=list('ABC')).empty
  Out[51]: True

To evaluate single-element pandas objects in a boolean context, use the method bool():

  In [52]: pd.Series([True]).bool()
  Out[52]: True

  In [53]: pd.Series([False]).bool()
  Out[53]: False

  In [54]: pd.DataFrame([[True]]).bool()
  Out[54]: True

  In [55]: pd.DataFrame([[False]]).bool()
  Out[55]: False
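
Recent pandas releases deprecate bool(), so it is worth checking the release notes for your version. One alternative for a single-element Series, sketched here as an assumption rather than as the recommended replacement, is Series.item(), which returns the lone element and raises ValueError when the Series does not hold exactly one value:

```python
import pandas as pd

single = pd.Series([True])
flag = single.item()  # the only element, as a scalar

# item() refuses to guess when there is more than one element.
too_many = pd.Series([True, False])
try:
    too_many.item()
    raised = False
except ValueError:
    raised = True

print(flag, raised)
```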

Warning

You might be tempted to do the following:

  >>> if df:
  ...

Or

  >>> df and df2

Both raise an error, because the truth value of an object that holds multiple values is ambiguous:

  ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().

See gotchas for a more detailed discussion.
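
The sketch below, on a made-up frame, triggers the error deliberately and then shows the explicit reductions that state which question is being asked. The exact wording of the message may differ between versions, so only the word "ambiguous" is relied on:

```python
import pandas as pd

df = pd.DataFrame({"one": [1, -1], "two": [2, 3]})  # hypothetical data

message = None
try:
    if df > 0:  # ambiguous: should this mean "any" or "all"?
        pass
except ValueError as exc:
    message = str(exc)

# Say explicitly which reduction is meant.
any_positive = (df > 0).any().any()  # at least one positive entry
all_positive = (df > 0).all().all()  # every entry positive

print(message)
print(any_positive, all_positive)
```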

Comparing if objects are equivalent

Often you may find that there is more than one way to compute the same result. As a simple example, consider df+df and df*2. To test that these two computations produce the same result, given the tools shown above, you might imagine using (df+df == df*2).all(). But in fact, this expression is False for some columns:

  In [56]: df+df == df*2
  Out[56]:
       one   two  three
  a   True  True  False
  b   True  True   True
  c   True  True   True
  d  False  True   True

  In [57]: (df+df == df*2).all()
  Out[57]:
  one      False
  two       True
  three    False
  dtype: bool

Notice that the boolean DataFrame df+df == df*2 contains some False values! This is because NaNs do not compare as equals:

  In [58]: np.nan == np.nan
  Out[58]: False

So, NDFrames (such as Series, DataFrames, and Panels) have an equals() method for testing equality, with NaNs in corresponding locations treated as equal.

  In [59]: (df+df).equals(df*2)
  Out[59]: True

Note that the Series or DataFrame index needs to be in the same order for equality to be True:

  In [60]: df1 = pd.DataFrame({'col': ['foo', 0, np.nan]})

  In [61]: df2 = pd.DataFrame({'col': [np.nan, 0, 'foo']}, index=[2, 1, 0])

  In [62]: df1.equals(df2)
  Out[62]: False

  In [63]: df1.equals(df2.sort_index())
  Out[63]: True
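
A compact comparison of the two approaches on made-up Series. The last check goes slightly beyond the text above: equals() also expects matching dtypes, so an integer Series and a float Series holding the same numbers are not considered equal:

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, np.nan, 3.0])
b = pd.Series([1.0, np.nan, 3.0])

# Elementwise ==: the NaN position compares False, so all() is False.
elementwise_all = bool((a == b).all())

# equals(): NaNs in the same position count as equal.
same = a.equals(b)

# Same values but int64 versus float64.
ints = pd.Series([1, 2])
floats = pd.Series([1.0, 2.0])
same_across_dtypes = ints.equals(floats)

print(elementwise_all, same, same_across_dtypes)
```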

Comparing array-like objects

You can conveniently perform element-wise comparisons when comparing a pandas data structure with a scalar value:

  In [64]: pd.Series(['foo', 'bar', 'baz']) == 'foo'
  Out[64]:
  0     True
  1    False
  2    False
  dtype: bool

  In [65]: pd.Index(['foo', 'bar', 'baz']) == 'foo'
  Out[65]: array([ True, False, False], dtype=bool)

pandas also handles element-wise comparisons between different array-like objects of the same length:

  In [66]: pd.Series(['foo', 'bar', 'baz']) == pd.Index(['foo', 'bar', 'qux'])
  Out[66]:
  0     True
  1     True
  2    False
  dtype: bool

  In [67]: pd.Series(['foo', 'bar', 'baz']) == np.array(['foo', 'bar', 'qux'])
  Out[67]:
  0     True
  1     True
  2    False
  dtype: bool

Trying to compare Index or Series objects of different lengths will raise a ValueError:

  In [55]: pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo', 'bar'])
  ValueError: Series lengths must match to compare

  In [56]: pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo'])
  ValueError: Series lengths must match to compare

Note that this is different from the NumPy behavior, where a comparison can be broadcast:

  In [68]: np.array([1, 2, 3]) == np.array([2])
  Out[68]: array([False,  True, False], dtype=bool)

or it can return False if broadcasting cannot be done:

  In [69]: np.array([1, 2, 3]) == np.array([1, 2])
  Out[69]: False
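
When two Series of different lengths genuinely need to be compared, one option (shown as a sketch on made-up data, not as the only approach) is the flexible eq() method, which aligns on labels first and reports False where a label exists on only one side:

```python
import pandas as pd

long = pd.Series([1, 2, 3])
short = pd.Series([1, 2])

# The == operator refuses to compare Series whose labels differ.
try:
    long == short
    raised = False
except ValueError:
    raised = True

# eq() aligns on the index; label 2 is missing from `short`, so it compares False.
aligned = long.eq(short)

print(raised)
print(aligned.tolist())
```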

Combining overlapping data sets

A problem that occasionally arises is the combination of two similar data sets where values in one are preferred over the other. An example would be two data series representing a particular economic indicator, where one is considered to be of "higher quality". However, the lower-quality series might extend further back in history or have more complete data coverage. As such, we would like to combine two DataFrame objects where missing values in one DataFrame are conditionally filled with like-labeled values from the other DataFrame. The function implementing this operation is combine_first(), which we illustrate:

  In [70]: df1 = pd.DataFrame({'A': [1., np.nan, 3., 5., np.nan],
     ....:                     'B': [np.nan, 2., 3., np.nan, 6.]})
     ....:

  In [71]: df2 = pd.DataFrame({'A': [5., 2., 4., np.nan, 3., 7.],
     ....:                     'B': [np.nan, np.nan, 3., 4., 6., 8.]})
     ....:

  In [72]: df1
  Out[72]:
       A    B
  0  1.0  NaN
  1  NaN  2.0
  2  3.0  3.0
  3  5.0  NaN
  4  NaN  6.0

  In [73]: df2
  Out[73]:
       A    B
  0  5.0  NaN
  1  2.0  NaN
  2  4.0  3.0
  3  NaN  4.0
  4  3.0  6.0
  5  7.0  8.0

  In [74]: df1.combine_first(df2)
  Out[74]:
       A    B
  0  1.0  NaN
  1  2.0  2.0
  2  3.0  3.0
  3  5.0  4.0
  4  3.0  6.0
  5  7.0  8.0
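
The economic-indicator scenario from the paragraph above can be sketched with two Series. The numbers and the names preferred and fallback are invented for illustration: the preferred series has a gap and a shorter history, and the fallback fills both:

```python
import numpy as np
import pandas as pd

# Hypothetical "higher quality" series: starts later and has one gap.
preferred = pd.Series([np.nan, 2.5, 2.7], index=[2001, 2002, 2003])

# Hypothetical lower-quality series with longer history and full coverage.
fallback = pd.Series([1.9, 2.0, 2.4, 2.6], index=[2000, 2001, 2002, 2003])

# Keep `preferred` wherever it has a value; otherwise take `fallback`.
combined = preferred.combine_first(fallback)
print(combined)
```

The result covers the union of both indexes, and values from fallback appear only where preferred is missing or has no entry.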

General DataFrame Combine

The combine_first() method above calls the more general DataFrame.combine(). This method takes another DataFrame and a combiner function, aligns the input DataFrame, and then passes the combiner function pairs of Series (i.e., columns whose names are the same).

So, for instance, to reproduce combine_first() as above:

  In [75]: combiner = lambda x, y: np.where(pd.isna(x), y, x)

  In [76]: df1.combine(df2, combiner)
  Out[76]:
       A    B
  0  1.0  NaN
  1  2.0  2.0
  2  3.0  3.0
  3  5.0  4.0
  4  3.0  6.0
  5  7.0  8.0
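
Any function that accepts two Series and returns a Series-like result can serve as the combiner. As a further sketch on made-up frames (named left and right here to avoid clashing with df1 and df2 above), np.minimum keeps the smaller value at each position:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"A": [5, 0], "B": [2, 4]})
right = pd.DataFrame({"A": [1, 1], "B": [3, 3]})

# combine() passes same-named columns pairwise to the combiner;
# np.minimum returns the elementwise smaller value of each pair.
smallest = left.combine(right, np.minimum)
print(smallest)
```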