Function application

To apply your own or another library’s functions to pandas objects, you should be aware of the methods below. The appropriate method to use depends on whether your function expects to operate on an entire DataFrame or Series, row- or column-wise, or elementwise.

  1. Tablewise Function Application: pipe()
  2. Row or Column-wise Function Application: apply()
  3. Aggregation API: agg() and transform()
  4. Applying Elementwise Functions: applymap()

Tablewise Function Application

DataFrames and Series can of course just be passed into functions. However, if the function needs to be called in a chain, consider using the pipe() method. Compare the following

    # f, g, and h are functions taking and returning ``DataFrames``
    >>> f(g(h(df), arg1=1), arg2=2, arg3=3)

with the equivalent

    >>> (df.pipe(h)
    ...    .pipe(g, arg1=1)
    ...    .pipe(f, arg2=2, arg3=3)
    ... )

Pandas encourages the second style, which is known as method chaining. pipe makes it easy to use your own or another library’s functions in method chains, alongside pandas’ methods.

In the example above, the functions f, g, and h each expected the DataFrame as the first positional argument. What if the function you wish to apply takes its data as, say, the second argument? In this case, provide pipe with a tuple of (callable, data_keyword). .pipe will route the DataFrame to the argument specified in the tuple.

For example, we can fit a regression using statsmodels. Their API expects a formula first and a DataFrame as the second argument, data. We pass the (function, keyword) pair (sm.ols, 'data') to pipe:

    In [138]: import statsmodels.formula.api as sm

    In [139]: bb = pd.read_csv('data/baseball.csv', index_col='id')

    In [140]: (bb.query('h > 0')
       .....:    .assign(ln_h=lambda df: np.log(df.h))
       .....:    .pipe((sm.ols, 'data'), 'hr ~ ln_h + year + g + C(lg)')
       .....:    .fit()
       .....:    .summary()
       .....: )
       .....:
    Out[140]:
    <class 'statsmodels.iolib.summary.Summary'>
    """
                                OLS Regression Results
    ==============================================================================
    Dep. Variable:                     hr   R-squared:                       0.685
    Model:                            OLS   Adj. R-squared:                  0.665
    Method:                 Least Squares   F-statistic:                     34.28
    Date:                Sun, 05 Aug 2018   Prob (F-statistic):           3.48e-15
    Time:                        11:57:36   Log-Likelihood:                -205.92
    No. Observations:                  68   AIC:                             421.8
    Df Residuals:                      63   BIC:                             432.9
    Df Model:                           4
    Covariance Type:            nonrobust
    ===============================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
    -------------------------------------------------------------------------------
    Intercept   -8484.7720   4664.146     -1.819      0.074   -1.78e+04     835.780
    C(lg)[T.NL]    -2.2736      1.325     -1.716      0.091      -4.922       0.375
    ln_h           -1.3542      0.875     -1.547      0.127      -3.103       0.395
    year            4.2277      2.324      1.819      0.074      -0.417       8.872
    g               0.1841      0.029      6.258      0.000       0.125       0.243
    ==============================================================================
    Omnibus:                       10.875   Durbin-Watson:                   1.999
    Prob(Omnibus):                  0.004   Jarque-Bera (JB):               17.298
    Skew:                           0.537   Prob(JB):                     0.000175
    Kurtosis:                       5.225   Cond. No.                     1.49e+07
    ==============================================================================

    Warnings:
    [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
    [2] The condition number is large, 1.49e+07. This might indicate that there are
    strong multicollinearity or other numerical problems.
    """

The pipe method is inspired by Unix pipes and, more recently, by dplyr and magrittr, which introduced the popular %>% (read: pipe) operator for R. The implementation of pipe here is quite clean and feels right at home in Python. We encourage you to view the source code of pipe().

Row or Column-wise Function Application

Arbitrary functions can be applied along the axes of a DataFrame using the apply() method, which, like the descriptive statistics methods, takes an optional axis argument:

    In [141]: df.apply(np.mean)
    Out[141]:
    one     -0.272211
    two      0.667306
    three    0.024661
    dtype: float64

    In [142]: df.apply(np.mean, axis=1)
    Out[142]:
    a    0.011457
    b    0.558507
    c    0.635781
    d   -0.839603
    dtype: float64

    In [143]: df.apply(lambda x: x.max() - x.min())
    Out[143]:
    one      1.563773
    two      2.973170
    three    3.154112
    dtype: float64

    In [144]: df.apply(np.cumsum)
    Out[144]:
            one       two     three
    a -1.101558  1.124472       NaN
    b -1.278848  3.611576 -0.634293
    c -0.816633  3.125511  1.296901
    d       NaN  2.669223  0.073983

    In [145]: df.apply(np.exp)
    Out[145]:
            one        two    three
    a  0.332353   3.078592      NaN
    b  0.837537  12.026397  0.53031
    c  1.587586   0.615041  6.89774
    d       NaN   0.633631  0.29437

The apply() method will also dispatch on a string method name.

    In [146]: df.apply('mean')
    Out[146]:
    one     -0.272211
    two      0.667306
    three    0.024661
    dtype: float64

    In [147]: df.apply('mean', axis=1)
    Out[147]:
    a    0.011457
    b    0.558507
    c    0.635781
    d   -0.839603
    dtype: float64

The return type of the function passed to apply() affects the type of the final output from DataFrame.apply under the default behaviour:

  • If the applied function returns a Series, the final output is a DataFrame. The columns match the index of the Series returned by the applied function.
  • If the applied function returns any other type, the final output is a Series.

This default behaviour can be overridden using the result_type argument, which accepts three options: reduce, broadcast, and expand. These determine how list-like return values expand (or not) to a DataFrame.
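A small sketch of the three behaviours on a toy frame (assuming a pandas version recent enough to have result_type, which was added to DataFrame.apply in 0.23):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])

# default (result_type=None): a returned list is kept whole, one object per row
reduced = df.apply(lambda row: [row['a'], row['b']], axis=1)

# 'expand': list-like results are expanded into the columns of a DataFrame
expanded = df.apply(lambda row: [row['a'], row['b']], axis=1,
                    result_type='expand')

# 'broadcast': the result is broadcast back to the original shape,
# retaining the original index and columns
broadcast = df.apply(lambda row: [row['a'] * 2, row['b'] * 2], axis=1,
                     result_type='broadcast')
```

Here reduced is a Series of lists, expanded is a DataFrame with integer columns 0 and 1, and broadcast is a DataFrame that keeps the original labels a and b.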

apply() combined with some cleverness can be used to answer many questions about a data set. For example, suppose we wanted to extract the date where the maximum value for each column occurred:

    In [148]: tsdf = pd.DataFrame(np.random.randn(1000, 3), columns=['A', 'B', 'C'],
       .....:                     index=pd.date_range('1/1/2000', periods=1000))
       .....:

    In [149]: tsdf.apply(lambda x: x.idxmax())
    Out[149]:
    A   2001-04-25
    B   2002-05-31
    C   2002-09-25
    dtype: datetime64[ns]

You may also pass additional arguments and keyword arguments to the apply() method. For instance, consider the following function you would like to apply:

    def subtract_and_divide(x, sub, divide=1):
        return (x - sub) / divide

You may then apply this function as follows:

    df.apply(subtract_and_divide, args=(5,), divide=3)
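As a quick check of how the extra arguments flow through, here is a minimal, self-contained run (the small frame below is a stand-in, not the df used elsewhere in this section):

```python
import pandas as pd

def subtract_and_divide(x, sub, divide=1):
    return (x - sub) / divide

df = pd.DataFrame({"a": [5, 8], "b": [11, 14]})

# positional extras go through ``args``; keyword extras pass straight through
result = df.apply(subtract_and_divide, args=(5,), divide=3)
# result:
#      a    b
# 0  0.0  2.0
# 1  1.0  3.0
```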

Another useful feature is the ability to pass Series methods to carry out some Series operation on each column or row:

    In [150]: tsdf
    Out[150]:
                       A         B         C
    2000-01-01 -0.720299  0.546303 -0.082042
    2000-01-02  0.200295 -0.577554 -0.908402
    2000-01-03  0.102533  1.653614  0.303319
    2000-01-04       NaN       NaN       NaN
    2000-01-05       NaN       NaN       NaN
    2000-01-06       NaN       NaN       NaN
    2000-01-07       NaN       NaN       NaN
    2000-01-08  0.532566  0.341548  0.150493
    2000-01-09  0.330418  1.761200  0.567133
    2000-01-10 -0.251020  1.020099  1.893177

    In [151]: tsdf.apply(pd.Series.interpolate)
    Out[151]:
                       A         B         C
    2000-01-01 -0.720299  0.546303 -0.082042
    2000-01-02  0.200295 -0.577554 -0.908402
    2000-01-03  0.102533  1.653614  0.303319
    2000-01-04  0.188539  1.391201  0.272754
    2000-01-05  0.274546  1.128788  0.242189
    2000-01-06  0.360553  0.866374  0.211624
    2000-01-07  0.446559  0.603961  0.181059
    2000-01-08  0.532566  0.341548  0.150493
    2000-01-09  0.330418  1.761200  0.567133
    2000-01-10 -0.251020  1.020099  1.893177

Finally, apply() takes an argument raw, which is False by default; each row or column is converted into a Series before the function is applied. When raw is set to True, the passed function instead receives an ndarray object, which has positive performance implications if you do not need the indexing functionality.
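A minimal illustration of the difference; the lambda simply reports the type it was handed:

```python
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]})

# raw=False (the default): each column arrives wrapped in a Series
as_series = df.apply(lambda x: type(x).__name__)

# raw=True: each column arrives as a bare ndarray, skipping Series
# construction -- faster when the index is not needed
as_array = df.apply(lambda x: type(x).__name__, raw=True)
# as_series['a'] == 'Series';  as_array['a'] == 'ndarray'
```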

Aggregation API

New in version 0.20.0.

The aggregation API allows you to express possibly multiple aggregation operations in a single, concise way. This API is similar across pandas objects; see the groupby API, the window functions API, and the resample API. The entry point for aggregation is DataFrame.aggregate(), or the alias DataFrame.agg().

We will use a similar starting frame from above:

    In [152]: tsdf = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'],
       .....:                     index=pd.date_range('1/1/2000', periods=10))
       .....:

    In [153]: tsdf.iloc[3:7] = np.nan

    In [154]: tsdf
    Out[154]:
                       A         B         C
    2000-01-01  0.170247 -0.916844  0.835024
    2000-01-02  1.259919  0.801111  0.445614
    2000-01-03  1.453046  2.430373  0.653093
    2000-01-04       NaN       NaN       NaN
    2000-01-05       NaN       NaN       NaN
    2000-01-06       NaN       NaN       NaN
    2000-01-07       NaN       NaN       NaN
    2000-01-08 -1.874526  0.569822 -0.609644
    2000-01-09  0.812462  0.565894 -1.461363
    2000-01-10 -0.985475  1.388154 -0.078747

Using a single function is equivalent to apply(). You can also pass named methods as strings. These will return a Series of the aggregated output:

    In [155]: tsdf.agg(np.sum)
    Out[155]:
    A    0.835673
    B    4.838510
    C   -0.216025
    dtype: float64

    In [156]: tsdf.agg('sum')
    Out[156]:
    A    0.835673
    B    4.838510
    C   -0.216025
    dtype: float64

    # these are equivalent to a ``.sum()`` because we are aggregating on a single function
    In [157]: tsdf.sum()
    Out[157]:
    A    0.835673
    B    4.838510
    C   -0.216025
    dtype: float64

A single aggregation on a Series will return a scalar value:

    In [158]: tsdf.A.agg('sum')
    Out[158]: 0.83567297915820504

Aggregating with multiple functions

You can pass multiple aggregation arguments as a list. The result of each passed function will be a row in the resulting DataFrame. The rows are naturally named after the aggregation functions.

    In [159]: tsdf.agg(['sum'])
    Out[159]:
                A        B         C
    sum  0.835673  4.83851 -0.216025

Multiple functions yield multiple rows:

    In [160]: tsdf.agg(['sum', 'mean'])
    Out[160]:
                 A         B         C
    sum   0.835673  4.838510 -0.216025
    mean  0.139279  0.806418 -0.036004

On a Series, multiple functions return a Series, indexed by the function names:

    In [161]: tsdf.A.agg(['sum', 'mean'])
    Out[161]:
    sum     0.835673
    mean    0.139279
    Name: A, dtype: float64

Passing a lambda function will yield a <lambda> named row:

    In [162]: tsdf.A.agg(['sum', lambda x: x.mean()])
    Out[162]:
    sum         0.835673
    <lambda>    0.139279
    Name: A, dtype: float64

Passing a named function will yield that name for the row:

    In [163]: def mymean(x):
       .....:     return x.mean()
       .....:

    In [164]: tsdf.A.agg(['sum', mymean])
    Out[164]:
    sum       0.835673
    mymean    0.139279
    Name: A, dtype: float64

Aggregating with a dict

Passing a dictionary mapping column names to a scalar or a list of scalars to DataFrame.agg allows you to customize which functions are applied to which columns. Note that the results are not in any particular order; you can use an OrderedDict instead to guarantee ordering.

    In [165]: tsdf.agg({'A': 'mean', 'B': 'sum'})
    Out[165]:
    A    0.139279
    B    4.838510
    dtype: float64

Passing a list-like will generate a DataFrame output: a matrix-like output of all of the aggregators. The output will consist of all unique functions; those not applied to a particular column will be NaN:

    In [166]: tsdf.agg({'A': ['mean', 'min'], 'B': 'sum'})
    Out[166]:
                 A        B
    mean  0.139279      NaN
    min  -1.874526      NaN
    sum        NaN  4.83851

Mixed Dtypes

When presented with mixed dtypes that cannot all be aggregated, .agg will only take the valid aggregations. This is similar to how groupby .agg works.

    In [167]: mdf = pd.DataFrame({'A': [1, 2, 3],
       .....:                     'B': [1., 2., 3.],
       .....:                     'C': ['foo', 'bar', 'baz'],
       .....:                     'D': pd.date_range('20130101', periods=3)})
       .....:

    In [168]: mdf.dtypes
    Out[168]:
    A             int64
    B           float64
    C            object
    D    datetime64[ns]
    dtype: object

    In [169]: mdf.agg(['min', 'sum'])
    Out[169]:
         A    B          C          D
    min  1  1.0        bar 2013-01-01
    sum  6  6.0  foobarbaz        NaT

Custom describe

With .agg() it is possible to easily create a custom describe function, similar to the built-in describe function.

    In [170]: from functools import partial

    In [171]: q_25 = partial(pd.Series.quantile, q=0.25)

    In [172]: q_25.__name__ = '25%'

    In [173]: q_75 = partial(pd.Series.quantile, q=0.75)

    In [174]: q_75.__name__ = '75%'

    In [175]: tsdf.agg(['count', 'mean', 'std', 'min', q_25, 'median', q_75, 'max'])
    Out[175]:
                   A         B         C
    count   6.000000  6.000000  6.000000
    mean    0.139279  0.806418 -0.036004
    std     1.323362  1.100830  0.874990
    min    -1.874526 -0.916844 -1.461363
    25%    -0.696544  0.566876 -0.476920
    median  0.491354  0.685467  0.183433
    75%     1.148055  1.241393  0.601223
    max     1.453046  2.430373  0.835024

Transform API

New in version 0.20.0.

The transform() method returns an object that is indexed the same (same size) as the original. This API allows you to provide multiple operations at the same time rather than one-by-one. Its API is quite similar to the .agg API.

We create a frame similar to the one used in the above sections.

    In [176]: tsdf = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'],
       .....:                     index=pd.date_range('1/1/2000', periods=10))
       .....:

    In [177]: tsdf.iloc[3:7] = np.nan

    In [178]: tsdf
    Out[178]:
                       A         B         C
    2000-01-01 -0.578465 -0.503335 -0.987140
    2000-01-02 -0.767147 -0.266046  1.083797
    2000-01-03  0.195348  0.722247 -0.894537
    2000-01-04       NaN       NaN       NaN
    2000-01-05       NaN       NaN       NaN
    2000-01-06       NaN       NaN       NaN
    2000-01-07       NaN       NaN       NaN
    2000-01-08 -0.556397  0.542165 -0.308675
    2000-01-09 -1.010924 -0.672504 -1.139222
    2000-01-10  0.354653  0.563622 -0.365106

Transform the entire frame. .transform() accepts as input a NumPy function, a string function name, or a user-defined function.

    In [179]: tsdf.transform(np.abs)
    Out[179]:
                       A         B         C
    2000-01-01  0.578465  0.503335  0.987140
    2000-01-02  0.767147  0.266046  1.083797
    2000-01-03  0.195348  0.722247  0.894537
    2000-01-04       NaN       NaN       NaN
    2000-01-05       NaN       NaN       NaN
    2000-01-06       NaN       NaN       NaN
    2000-01-07       NaN       NaN       NaN
    2000-01-08  0.556397  0.542165  0.308675
    2000-01-09  1.010924  0.672504  1.139222
    2000-01-10  0.354653  0.563622  0.365106

    In [180]: tsdf.transform('abs')
    Out[180]:
                       A         B         C
    2000-01-01  0.578465  0.503335  0.987140
    2000-01-02  0.767147  0.266046  1.083797
    2000-01-03  0.195348  0.722247  0.894537
    2000-01-04       NaN       NaN       NaN
    2000-01-05       NaN       NaN       NaN
    2000-01-06       NaN       NaN       NaN
    2000-01-07       NaN       NaN       NaN
    2000-01-08  0.556397  0.542165  0.308675
    2000-01-09  1.010924  0.672504  1.139222
    2000-01-10  0.354653  0.563622  0.365106

    In [181]: tsdf.transform(lambda x: x.abs())
    Out[181]:
                       A         B         C
    2000-01-01  0.578465  0.503335  0.987140
    2000-01-02  0.767147  0.266046  1.083797
    2000-01-03  0.195348  0.722247  0.894537
    2000-01-04       NaN       NaN       NaN
    2000-01-05       NaN       NaN       NaN
    2000-01-06       NaN       NaN       NaN
    2000-01-07       NaN       NaN       NaN
    2000-01-08  0.556397  0.542165  0.308675
    2000-01-09  1.010924  0.672504  1.139222
    2000-01-10  0.354653  0.563622  0.365106

Here transform() received a single function; this is equivalent to a ufunc application.

    In [182]: np.abs(tsdf)
    Out[182]:
                       A         B         C
    2000-01-01  0.578465  0.503335  0.987140
    2000-01-02  0.767147  0.266046  1.083797
    2000-01-03  0.195348  0.722247  0.894537
    2000-01-04       NaN       NaN       NaN
    2000-01-05       NaN       NaN       NaN
    2000-01-06       NaN       NaN       NaN
    2000-01-07       NaN       NaN       NaN
    2000-01-08  0.556397  0.542165  0.308675
    2000-01-09  1.010924  0.672504  1.139222
    2000-01-10  0.354653  0.563622  0.365106

Passing a single function to .transform() with a Series will yield a single Series in return.

    In [183]: tsdf.A.transform(np.abs)
    Out[183]:
    2000-01-01    0.578465
    2000-01-02    0.767147
    2000-01-03    0.195348
    2000-01-04         NaN
    2000-01-05         NaN
    2000-01-06         NaN
    2000-01-07         NaN
    2000-01-08    0.556397
    2000-01-09    1.010924
    2000-01-10    0.354653
    Freq: D, Name: A, dtype: float64

Transform with multiple functions

Passing multiple functions will yield a column multi-indexed DataFrame. The first level will be the original frame column names; the second level will be the names of the transforming functions.

    In [184]: tsdf.transform([np.abs, lambda x: x + 1])
    Out[184]:
                       A                   B                   C
                absolute  <lambda>  absolute  <lambda>  absolute  <lambda>
    2000-01-01  0.578465  0.421535  0.503335  0.496665  0.987140  0.012860
    2000-01-02  0.767147  0.232853  0.266046  0.733954  1.083797  2.083797
    2000-01-03  0.195348  1.195348  0.722247  1.722247  0.894537  0.105463
    2000-01-04       NaN       NaN       NaN       NaN       NaN       NaN
    2000-01-05       NaN       NaN       NaN       NaN       NaN       NaN
    2000-01-06       NaN       NaN       NaN       NaN       NaN       NaN
    2000-01-07       NaN       NaN       NaN       NaN       NaN       NaN
    2000-01-08  0.556397  0.443603  0.542165  1.542165  0.308675  0.691325
    2000-01-09  1.010924 -0.010924  0.672504  0.327496  1.139222 -0.139222
    2000-01-10  0.354653  1.354653  0.563622  1.563622  0.365106  0.634894

Passing multiple functions to a Series will yield a DataFrame. The resulting column names will be the transforming functions.

    In [185]: tsdf.A.transform([np.abs, lambda x: x + 1])
    Out[185]:
                absolute  <lambda>
    2000-01-01  0.578465  0.421535
    2000-01-02  0.767147  0.232853
    2000-01-03  0.195348  1.195348
    2000-01-04       NaN       NaN
    2000-01-05       NaN       NaN
    2000-01-06       NaN       NaN
    2000-01-07       NaN       NaN
    2000-01-08  0.556397  0.443603
    2000-01-09  1.010924 -0.010924
    2000-01-10  0.354653  1.354653

Transforming with a dict

Passing a dict of functions will allow selective transforming per column.

    In [186]: tsdf.transform({'A': np.abs, 'B': lambda x: x + 1})
    Out[186]:
                       A         B
    2000-01-01  0.578465  0.496665
    2000-01-02  0.767147  0.733954
    2000-01-03  0.195348  1.722247
    2000-01-04       NaN       NaN
    2000-01-05       NaN       NaN
    2000-01-06       NaN       NaN
    2000-01-07       NaN       NaN
    2000-01-08  0.556397  1.542165
    2000-01-09  1.010924  0.327496
    2000-01-10  0.354653  1.563622

Passing a dict of lists will generate a multi-indexed DataFrame with these selective transforms.

    In [187]: tsdf.transform({'A': np.abs, 'B': [lambda x: x + 1, 'sqrt']})
    Out[187]:
                       A         B
                absolute  <lambda>      sqrt
    2000-01-01  0.578465  0.496665       NaN
    2000-01-02  0.767147  0.733954       NaN
    2000-01-03  0.195348  1.722247  0.849851
    2000-01-04       NaN       NaN       NaN
    2000-01-05       NaN       NaN       NaN
    2000-01-06       NaN       NaN       NaN
    2000-01-07       NaN       NaN       NaN
    2000-01-08  0.556397  1.542165  0.736318
    2000-01-09  1.010924  0.327496       NaN
    2000-01-10  0.354653  1.563622  0.750748

Applying Elementwise Functions

Since not all functions can be vectorized (accept NumPy arrays and return another array or value), the methods applymap() on DataFrame and analogously map() on Series accept any Python function taking a single value and returning a single value. For example:

    In [188]: df4
    Out[188]:
            one       two     three
    a -1.101558  1.124472       NaN
    b -0.177289  2.487104 -0.634293
    c  0.462215 -0.486066  1.931194
    d       NaN -0.456288 -1.222918

    In [189]: f = lambda x: len(str(x))

    In [190]: df4['one'].map(f)
    Out[190]:
    a    19
    b    20
    c    18
    d     3
    Name: one, dtype: int64

    In [191]: df4.applymap(f)
    Out[191]:
       one  two  three
    a   19   18      3
    b   20   18     19
    c   18   20     18
    d    3   19     19

Series.map() has an additional feature: it can be used to easily “link” or “map” values defined by a secondary Series. This is closely related to merging/joining functionality:

    In [192]: s = pd.Series(['six', 'seven', 'six', 'seven', 'six'],
       .....:               index=['a', 'b', 'c', 'd', 'e'])
       .....:

    In [193]: t = pd.Series({'six': 6., 'seven': 7.})

    In [194]: s
    Out[194]:
    a      six
    b    seven
    c      six
    d    seven
    e      six
    dtype: object

    In [195]: s.map(t)
    Out[195]:
    a    6.0
    b    7.0
    c    6.0
    d    7.0
    e    6.0
    dtype: float64

Applying with a Panel

Applying with a Panel will pass a Series to the applied function. If the applied function returns a Series, the result of the application will be a Panel. If the applied function reduces to a scalar, the result of the application will be a DataFrame.

    In [196]: import pandas.util.testing as tm

    In [197]: panel = tm.makePanel(5)

    In [198]: panel
    Out[198]:
    <class 'pandas.core.panel.Panel'>
    Dimensions: 3 (items) x 5 (major_axis) x 4 (minor_axis)
    Items axis: ItemA to ItemC
    Major_axis axis: 2000-01-03 00:00:00 to 2000-01-07 00:00:00
    Minor_axis axis: A to D

    In [199]: panel['ItemA']
    Out[199]:
                       A         B         C         D
    2000-01-03  1.092702  0.604244 -2.927808  0.339642
    2000-01-04 -1.481449 -0.487265  0.082065  1.499953
    2000-01-05  1.781190  1.990533  0.456554 -0.317818
    2000-01-06 -0.031543  0.327007 -1.757911  0.447371
    2000-01-07  0.480993  1.053639  0.982407 -1.315799

A transformational apply.

    In [200]: result = panel.apply(lambda x: x * 2, axis='items')

    In [201]: result
    Out[201]:
    <class 'pandas.core.panel.Panel'>
    Dimensions: 3 (items) x 5 (major_axis) x 4 (minor_axis)
    Items axis: ItemA to ItemC
    Major_axis axis: 2000-01-03 00:00:00 to 2000-01-07 00:00:00
    Minor_axis axis: A to D

    In [202]: result['ItemA']
    Out[202]:
                       A         B         C         D
    2000-01-03  2.185405  1.208489 -5.855616  0.679285
    2000-01-04 -2.962899 -0.974530  0.164130  2.999905
    2000-01-05  3.562379  3.981066  0.913107 -0.635635
    2000-01-06 -0.063086  0.654013 -3.515821  0.894742
    2000-01-07  0.961986  2.107278  1.964815 -2.631598

A reduction operation.

    In [203]: panel.apply(lambda x: x.dtype, axis='items')
    Out[203]:
                      A        B        C        D
    2000-01-03  float64  float64  float64  float64
    2000-01-04  float64  float64  float64  float64
    2000-01-05  float64  float64  float64  float64
    2000-01-06  float64  float64  float64  float64
    2000-01-07  float64  float64  float64  float64

A similar reduction-type operation.

    In [204]: panel.apply(lambda x: x.sum(), axis='major_axis')
    Out[204]:
          ItemA     ItemB     ItemC
    A  1.841893  0.918017 -1.160547
    B  3.488158 -2.629773  0.603397
    C -3.164692  0.805970  0.806501
    D  0.653349 -0.152299  0.252577

This last reduction is equivalent to:

    In [205]: panel.sum('major_axis')
    Out[205]:
          ItemA     ItemB     ItemC
    A  1.841893  0.918017 -1.160547
    B  3.488158 -2.629773  0.603397
    C -3.164692  0.805970  0.806501
    D  0.653349 -0.152299  0.252577

A transformation operation that returns a Panel, computing the z-score across the major_axis:

    In [206]: result = panel.apply(
       .....:     lambda x: (x - x.mean()) / x.std(),
       .....:     axis='major_axis')
       .....:

    In [207]: result
    Out[207]:
    <class 'pandas.core.panel.Panel'>
    Dimensions: 3 (items) x 5 (major_axis) x 4 (minor_axis)
    Items axis: ItemA to ItemC
    Major_axis axis: 2000-01-03 00:00:00 to 2000-01-07 00:00:00
    Minor_axis axis: A to D

    In [208]: result['ItemA']
    Out[208]:
                       A         B         C         D
    2000-01-03  0.585813 -0.102070 -1.394063  0.201263
    2000-01-04 -1.496089 -1.295066  0.434343  1.318766
    2000-01-05  1.142642  1.413112  0.661833 -0.431942
    2000-01-06 -0.323445 -0.405085 -0.683386  0.305017
    2000-01-07  0.091079  0.389108  0.981273 -1.393105

Apply can also accept multiple axes in the axis argument. This will pass a DataFrame of the cross-section to the applied function.

    In [209]: f = lambda x: ((x.T - x.mean(1)) / x.std(1)).T

    In [210]: result = panel.apply(f, axis=['items', 'major_axis'])

    In [211]: result
    Out[211]:
    <class 'pandas.core.panel.Panel'>
    Dimensions: 4 (items) x 5 (major_axis) x 3 (minor_axis)
    Items axis: A to D
    Major_axis axis: 2000-01-03 00:00:00 to 2000-01-07 00:00:00
    Minor_axis axis: ItemA to ItemC

    In [212]: result.loc[:, :, 'ItemA']
    Out[212]:
                       A         B         C         D
    2000-01-03  0.859304  0.448509 -1.109374  0.397237
    2000-01-04 -1.053319 -1.063370  0.986639  1.152266
    2000-01-05  1.106511  1.143185 -0.093917 -0.583083
    2000-01-06  0.561619 -0.835608 -1.075936  0.194525
    2000-01-07 -0.339514  1.097901  0.747522 -1.147605

This is equivalent to the following:

    In [213]: result = pd.Panel(dict([(ax, f(panel.loc[:, :, ax]))
       .....:                         for ax in panel.minor_axis]))
       .....:

    In [214]: result
    Out[214]:
    <class 'pandas.core.panel.Panel'>
    Dimensions: 4 (items) x 5 (major_axis) x 3 (minor_axis)
    Items axis: A to D
    Major_axis axis: 2000-01-03 00:00:00 to 2000-01-07 00:00:00
    Minor_axis axis: ItemA to ItemC

    In [215]: result.loc[:, :, 'ItemA']
    Out[215]:
                       A         B         C         D
    2000-01-03  0.859304  0.448509 -1.109374  0.397237
    2000-01-04 -1.053319 -1.063370  0.986639  1.152266
    2000-01-05  1.106511  1.143185 -0.093917 -0.583083
    2000-01-06  0.561619 -0.835608 -1.075936  0.194525
    2000-01-07 -0.339514  1.097901  0.747522 -1.147605