Grouping

The grouping docs.

Basic grouping with apply

Unlike agg, apply’s callable is passed a sub-DataFrame, which gives you access to all the columns.

   In [83]: df = pd.DataFrame({'animal': 'cat dog cat fish dog cat cat'.split(),
      ....:                    'size': list('SSMMMLL'),
      ....:                    'weight': [8, 10, 11, 1, 20, 12, 12],
      ....:                    'adult': [False] * 5 + [True] * 2}); df
      ....:
   Out[83]:
     animal size  weight  adult
   0    cat    S       8  False
   1    dog    S      10  False
   2    cat    M      11  False
   3   fish    M       1  False
   4    dog    M      20  False
   5    cat    L      12   True
   6    cat    L      12   True

   # List the size of the animals with the highest weight.
   In [84]: df.groupby('animal').apply(lambda subf: subf['size'][subf['weight'].idxmax()])
   Out[84]:
   animal
   cat     L
   dog     M
   fish    M
   dtype: object
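
For contrast, a minimal sketch (an addition, not part of the original recipe) of the same groupby with agg: its callable receives one column at a time as a Series, so it cannot look across columns the way apply can.

   # agg's lambda sees only the 'weight' Series for each group of animals;
   # there is no way to reach the 'size' column from inside it.
   df.groupby('animal')['weight'].agg(lambda s: s.max())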

Using get_group

   In [85]: gb = df.groupby(['animal'])

   In [86]: gb.get_group('cat')
   Out[86]:
     animal size  weight  adult
   0    cat    S       8  False
   2    cat    M      11  False
   5    cat    L      12   True
   6    cat    L      12   True
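
When grouping by several keys, get_group takes a tuple; a small sketch (an assumed example, not from the original):

   # Select the sub-frame for one (animal, size) combination.
   df.groupby(['animal', 'size']).get_group(('cat', 'L'))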

Apply to different items in a group

   In [87]: def GrowUp(x):
      ....:     avg_weight = sum(x[x['size'] == 'S'].weight * 1.5)
      ....:     avg_weight += sum(x[x['size'] == 'M'].weight * 1.25)
      ....:     avg_weight += sum(x[x['size'] == 'L'].weight)
      ....:     avg_weight /= len(x)
      ....:     return pd.Series(['L', avg_weight, True],
      ....:                      index=['size', 'weight', 'adult'])
      ....:

   In [88]: expected_df = gb.apply(GrowUp)

   In [89]: expected_df
   Out[89]:
          size   weight  adult
   animal
   cat       L  12.4375   True
   dog       L  20.0000   True
   fish      L   1.2500   True

Expanding Apply

   In [90]: S = pd.Series([i / 100.0 for i in range(1, 11)])

   # Red uses functools.reduce, so `import functools` is needed beforehand.
   In [91]: def CumRet(x, y):
      ....:     return x * (1 + y)
      ....:

   In [92]: def Red(x):
      ....:     return functools.reduce(CumRet, x, 1.0)
      ....:

   In [93]: S.expanding().apply(Red, raw=True)
   Out[93]:
   0    1.010000
   1    1.030200
   2    1.061106
   3    1.103550
   4    1.158728
   5    1.228251
   6    1.314229
   7    1.419367
   8    1.547110
   9    1.701821
   dtype: float64

Replacing some values with mean of the rest of a group

   In [94]: df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, -1, 1, 2]})

   In [95]: gb = df.groupby('A')

   In [96]: def replace(g):
      ....:     mask = g < 0
      ....:     g.loc[mask] = g[~mask].mean()
      ....:     return g
      ....:

   In [97]: gb.transform(replace)
   Out[97]:
        B
   0  1.0
   1  1.0
   2  1.0
   3  2.0
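
The same substitution can be sketched without a custom function (this variant is an assumption, not part of the original): mask the negatives out before taking the group means, then use where to fill them back in.

   # NaN out the negatives so they don't contaminate the group means,
   # then substitute each group's clean mean where the value was negative.
   means = df['B'].mask(df['B'] < 0).groupby(df['A']).transform('mean')
   df['B'].where(df['B'] >= 0, means)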

Sort groups by aggregated data

   In [98]: df = pd.DataFrame({'code': ['foo', 'bar', 'baz'] * 2,
      ....:                    'data': [0.16, -0.21, 0.33, 0.45, -0.59, 0.62],
      ....:                    'flag': [False, True] * 3})
      ....:

   In [99]: code_groups = df.groupby('code')

   In [100]: agg_n_sort_order = code_groups[['data']].transform(sum).sort_values(by='data')

   In [101]: sorted_df = df.loc[agg_n_sort_order.index]

   In [102]: sorted_df
   Out[102]:
     code  data   flag
   1  bar -0.21   True
   4  bar -0.59  False
   0  foo  0.16  False
   3  foo  0.45   True
   2  baz  0.33  False
   5  baz  0.62   True
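
Condensed into a single expression (a sketch, with a stable sort requested so rows keep their original order within each group):

   # Rows ordered by their group's total 'data'; mergesort is stable, so
   # ties within a group preserve the original row order.
   df.loc[df.groupby('code')['data'].transform('sum')
            .sort_values(kind='mergesort').index]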

Create multiple aggregated columns

   In [103]: rng = pd.date_range(start="2014-10-07", periods=10, freq='2min')

   In [104]: ts = pd.Series(data=list(range(10)), index=rng)

   In [105]: def MyCust(x):
      .....:     if len(x) > 2:
      .....:         return x[1] * 1.234
      .....:     return pd.NaT
      .....:

   In [106]: mhc = {'Mean': np.mean, 'Max': np.max, 'Custom': MyCust}

   In [107]: ts.resample("5min").apply(mhc)
   Out[107]:
   Custom  2014-10-07 00:00:00    1.234
           2014-10-07 00:05:00      NaT
           2014-10-07 00:10:00    7.404
           2014-10-07 00:15:00      NaT
   Max     2014-10-07 00:00:00        2
           2014-10-07 00:05:00        4
           2014-10-07 00:10:00        7
           2014-10-07 00:15:00        9
   Mean    2014-10-07 00:00:00        1
           2014-10-07 00:05:00      3.5
           2014-10-07 00:10:00        6
           2014-10-07 00:15:00      8.5
   dtype: object

   In [108]: ts
   Out[108]:
   2014-10-07 00:00:00    0
   2014-10-07 00:02:00    1
   2014-10-07 00:04:00    2
   2014-10-07 00:06:00    3
   2014-10-07 00:08:00    4
   2014-10-07 00:10:00    5
   2014-10-07 00:12:00    6
   2014-10-07 00:14:00    7
   2014-10-07 00:16:00    8
   2014-10-07 00:18:00    9
   Freq: 2T, dtype: int64
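
If the custom column is not needed, passing a list of aggregations gives one output column per function (a minimal sketch, not from the original):

   # One column per aggregation instead of a stacked Series.
   ts.resample('5min').agg(['mean', 'max'])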

Create a value counts column and reassign back to the DataFrame

   In [109]: df = pd.DataFrame({'Color': 'Red Red Red Blue'.split(),
      .....:                    'Value': [100, 150, 50, 50]}); df
      .....:
   Out[109]:
     Color  Value
   0   Red    100
   1   Red    150
   2   Red     50
   3  Blue     50

   In [110]: df['Counts'] = df.groupby(['Color']).transform(len)

   In [111]: df
   Out[111]:
     Color  Value  Counts
   0   Red    100       3
   1   Red    150       3
   2   Red     50       3
   3  Blue     50       1
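
A narrower variant (assumed equivalent here, where no values are missing) counts a single column instead of transforming every column:

   # Count of rows per Color, aligned back to the original rows.
   df['Counts'] = df.groupby('Color')['Value'].transform('count')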

Shift groups of the values in a column based on the index

   In [112]: df = pd.DataFrame(
      .....:     {u'line_race': [10, 10, 8, 10, 10, 8],
      .....:      u'beyer': [99, 102, 103, 103, 88, 100]},
      .....:     index=[u'Last Gunfighter', u'Last Gunfighter', u'Last Gunfighter',
      .....:            u'Paynter', u'Paynter', u'Paynter']); df
      .....:
   Out[112]:
                    line_race  beyer
   Last Gunfighter         10     99
   Last Gunfighter         10    102
   Last Gunfighter          8    103
   Paynter                 10    103
   Paynter                 10     88
   Paynter                  8    100

   In [113]: df['beyer_shifted'] = df.groupby(level=0)['beyer'].shift(1)

   In [114]: df
   Out[114]:
                    line_race  beyer  beyer_shifted
   Last Gunfighter         10     99            NaN
   Last Gunfighter         10    102           99.0
   Last Gunfighter          8    103          102.0
   Paynter                 10    103            NaN
   Paynter                 10     88          103.0
   Paynter                  8    100           88.0
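
Shifting with a negative period gives a lead instead of a lag (a small sketch, not part of the original):

   # Each row gets the *next* beyer figure within its own group.
   df.groupby(level=0)['beyer'].shift(-1)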

Select row with maximum value from each group

   In [115]: df = pd.DataFrame({'host': ['other', 'other', 'that', 'this', 'this'],
      .....:                    'service': ['mail', 'web', 'mail', 'mail', 'web'],
      .....:                    'no': [1, 2, 1, 2, 1]}).set_index(['host', 'service'])
      .....:

   In [116]: mask = df.groupby(level=0).agg('idxmax')

   In [117]: df_count = df.loc[mask['no']].reset_index()

   In [118]: df_count
   Out[118]:
       host service  no
   0  other     web   2
   1   that    mail   1
   2   this    mail   2
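
A sort-based variant on the flattened frame (an assumed alternative, not from the original; ties may pick a different row):

   # After an ascending sort, the last row of each 'host' group holds
   # that group's maximum 'no'.
   flat = df.reset_index()
   flat.sort_values('no').groupby('host').tail(1)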

Grouping like Python’s itertools.groupby

   In [119]: df = pd.DataFrame([0, 1, 0, 1, 1, 1, 0, 1, 1], columns=['A'])

   In [120]: df.A.groupby((df.A != df.A.shift()).cumsum()).groups
   Out[120]:
   {1: Int64Index([0], dtype='int64'),
    2: Int64Index([1], dtype='int64'),
    3: Int64Index([2], dtype='int64'),
    4: Int64Index([3, 4, 5], dtype='int64'),
    5: Int64Index([6], dtype='int64'),
    6: Int64Index([7, 8], dtype='int64')}

   In [121]: df.A.groupby((df.A != df.A.shift()).cumsum()).cumsum()
   Out[121]:
   0    0
   1    1
   2    0
   3    1
   4    2
   5    3
   6    0
   7    1
   8    2
   Name: A, dtype: int64
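
For comparison, the standard-library itertools.groupby applies the same "new group on change of value" idea to a plain list (this snippet is an addition):

   import itertools

   # Consecutive runs of equal values become separate groups.
   [list(g) for _, g in itertools.groupby([0, 1, 0, 1, 1, 1, 0, 1, 1])]
   # [[0], [1], [0], [1, 1, 1], [0], [1, 1]]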

Expanding Data

Alignment and to-date

Rolling Computation window based on values instead of counts

Rolling Mean by Time Interval
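
The entries above are pointers to external examples. As a minimal sketch of the last idea (the data and the '2D' window below are assumptions), pandas can size a rolling window by a time offset rather than an observation count:

   # With a DatetimeIndex, the window string is an offset: each window
   # covers the trailing 2 days, however many observations that holds.
   s = pd.Series(range(5), index=pd.date_range('2014-01-01', periods=5, freq='12H'))
   s.rolling('2D').mean()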

Splitting

Splitting a frame

Create a list of DataFrames, split using a delineation based on logic included in the rows.

   In [122]: df = pd.DataFrame(data={'Case': ['A', 'A', 'A', 'B', 'A', 'A', 'B', 'A', 'A'],
      .....:                         'Data': np.random.randn(9)})
      .....:

   In [123]: dfs = list(zip(*df.groupby((1 * (df['Case'] == 'B')).cumsum()
      .....:                            .rolling(window=3, min_periods=1).median())))[-1]
      .....:

   In [124]: dfs[0]
   Out[124]:
     Case      Data
   0    A  0.174068
   1    A -0.439461
   2    A -0.741343
   3    B -0.079673

   In [125]: dfs[1]
   Out[125]:
     Case      Data
   4    A -0.922875
   5    A  0.303638
   6    B -0.917368

   In [126]: dfs[2]
   Out[126]:
     Case      Data
   7    A -1.624062
   8    A -0.758514
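
The rolling-median trick above can be sketched more directly with a shifted cumulative sum (an assumed equivalent for this data): each 'B' row closes its chunk, and the next row starts a new one.

   # Group key increments on the row *after* each 'B', so the 'B' row
   # stays at the end of its chunk; produces the same three frames here.
   key = (df['Case'] == 'B').cumsum().shift(1).fillna(0)
   dfs = [g for _, g in df.groupby(key)]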

Pivot

The Pivot docs.

Partial sums and subtotals

   In [127]: df = pd.DataFrame(data={'Province': ['ON', 'QC', 'BC', 'AL', 'AL', 'MN', 'ON'],
      .....:                         'City': ['Toronto', 'Montreal', 'Vancouver', 'Calgary',
      .....:                                  'Edmonton', 'Winnipeg', 'Windsor'],
      .....:                         'Sales': [13, 6, 16, 8, 4, 3, 1]})
      .....:

   In [128]: table = pd.pivot_table(df, values=['Sales'], index=['Province'],
      .....:                        columns=['City'], aggfunc=np.sum, margins=True)
      .....:

   In [129]: table.stack('City')
   Out[129]:
                       Sales
   Province City
   AL       All         12.0
            Calgary      8.0
            Edmonton     4.0
   BC       All         16.0
            Vancouver   16.0
   MN       All          3.0
            Winnipeg     3.0
   ...                   ...
   All      Calgary      8.0
            Edmonton     4.0
            Montreal     6.0
            Toronto     13.0
            Vancouver   16.0
            Windsor      1.0
            Winnipeg     3.0

   [20 rows x 1 columns]

Frequency table like plyr in R

   In [130]: grades = [48, 99, 75, 80, 42, 80, 72, 68, 36, 78]

   In [131]: df = pd.DataFrame({'ID': ["x%d" % r for r in range(10)],
      .....:                    'Gender': ['F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'M'],
      .....:                    'ExamYear': ['2007', '2007', '2007', '2008', '2008',
      .....:                                 '2008', '2008', '2009', '2009', '2009'],
      .....:                    'Class': ['algebra', 'stats', 'bio', 'algebra', 'algebra',
      .....:                              'stats', 'stats', 'algebra', 'bio', 'bio'],
      .....:                    'Participated': ['yes', 'yes', 'yes', 'yes', 'no',
      .....:                                     'yes', 'yes', 'yes', 'yes', 'yes'],
      .....:                    'Passed': ['yes' if x > 50 else 'no' for x in grades],
      .....:                    'Employed': [True, True, True, False, False,
      .....:                                 False, False, True, True, False],
      .....:                    'Grade': grades})
      .....:

   In [132]: df.groupby('ExamYear').agg({'Participated': lambda x: x.value_counts()['yes'],
      .....:                             'Passed': lambda x: sum(x == 'yes'),
      .....:                             'Employed': lambda x: sum(x),
      .....:                             'Grade': lambda x: sum(x) / len(x)})
      .....:
   Out[132]:
             Participated  Passed  Employed      Grade
   ExamYear
   2007                 3       2         3  74.000000
   2008                 3       3         0  68.500000
   2009                 3       2         2  60.666667
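
For plain counts without custom lambdas, pd.crosstab builds the frequency table directly (a sketch, not part of the original recipe):

   # Counts of Passed ('no'/'yes') per ExamYear.
   pd.crosstab(df['ExamYear'], df['Passed'])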

Plot pandas DataFrame with year over year data

To create a year-by-month cross-tabulation:

   In [133]: df = pd.DataFrame({'value': np.random.randn(36)},
      .....:                   index=pd.date_range('2011-01-01', freq='M', periods=36))
      .....:

   In [134]: pd.pivot_table(df, index=df.index.month, columns=df.index.year,
      .....:                values='value', aggfunc='sum')
      .....:
   Out[134]:
            2011      2012      2013
   1  -0.560859  0.120930  0.516870
   2  -0.589005 -0.210518  0.343125
   3  -1.070678 -0.931184  2.137827
   4  -1.681101  0.240647  0.452429
   5   0.403776 -0.027462  0.483103
   6   0.609862  0.033113  0.061495
   7   0.387936 -0.658418  0.240767
   8   1.815066  0.324102  0.782413
   9   0.705200 -1.403048  0.628462
   10 -0.668049 -0.581967 -0.880627
   11  0.242501 -1.233862  0.777575
   12  0.313421 -3.520876 -0.779367
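
With the months as the index and one column per year, plotting is then a one-liner (assuming matplotlib is installed):

   # Each year becomes its own line, with month on the x-axis.
   pd.pivot_table(df, index=df.index.month, columns=df.index.year,
                  values='value', aggfunc='sum').plot()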

Apply

Rolling Apply to Organize - Turning embedded lists into a multi-index frame

   In [135]: df = pd.DataFrame(data={'A': [[2, 4, 8, 16], [100, 200], [10, 20, 30]],
      .....:                         'B': [['a', 'b', 'c'], ['jj', 'kk'], ['ccc']]},
      .....:                   index=['I', 'II', 'III'])
      .....:

   In [136]: def SeriesFromSubList(aList):
      .....:     return pd.Series(aList)
      .....:

   In [137]: df_orgz = pd.concat(dict([(ind, row.apply(SeriesFromSubList))
      .....:                           for ind, row in df.iterrows()]))
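
The result (not shown in the original) carries a two-level row index of (frame row label, original column name), with one column per list position and NaN padding for the shorter lists; a quick check:

   # 3 rows x 2 columns of lists -> 6 MultiIndex rows; the longest list
   # has 4 elements -> 4 columns.
   df_orgz.shape   # (6, 4)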

Rolling Apply with a DataFrame returning a Series

Rolling Apply to multiple columns where function calculates a Series before a Scalar from the Series is returned

   In [138]: df = pd.DataFrame(data=np.random.randn(2000, 2) / 10000,
      .....:                   index=pd.date_range('2001-01-01', periods=2000),
      .....:                   columns=['A', 'B']); df
      .....:
   Out[138]:
                      A         B
   2001-01-01  0.000032 -0.000004
   2001-01-02 -0.000001  0.000207
   2001-01-03  0.000120 -0.000220
   2001-01-04 -0.000083 -0.000165
   2001-01-05 -0.000047  0.000156
   2001-01-06  0.000027  0.000104
   2001-01-07  0.000041 -0.000101
   ...              ...       ...
   2006-06-17 -0.000034  0.000034
   2006-06-18  0.000002  0.000166
   2006-06-19  0.000023 -0.000081
   2006-06-20 -0.000061  0.000012
   2006-06-21 -0.000111  0.000027
   2006-06-22 -0.000061 -0.000009
   2006-06-23  0.000074 -0.000138

   [2000 rows x 2 columns]

   In [139]: def gm(aDF, Const):
      .....:     v = ((((aDF.A + aDF.B) + 1).cumprod()) - 1) * Const
      .....:     return (aDF.index[0], v.iloc[-1])
      .....:

   In [140]: S = pd.Series(dict([gm(df.iloc[i:min(i + 51, len(df) - 1)], 5)
      .....:                     for i in range(len(df) - 50)])); S
      .....:
   Out[140]:
   2001-01-01   -0.001373
   2001-01-02   -0.001705
   2001-01-03   -0.002885
   2001-01-04   -0.002987
   2001-01-05   -0.002384
   2001-01-06   -0.004700
   2001-01-07   -0.005500
                   ...
   2006-04-28   -0.002682
   2006-04-29   -0.002436
   2006-04-30   -0.002602
   2006-05-01   -0.001785
   2006-05-02   -0.001799
   2006-05-03   -0.000605
   2006-05-04   -0.000541
   Length: 1950, dtype: float64

Rolling apply with a DataFrame returning a Scalar

Rolling Apply to multiple columns where function returns a Scalar (Volume Weighted Average Price)

   In [141]: rng = pd.date_range(start='2014-01-01', periods=100)

   In [142]: df = pd.DataFrame({'Open': np.random.randn(len(rng)),
      .....:                    'Close': np.random.randn(len(rng)),
      .....:                    'Volume': np.random.randint(100, 2000, len(rng))},
      .....:                   index=rng); df
      .....:
   Out[142]:
                   Open     Close  Volume
   2014-01-01  0.011174 -0.653039    1581
   2014-01-02  0.214258  1.314205    1707
   2014-01-03 -1.046922 -0.341915    1768
   2014-01-04 -0.752902 -1.303586     836
   2014-01-05 -0.410793  0.396288     694
   2014-01-06  0.648401 -0.548006     796
   2014-01-07  0.737320  0.481380     265
   ...              ...       ...     ...
   2014-04-04  0.120378 -2.548128     564
   2014-04-05  0.231661  0.223346    1908
   2014-04-06  0.952664  1.228841    1090
   2014-04-07 -0.176090  0.552784    1813
   2014-04-08  1.781318 -0.795389    1103
   2014-04-09 -0.753493 -0.018815    1456
   2014-04-10 -1.047997  1.138197    1193

   [100 rows x 3 columns]

   In [143]: def vwap(bars):
      .....:     return (bars.Close * bars.Volume).sum() / bars.Volume.sum()
      .....:

   In [144]: window = 5

   In [145]: s = pd.concat([pd.Series(vwap(df.iloc[i:i + window]),
      .....:                          index=[df.index[i + window]])
      .....:                for i in range(len(df) - window)])
      .....:

   In [146]: s.round(2)
   Out[146]:
   2014-01-06   -0.03
   2014-01-07    0.07
   2014-01-08   -0.40
   2014-01-09   -0.81
   2014-01-10   -0.63
   2014-01-11   -0.86
   2014-01-12   -0.36
                 ...
   2014-04-04   -1.27
   2014-04-05   -1.36
   2014-04-06   -0.73
   2014-04-07    0.04
   2014-04-08    0.21
   2014-04-09    0.07
   2014-04-10    0.25
   Length: 95, dtype: float64
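
Because VWAP decomposes into two rolling sums, a vectorized sketch avoids the Python-level loop entirely (note it labels each value at the window's last row, whereas the loop above labels it one row later):

   # Rolling dollar volume divided by rolling volume.
   vwap_s = ((df.Close * df.Volume).rolling(window).sum()
             / df.Volume.rolling(window).sum())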