query()方法

DataFrame objects have a query() method that allows selection using an expression.

You can get the value of the frame where column b has values between the values of columns a and c. For example:

  1. In [202]: n = 10
  2. In [203]: df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))
  3. In [204]: df
  4. Out[204]:
  5. a b c
  6. 0 0.438921 0.118680 0.863670
  7. 1 0.138138 0.577363 0.686602
  8. 2 0.595307 0.564592 0.520630
  9. 3 0.913052 0.926075 0.616184
  10. 4 0.078718 0.854477 0.898725
  11. 5 0.076404 0.523211 0.591538
  12. 6 0.792342 0.216974 0.564056
  13. 7 0.397890 0.454131 0.915716
  14. 8 0.074315 0.437913 0.019794
  15. 9 0.559209 0.502065 0.026437
  16. # pure python
  17. In [205]: df[(df.a < df.b) & (df.b < df.c)]
  18. Out[205]:
  19. a b c
  20. 1 0.138138 0.577363 0.686602
  21. 4 0.078718 0.854477 0.898725
  22. 5 0.076404 0.523211 0.591538
  23. 7 0.397890 0.454131 0.915716
  24. # query
  25. In [206]: df.query('(a < b) & (b < c)')
  26. Out[206]:
  27. a b c
  28. 1 0.138138 0.577363 0.686602
  29. 4 0.078718 0.854477 0.898725
  30. 5 0.076404 0.523211 0.591538
  31. 7 0.397890 0.454131 0.915716

Do the same thing but fall back on a named index if there is no column with the name a.

  1. In [207]: df = pd.DataFrame(np.random.randint(n / 2, size=(n, 2)), columns=list('bc'))
  2. In [208]: df.index.name = 'a'
  3. In [209]: df
  4. Out[209]:
  5. b c
  6. a
  7. 0 0 4
  8. 1 0 1
  9. 2 3 4
  10. 3 4 3
  11. 4 1 4
  12. 5 0 3
  13. 6 0 1
  14. 7 3 4
  15. 8 2 3
  16. 9 1 1
  17. In [210]: df.query('a < b and b < c')
  18. Out[210]:
  19. b c
  20. a
  21. 2 3 4

If instead you don’t want to or cannot name your index, you can use the name index in your query expression:

  1. In [211]: df = pd.DataFrame(np.random.randint(n, size=(n, 2)), columns=list('bc'))
  2. In [212]: df
  3. Out[212]:
  4. b c
  5. 0 3 1
  6. 1 3 0
  7. 2 5 6
  8. 3 5 2
  9. 4 7 4
  10. 5 0 1
  11. 6 2 5
  12. 7 0 1
  13. 8 6 0
  14. 9 7 9
  15. In [213]: df.query('index < b < c')
  16. Out[213]:
  17. b c
  18. 2 5 6

Note: If the name of your index overlaps with a column name, the column name is given precedence. For example,

  1. In [214]: df = pd.DataFrame({'a': np.random.randint(5, size=5)})
  2. In [215]: df.index.name = 'a'
  3. In [216]: df.query('a > 2') # uses the column 'a', not the index
  4. Out[216]:
  5. a
  6. a
  7. 1 3
  8. 3 3

You can still use the index in a query expression by using the special identifier ‘index’:

  1. In [217]: df.query('index > 2')
  2. Out[217]:
  3. a
  4. a
  5. 3 3
  6. 4 2

If for some reason you have a column named index, then you can refer to the index as ilevel_0 as well, but at this point you should consider renaming your columns to something less ambiguous.

MultiIndex query() Syntax

You can also use the levels of a DataFrame with a MultiIndex as if they were columns in the frame:

  1. In [218]: n = 10
  2. In [219]: colors = np.random.choice(['red', 'green'], size=n)
  3. In [220]: foods = np.random.choice(['eggs', 'ham'], size=n)
  4. In [221]: colors
  5. Out[221]:
  6. array(['red', 'red', 'red', 'green', 'green', 'green', 'green', 'green',
  7. 'green', 'green'],
  8. dtype='<U5')
  9. In [222]: foods
  10. Out[222]:
  11. array(['ham', 'ham', 'eggs', 'eggs', 'eggs', 'ham', 'ham', 'eggs', 'eggs',
  12. 'eggs'],
  13. dtype='<U4')
  14. In [223]: index = pd.MultiIndex.from_arrays([colors, foods], names=['color', 'food'])
  15. In [224]: df = pd.DataFrame(np.random.randn(n, 2), index=index)
  16. In [225]: df
  17. Out[225]:
  18. 0 1
  19. color food
  20. red ham 0.194889 -0.381994
  21. ham 0.318587 2.089075
  22. eggs -0.728293 -0.090255
  23. green eggs -0.748199 1.318931
  24. eggs -2.029766 0.792652
  25. ham 0.461007 -0.542749
  26. ham -0.305384 -0.479195
  27. eggs 0.095031 -0.270099
  28. eggs -0.707140 -0.773882
  29. eggs 0.229453 0.304418
  30. In [226]: df.query('color == "red"')
  31. Out[226]:
  32. 0 1
  33. color food
  34. red ham 0.194889 -0.381994
  35. ham 0.318587 2.089075
  36. eggs -0.728293 -0.090255

If the levels of the MultiIndex are unnamed, you can refer to them using special names:

  1. In [227]: df.index.names = [None, None]
  2. In [228]: df
  3. Out[228]:
  4. 0 1
  5. red ham 0.194889 -0.381994
  6. ham 0.318587 2.089075
  7. eggs -0.728293 -0.090255
  8. green eggs -0.748199 1.318931
  9. eggs -2.029766 0.792652
  10. ham 0.461007 -0.542749
  11. ham -0.305384 -0.479195
  12. eggs 0.095031 -0.270099
  13. eggs -0.707140 -0.773882
  14. eggs 0.229453 0.304418
  15. In [229]: df.query('ilevel_0 == "red"')
  16. Out[229]:
  17. 0 1
  18. red ham 0.194889 -0.381994
  19. ham 0.318587 2.089075
  20. eggs -0.728293 -0.090255

The convention is ilevel_0, which means “index level 0” for the 0th level of the index.

query() Use Cases

A use case for query() is when you have a collection of DataFrame objects that have a subset of column names (or index levels/names) in common. You can pass the same query to both frames without having to specify which frame you’re interested in querying

  1. In [230]: df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))
  2. In [231]: df
  3. Out[231]:
  4. a b c
  5. 0 0.224283 0.736107 0.139168
  6. 1 0.302827 0.657803 0.713897
  7. 2 0.611185 0.136624 0.984960
  8. 3 0.195246 0.123436 0.627712
  9. 4 0.618673 0.371660 0.047902
  10. 5 0.480088 0.062993 0.185760
  11. 6 0.568018 0.483467 0.445289
  12. 7 0.309040 0.274580 0.587101
  13. 8 0.258993 0.477769 0.370255
  14. 9 0.550459 0.840870 0.304611
  15. In [232]: df2 = pd.DataFrame(np.random.rand(n + 2, 3), columns=df.columns)
  16. In [233]: df2
  17. Out[233]:
  18. a b c
  19. 0 0.357579 0.229800 0.596001
  20. 1 0.309059 0.957923 0.965663
  21. 2 0.123102 0.336914 0.318616
  22. 3 0.526506 0.323321 0.860813
  23. 4 0.518736 0.486514 0.384724
  24. 5 0.190804 0.505723 0.614533
  25. 6 0.891939 0.623977 0.676639
  26. 7 0.480559 0.378528 0.460858
  27. 8 0.420223 0.136404 0.141295
  28. 9 0.732206 0.419540 0.604675
  29. 10 0.604466 0.848974 0.896165
  30. 11 0.589168 0.920046 0.732716
  31. In [234]: expr = '0.0 <= a <= c <= 0.5'
  32. In [235]: map(lambda frame: frame.query(expr), [df, df2])
  33. Out[235]: <map at 0x7f20f7b679e8>

query() Python versus pandas Syntax Comparison

Full numpy-like syntax:

  1. In [236]: df = pd.DataFrame(np.random.randint(n, size=(n, 3)), columns=list('abc'))
  2. In [237]: df
  3. Out[237]:
  4. a b c
  5. 0 7 8 9
  6. 1 1 0 7
  7. 2 2 7 2
  8. 3 6 2 2
  9. 4 2 6 3
  10. 5 3 8 2
  11. 6 1 7 2
  12. 7 5 1 5
  13. 8 9 8 0
  14. 9 1 5 0
  15. In [238]: df.query('(a < b) & (b < c)')
  16. Out[238]:
  17. a b c
  18. 0 7 8 9
  19. In [239]: df[(df.a < df.b) & (df.b < df.c)]
  20. Out[239]:
  21. a b c
  22. 0 7 8 9

Slightly nicer by removing the parentheses (by binding making comparison operators bind tighter than & and |).

  1. In [240]: df.query('a < b & b < c')
  2. Out[240]:
  3. a b c
  4. 0 7 8 9

Use English instead of symbols:

  1. In [241]: df.query('a < b and b < c')
  2. Out[241]:
  3. a b c
  4. 0 7 8 9

Pretty close to how you might write it on paper:

  1. In [242]: df.query('a < b < c')
  2. Out[242]:
  3. a b c
  4. 0 7 8 9

The in and not in operators

query() also supports special use of Python’s in and not in comparison operators, providing a succinct syntax for calling the isin method of a Series or DataFrame.

  1. # get all rows where columns "a" and "b" have overlapping values
  2. In [243]: df = pd.DataFrame({'a': list('aabbccddeeff'), 'b': list('aaaabbbbcccc'),
  3. .....: 'c': np.random.randint(5, size=12),
  4. .....: 'd': np.random.randint(9, size=12)})
  5. .....:
  6. In [244]: df
  7. Out[244]:
  8. a b c d
  9. 0 a a 2 6
  10. 1 a a 4 7
  11. 2 b a 1 6
  12. 3 b a 2 1
  13. 4 c b 3 6
  14. 5 c b 0 2
  15. 6 d b 3 3
  16. 7 d b 2 1
  17. 8 e c 4 3
  18. 9 e c 2 0
  19. 10 f c 0 6
  20. 11 f c 1 2
  21. In [245]: df.query('a in b')
  22. Out[245]:
  23. a b c d
  24. 0 a a 2 6
  25. 1 a a 4 7
  26. 2 b a 1 6
  27. 3 b a 2 1
  28. 4 c b 3 6
  29. 5 c b 0 2
  30. # How you'd do it in pure Python
  31. In [246]: df[df.a.isin(df.b)]
  32. Out[246]:
  33. a b c d
  34. 0 a a 2 6
  35. 1 a a 4 7
  36. 2 b a 1 6
  37. 3 b a 2 1
  38. 4 c b 3 6
  39. 5 c b 0 2
  40. In [247]: df.query('a not in b')
  41. Out[247]:
  42. a b c d
  43. 6 d b 3 3
  44. 7 d b 2 1
  45. 8 e c 4 3
  46. 9 e c 2 0
  47. 10 f c 0 6
  48. 11 f c 1 2
  49. # pure Python
  50. In [248]: df[~df.a.isin(df.b)]
  51. Out[248]:
  52. a b c d
  53. 6 d b 3 3
  54. 7 d b 2 1
  55. 8 e c 4 3
  56. 9 e c 2 0
  57. 10 f c 0 6
  58. 11 f c 1 2

You can combine this with other expressions for very succinct queries:

  1. # rows where cols a and b have overlapping values and col c's values are less than col d's
  2. In [249]: df.query('a in b and c < d')
  3. Out[249]:
  4. a b c d
  5. 0 a a 2 6
  6. 1 a a 4 7
  7. 2 b a 1 6
  8. 4 c b 3 6
  9. 5 c b 0 2
  10. # pure Python
  11. In [250]: df[df.b.isin(df.a) & (df.c < df.d)]
  12. Out[250]:
  13. a b c d
  14. 0 a a 2 6
  15. 1 a a 4 7
  16. 2 b a 1 6
  17. 4 c b 3 6
  18. 5 c b 0 2
  19. 10 f c 0 6
  20. 11 f c 1 2

Note: Note that in and not in are evaluated in Python, since numexpr has no equivalent of this operation. However, only the in/not in expression itself is evaluated in vanilla Python. For example, in the expression

  1. df.query('a in b + c + d')

(b + c + d) is evaluated by numexpr and then the in operation is evaluated in plain Python. In general, any operations that can be evaluated using numexpr will be.

Special use of the == operator with list objects

Comparing a list of values to a column using ==/!= works similarly to in/not in.

  1. In [251]: df.query('b == ["a", "b", "c"]')
  2. Out[251]:
  3. a b c d
  4. 0 a a 2 6
  5. 1 a a 4 7
  6. 2 b a 1 6
  7. 3 b a 2 1
  8. 4 c b 3 6
  9. 5 c b 0 2
  10. 6 d b 3 3
  11. 7 d b 2 1
  12. 8 e c 4 3
  13. 9 e c 2 0
  14. 10 f c 0 6
  15. 11 f c 1 2
  16. # pure Python
  17. In [252]: df[df.b.isin(["a", "b", "c"])]
  18. Out[252]:
  19. a b c d
  20. 0 a a 2 6
  21. 1 a a 4 7
  22. 2 b a 1 6
  23. 3 b a 2 1
  24. 4 c b 3 6
  25. 5 c b 0 2
  26. 6 d b 3 3
  27. 7 d b 2 1
  28. 8 e c 4 3
  29. 9 e c 2 0
  30. 10 f c 0 6
  31. 11 f c 1 2
  32. In [253]: df.query('c == [1, 2]')
  33. Out[253]:
  34. a b c d
  35. 0 a a 2 6
  36. 2 b a 1 6
  37. 3 b a 2 1
  38. 7 d b 2 1
  39. 9 e c 2 0
  40. 11 f c 1 2
  41. In [254]: df.query('c != [1, 2]')
  42. Out[254]:
  43. a b c d
  44. 1 a a 4 7
  45. 4 c b 3 6
  46. 5 c b 0 2
  47. 6 d b 3 3
  48. 8 e c 4 3
  49. 10 f c 0 6
  50. # using in/not in
  51. In [255]: df.query('[1, 2] in c')
  52. Out[255]:
  53. a b c d
  54. 0 a a 2 6
  55. 2 b a 1 6
  56. 3 b a 2 1
  57. 7 d b 2 1
  58. 9 e c 2 0
  59. 11 f c 1 2
  60. In [256]: df.query('[1, 2] not in c')
  61. Out[256]:
  62. a b c d
  63. 1 a a 4 7
  64. 4 c b 3 6
  65. 5 c b 0 2
  66. 6 d b 3 3
  67. 8 e c 4 3
  68. 10 f c 0 6
  69. # pure Python
  70. In [257]: df[df.c.isin([1, 2])]
  71. Out[257]:
  72. a b c d
  73. 0 a a 2 6
  74. 2 b a 1 6
  75. 3 b a 2 1
  76. 7 d b 2 1
  77. 9 e c 2 0
  78. 11 f c 1 2

Boolean Operators

You can negate boolean expressions with the word not or the ~ operator.

  1. In [258]: df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))
  2. In [259]: df['bools'] = np.random.rand(len(df)) > 0.5
  3. In [260]: df.query('~bools')
  4. Out[260]:
  5. a b c bools
  6. 2 0.697753 0.212799 0.329209 False
  7. 7 0.275396 0.691034 0.826619 False
  8. 8 0.190649 0.558748 0.262467 False
  9. In [261]: df.query('not bools')
  10. Out[261]:
  11. a b c bools
  12. 2 0.697753 0.212799 0.329209 False
  13. 7 0.275396 0.691034 0.826619 False
  14. 8 0.190649 0.558748 0.262467 False
  15. In [262]: df.query('not bools') == df[~df.bools]
  16. Out[262]:
  17. a b c bools
  18. 2 True True True True
  19. 7 True True True True
  20. 8 True True True True

Of course, expressions can be arbitrarily complex too:

  1. # short query syntax
  2. In [263]: shorter = df.query('a < b < c and (not bools) or bools > 2')
  3. # equivalent in pure Python
  4. In [264]: longer = df[(df.a < df.b) & (df.b < df.c) & (~df.bools) | (df.bools > 2)]
  5. In [265]: shorter
  6. Out[265]:
  7. a b c bools
  8. 7 0.275396 0.691034 0.826619 False
  9. In [266]: longer
  10. Out[266]:
  11. a b c bools
  12. 7 0.275396 0.691034 0.826619 False
  13. In [267]: shorter == longer
  14. Out[267]:
  15. a b c bools
  16. 7 True True True True

Performance of query()

DataFrame.query() using numexpr is slightly faster than Python for large frames.

pandas速度对比

Note: You will only see the performance benefits of using the numexpr engine with DataFrame.query() if your frame has more than approximately 200,000 rows.

pandas速度对比2

This plot was created using a DataFrame with 3 columns each containing floating point values generated using numpy.random.randn().