DataFrame

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input:

  • Dict of 1D ndarrays, lists, dicts, or Series
  • 2-D numpy.ndarray
  • Structured or record ndarray
  • A Series
  • Another DataFrame

Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. If you pass an index and / or columns, you are guaranteeing the index and / or columns of the resulting DataFrame. Thus, a dict of Series plus a specific index will discard all data not matching up to the passed index.

If axis labels are not passed, they will be constructed from the input data based on common sense rules.

Note: When the data is a dict, and columns is not specified, the DataFrame columns will be ordered by the dict’s insertion order, if you are using Python version >= 3.6 and Pandas >= 0.23. If you are using Python < 3.6 or Pandas < 0.23, and columns is not specified, the DataFrame columns will be the lexically ordered list of dict keys.

From dict of Series or dicts

The resulting index will be the union of the indexes of the various Series. If there are any nested dicts, these will first be converted to Series. If no columns are passed, the columns will be the ordered list of dict keys.

  In [34]: d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     ....:      'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

  In [35]: df = pd.DataFrame(d)

  In [36]: df
  Out[36]:
     one  two
  a  1.0  1.0
  b  2.0  2.0
  c  3.0  3.0
  d  NaN  4.0

  In [37]: pd.DataFrame(d, index=['d', 'b', 'a'])
  Out[37]:
     one  two
  d  NaN  4.0
  b  2.0  2.0
  a  1.0  1.0

  In [38]: pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
  Out[38]:
     two three
  d  4.0   NaN
  b  2.0   NaN
  a  1.0   NaN

The row and column labels can be accessed respectively by accessing the index and columns attributes:

Note: When a particular set of columns is passed along with a dict of data, the passed columns override the keys in the dict.

  In [39]: df.index
  Out[39]: Index(['a', 'b', 'c', 'd'], dtype='object')

  In [40]: df.columns
  Out[40]: Index(['one', 'two'], dtype='object')

From dict of ndarrays / lists

The ndarrays must all be the same length. If an index is passed, it must clearly also be the same length as the arrays. If no index is passed, the result will be range(n), where n is the array length.

  In [41]: d = {'one': [1., 2., 3., 4.],
     ....:      'two': [4., 3., 2., 1.]}

  In [42]: pd.DataFrame(d)
  Out[42]:
     one  two
  0  1.0  4.0
  1  2.0  3.0
  2  3.0  2.0
  3  4.0  1.0

  In [43]: pd.DataFrame(d, index=['a', 'b', 'c', 'd'])
  Out[43]:
     one  two
  a  1.0  4.0
  b  2.0  3.0
  c  3.0  2.0
  d  4.0  1.0

From structured or record array

This case is handled identically to a dict of arrays.

  In [44]: data = np.zeros((2,), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'a10')])

  In [45]: data[:] = [(1, 2., 'Hello'), (2, 3., "World")]

  In [46]: pd.DataFrame(data)
  Out[46]:
     A    B         C
  0  1  2.0  b'Hello'
  1  2  3.0  b'World'

  In [47]: pd.DataFrame(data, index=['first', 'second'])
  Out[47]:
          A    B         C
  first   1  2.0  b'Hello'
  second  2  3.0  b'World'

  In [48]: pd.DataFrame(data, columns=['C', 'A', 'B'])
  Out[48]:
            C  A    B
  0  b'Hello'  1  2.0
  1  b'World'  2  3.0

Note: DataFrame is not intended to work exactly like a 2-dimensional NumPy ndarray.

From a list of dicts

  In [49]: data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

  In [50]: pd.DataFrame(data2)
  Out[50]:
     a   b     c
  0  1   2   NaN
  1  5  10  20.0

  In [51]: pd.DataFrame(data2, index=['first', 'second'])
  Out[51]:
          a   b     c
  first   1   2   NaN
  second  5  10  20.0

  In [52]: pd.DataFrame(data2, columns=['a', 'b'])
  Out[52]:
     a   b
  0  1   2
  1  5  10

From a dict of tuples

You can automatically create a MultiIndexed frame by passing a dict whose keys are tuples.

  In [53]: pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
     ....:               ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
     ....:               ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
     ....:               ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
     ....:               ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})
  Out[53]:
         a              b
         b    a    c    a     b
  A B  1.0  4.0  5.0  8.0  10.0
    C  2.0  3.0  6.0  7.0   NaN
    D  NaN  NaN  NaN  NaN   9.0

From a Series

The result will be a DataFrame with the same index as the input Series, and with one column whose name is the original name of the Series (only if no other column name is provided).
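
A minimal sketch of this rule (the Series name 'ser' is illustrative):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'], name='ser')
df = pd.DataFrame(s)          # index carried over; column named after the Series
print(df.columns.tolist())    # ['ser']
print(df.index.tolist())      # ['a', 'b', 'c']
```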

Missing Data

Much more will be said on this topic in the Missing data section. To construct a DataFrame with missing data, we use np.nan to represent missing values. Alternatively, you may pass a numpy.MaskedArray as the data argument to the DataFrame constructor, and its masked entries will be considered missing.
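
A small sketch of both options (the column names here are illustrative):

```python
import numpy as np
import pandas as pd

# Explicit np.nan entries become missing values
df = pd.DataFrame({'x': [1.0, np.nan, 3.0]})
print(df['x'].isna().tolist())        # [False, True, False]

# Masked entries of a numpy.MaskedArray are treated as missing too
arr = np.ma.masked_array([[1.0, 2.0], [3.0, 4.0]],
                         mask=[[False, True], [False, False]])
df2 = pd.DataFrame(arr, columns=['a', 'b'])
print(df2['b'].isna().tolist())       # [True, False]
```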

Alternate Constructors

DataFrame.from_dict

DataFrame.from_dict takes a dict of dicts or a dict of array-like sequences and returns a DataFrame. It operates like the DataFrame constructor except for the orient parameter which is 'columns' by default, but which can be set to 'index' in order to use the dict keys as row labels.

  In [54]: pd.DataFrame.from_dict(dict([('A', [1, 2, 3]), ('B', [4, 5, 6])]))
  Out[54]:
     A  B
  0  1  4
  1  2  5
  2  3  6

If you pass orient='index', the keys will be the row labels. In this case, you can also pass the desired column names:

  In [55]: pd.DataFrame.from_dict(dict([('A', [1, 2, 3]), ('B', [4, 5, 6])]),
     ....:                        orient='index', columns=['one', 'two', 'three'])
  Out[55]:
     one  two  three
  A    1    2      3
  B    4    5      6

DataFrame.from_records

DataFrame.from_records takes a list of tuples or an ndarray with structured dtype. It works analogously to the normal DataFrame constructor, except that the resulting DataFrame index may be a specific field of the structured dtype. For example:

  In [56]: data
  Out[56]:
  array([(1, 2., b'Hello'), (2, 3., b'World')],
        dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])

  In [57]: pd.DataFrame.from_records(data, index='C')
  Out[57]:
            A    B
  C
  b'Hello'  1  2.0
  b'World'  2  3.0

Column selection, addition, deletion

You can treat a DataFrame semantically like a dict of like-indexed Series objects. Getting, setting, and deleting columns works with the same syntax as the analogous dict operations:

  In [58]: df['one']
  Out[58]:
  a    1.0
  b    2.0
  c    3.0
  d    NaN
  Name: one, dtype: float64

  In [59]: df['three'] = df['one'] * df['two']

  In [60]: df['flag'] = df['one'] > 2

  In [61]: df
  Out[61]:
     one  two  three   flag
  a  1.0  1.0    1.0  False
  b  2.0  2.0    4.0  False
  c  3.0  3.0    9.0   True
  d  NaN  4.0    NaN  False

Columns can be deleted or popped like with a dict:

  In [62]: del df['two']

  In [63]: three = df.pop('three')

  In [64]: df
  Out[64]:
     one   flag
  a  1.0  False
  b  2.0  False
  c  3.0   True
  d  NaN  False

When inserting a scalar value, it will naturally be propagated to fill the column:

  In [65]: df['foo'] = 'bar'

  In [66]: df
  Out[66]:
     one   flag  foo
  a  1.0  False  bar
  b  2.0  False  bar
  c  3.0   True  bar
  d  NaN  False  bar

When inserting a Series that does not have the same index as the DataFrame, it will be conformed to the DataFrame’s index:

  In [67]: df['one_trunc'] = df['one'][:2]

  In [68]: df
  Out[68]:
     one   flag  foo  one_trunc
  a  1.0  False  bar        1.0
  b  2.0  False  bar        2.0
  c  3.0   True  bar        NaN
  d  NaN  False  bar        NaN

You can insert raw ndarrays but their length must match the length of the DataFrame’s index.
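
A self-contained sketch of this constraint (the column names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0, 3.0]}, index=['a', 'b', 'c'])
df['arr'] = np.array([10, 20, 30])   # length 3 matches the index -> accepted
print(df['arr'].tolist())            # [10, 20, 30]

try:
    df['bad'] = np.array([1, 2])     # length 2 != 3 -> ValueError
except ValueError:
    print('length mismatch rejected')
```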

By default, columns get inserted at the end. The insert function is available to insert at a particular location in the columns:

  In [69]: df.insert(1, 'bar', df['one'])

  In [70]: df
  Out[70]:
     one  bar   flag  foo  one_trunc
  a  1.0  1.0  False  bar        1.0
  b  2.0  2.0  False  bar        2.0
  c  3.0  3.0   True  bar        NaN
  d  NaN  NaN  False  bar        NaN

Assigning New Columns in Method Chains

Inspired by dplyr’s mutate verb, DataFrame has an assign() method that allows you to easily create new columns that are potentially derived from existing columns.

  In [71]: iris = pd.read_csv('data/iris.data')

  In [72]: iris.head()
  Out[72]:
     SepalLength  SepalWidth  PetalLength  PetalWidth         Name
  0          5.1         3.5          1.4         0.2  Iris-setosa
  1          4.9         3.0          1.4         0.2  Iris-setosa
  2          4.7         3.2          1.3         0.2  Iris-setosa
  3          4.6         3.1          1.5         0.2  Iris-setosa
  4          5.0         3.6          1.4         0.2  Iris-setosa

  In [73]: (iris.assign(sepal_ratio=iris['SepalWidth'] / iris['SepalLength'])
     ....:      .head())
  Out[73]:
     SepalLength  SepalWidth  PetalLength  PetalWidth         Name  sepal_ratio
  0          5.1         3.5          1.4         0.2  Iris-setosa       0.6863
  1          4.9         3.0          1.4         0.2  Iris-setosa       0.6122
  2          4.7         3.2          1.3         0.2  Iris-setosa       0.6809
  3          4.6         3.1          1.5         0.2  Iris-setosa       0.6739
  4          5.0         3.6          1.4         0.2  Iris-setosa       0.7200

In the example above, we inserted a precomputed value. We can also pass in a function of one argument to be evaluated on the DataFrame being assigned to.

  In [74]: iris.assign(sepal_ratio=lambda x: (x['SepalWidth'] /
     ....:                                    x['SepalLength'])).head()
  Out[74]:
     SepalLength  SepalWidth  PetalLength  PetalWidth         Name  sepal_ratio
  0          5.1         3.5          1.4         0.2  Iris-setosa       0.6863
  1          4.9         3.0          1.4         0.2  Iris-setosa       0.6122
  2          4.7         3.2          1.3         0.2  Iris-setosa       0.6809
  3          4.6         3.1          1.5         0.2  Iris-setosa       0.6739
  4          5.0         3.6          1.4         0.2  Iris-setosa       0.7200

assign always returns a copy of the data, leaving the original DataFrame untouched.

Passing a callable, as opposed to an actual value to be inserted, is useful when you don’t have a reference to the DataFrame at hand. This is common when using assign in a chain of operations. For example, we can limit the DataFrame to just those observations with a Sepal Length greater than 5, calculate the ratio, and plot:

  In [75]: (iris.query('SepalLength > 5')
     ....:      .assign(SepalRatio=lambda x: x.SepalWidth / x.SepalLength,
     ....:              PetalRatio=lambda x: x.PetalWidth / x.PetalLength)
     ....:      .plot(kind='scatter', x='SepalRatio', y='PetalRatio'))
  Out[75]: <matplotlib.axes._subplots.AxesSubplot at 0x7f210fb001d0>

[Scatter plot of PetalRatio versus SepalRatio]

Since a function is passed in, the function is computed on the DataFrame being assigned to. Importantly, this is the DataFrame that’s been filtered to those rows with sepal length greater than 5. The filtering happens first, and then the ratio calculations. This is an example where we didn’t have a reference to the filtered DataFrame available.

The function signature for assign is simply **kwargs. The keys are the column names for the new fields, and the values are either a value to be inserted (for example, a Series or NumPy array), or a function of one argument to be called on the DataFrame. A copy of the original DataFrame is returned, with the new values inserted.

Changed in version 0.23.0.

Starting with Python 3.6 the order of **kwargs is preserved. This allows for dependent assignment, where an expression later in **kwargs can refer to a column created earlier in the same assign().

  In [76]: dfa = pd.DataFrame({"A": [1, 2, 3],
     ....:                     "B": [4, 5, 6]})

  In [77]: dfa.assign(C=lambda x: x['A'] + x['B'],
     ....:            D=lambda x: x['A'] + x['C'])
  Out[77]:
     A  B  C   D
  0  1  4  5   6
  1  2  5  7   9
  2  3  6  9  12

In the second expression, x['C'] will refer to the newly created column, which equals dfa['A'] + dfa['B'].

To write code compatible with all versions of Python, split the assignment in two.

  In [78]: dependent = pd.DataFrame({"A": [1, 1, 1]})

  In [79]: (dependent.assign(A=lambda x: x['A'] + 1)
     ....:           .assign(B=lambda x: x['A'] + 2))
  Out[79]:
     A  B
  0  2  4
  1  2  4
  2  2  4

Warning

Dependent assignment may subtly change the behavior of your code between Python 3.6 and older versions of Python. If you wish to write code that supports Python versions both before and after 3.6, you'll need to take care when passing assign expressions that both

  • update an existing column, and
  • refer to the newly updated column in the same assign.

For example, we'll update column "A" and then refer to it when creating "B".
  >>> dependent = pd.DataFrame({"A": [1, 1, 1]})
  >>> dependent.assign(A=lambda x: x["A"] + 1,
  ...                  B=lambda x: x["A"] + 2)

For Python 3.5 and earlier the expression creating B refers to the “old” value of A, [1, 1, 1]. The output is then

     A  B
  0  2  3
  1  2  3
  2  2  3

For Python 3.6 and later, the expression creating B refers to the “new” value of A, [2, 2, 2], which results in

     A  B
  0  2  4
  1  2  4
  2  2  4

Indexing / Selection

The basics of indexing are as follows:

  Operation                       Syntax         Result
  Select column                   df[col]        Series
  Select row by label             df.loc[label]  Series
  Select row by integer location  df.iloc[loc]   Series
  Slice rows                      df[5:10]       DataFrame
  Select rows by boolean vector   df[bool_vec]   DataFrame

Row selection, for example, returns a Series whose index is the columns of the DataFrame:

  In [80]: df.loc['b']
  Out[80]:
  one              2
  bar              2
  flag         False
  foo            bar
  one_trunc        2
  Name: b, dtype: object

  In [81]: df.iloc[2]
  Out[81]:
  one              3
  bar              3
  flag          True
  foo            bar
  one_trunc      NaN
  Name: c, dtype: object

For a more exhaustive treatment of sophisticated label-based indexing and slicing, see the section on indexing. We will address the fundamentals of reindexing / conforming to new sets of labels in the section on reindexing.

Data alignment and arithmetic

Operations between DataFrame objects automatically align on both the columns and the index (row labels). Again, the resulting object will have the union of the column and row labels.

  In [82]: df = pd.DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])

  In [83]: df2 = pd.DataFrame(np.random.randn(7, 3), columns=['A', 'B', 'C'])

  In [84]: df + df2
  Out[84]:
          A       B       C   D
  0  0.0457 -0.0141  1.3809 NaN
  1 -0.9554 -1.5010  0.0372 NaN
  2 -0.6627  1.5348 -0.8597 NaN
  3 -2.4529  1.2373 -0.1337 NaN
  4  1.4145  1.9517 -2.3204 NaN
  5 -0.4949 -1.6497 -1.0846 NaN
  6 -1.0476 -0.7486 -0.8055 NaN
  7     NaN     NaN     NaN NaN
  8     NaN     NaN     NaN NaN
  9     NaN     NaN     NaN NaN

When doing an operation between DataFrame and Series, the default behavior is to align the Series index on the DataFrame columns, thus broadcasting row-wise. For example:

  In [85]: df - df.iloc[0]
  Out[85]:
          A       B       C       D
  0  0.0000  0.0000  0.0000  0.0000
  1 -1.3593 -0.2487 -0.4534 -1.7547
  2  0.2531  0.8297  0.0100 -1.9912
  3 -1.3111  0.0543 -1.7249 -1.6205
  4  0.5730  1.5007 -0.6761  1.3673
  5 -1.7412  0.7820 -1.2416 -2.0531
  6 -1.2408 -0.8696 -0.1533  0.0004
  7 -0.7439  0.4110 -0.9296 -0.2824
  8 -1.1949  1.3207  0.2382 -1.4826
  9  2.2938  1.8562  0.7733 -1.4465

In the special case of working with time series data, where the DataFrame index also contains dates, the broadcasting will be column-wise:

  In [86]: index = pd.date_range('1/1/2000', periods=8)

  In [87]: df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=list('ABC'))

  In [88]: df
  Out[88]:
                   A       B       C
  2000-01-01 -1.2268  0.7698 -1.2812
  2000-01-02 -0.7277 -0.1213 -0.0979
  2000-01-03  0.6958  0.3417  0.9597
  2000-01-04 -1.1103 -0.6200  0.1497
  2000-01-05 -0.7323  0.6877  0.1764
  2000-01-06  0.4033 -0.1550  0.3016
  2000-01-07 -2.1799 -1.3698 -0.9542
  2000-01-08  1.4627 -1.7432 -0.8266

  In [89]: type(df['A'])
  Out[89]: pandas.core.series.Series

  In [90]: df - df['A']
  Out[90]:
              2000-01-01 00:00:00  2000-01-02 00:00:00  2000-01-03 00:00:00  \
  2000-01-01                  NaN                  NaN                  NaN
  2000-01-02                  NaN                  NaN                  NaN
  2000-01-03                  NaN                  NaN                  NaN
  2000-01-04                  NaN                  NaN                  NaN
  2000-01-05                  NaN                  NaN                  NaN
  2000-01-06                  NaN                  NaN                  NaN
  2000-01-07                  NaN                  NaN                  NaN
  2000-01-08                  NaN                  NaN                  NaN

              2000-01-04 00:00:00  ...  2000-01-08 00:00:00   A   B   C
  2000-01-01                  NaN  ...                  NaN NaN NaN NaN
  2000-01-02                  NaN  ...                  NaN NaN NaN NaN
  2000-01-03                  NaN  ...                  NaN NaN NaN NaN
  2000-01-04                  NaN  ...                  NaN NaN NaN NaN
  2000-01-05                  NaN  ...                  NaN NaN NaN NaN
  2000-01-06                  NaN  ...                  NaN NaN NaN NaN
  2000-01-07                  NaN  ...                  NaN NaN NaN NaN
  2000-01-08                  NaN  ...                  NaN NaN NaN NaN

  [8 rows x 11 columns]

Warning

  df - df['A']

is now deprecated and will be removed in a future release. The preferred way to replicate this behavior is

  df.sub(df['A'], axis=0)

For explicit control over the matching and broadcasting behavior, see the section on flexible binary operations.

Operations with scalars are just as you would expect:

  In [91]: df * 5 + 2
  Out[91]:
                   A       B       C
  2000-01-01 -4.1341  5.8490 -4.4062
  2000-01-02 -1.6385  1.3935  1.5106
  2000-01-03  5.4789  3.7087  6.7986
  2000-01-04 -3.5517 -1.0999  2.7487
  2000-01-05 -1.6617  5.4387  2.8822
  2000-01-06  4.0165  1.2252  3.5081
  2000-01-07 -8.8993 -4.8492 -2.7710
  2000-01-08  9.3135 -6.7158 -2.1330

  In [92]: 1 / df
  Out[92]:
                   A       B        C
  2000-01-01 -0.8151  1.2990  -0.7805
  2000-01-02 -1.3742 -8.2436 -10.2163
  2000-01-03  1.4372  2.9262   1.0420
  2000-01-04 -0.9006 -1.6130   6.6779
  2000-01-05 -1.3655  1.4540   5.6675
  2000-01-06  2.4795 -6.4537   3.3154
  2000-01-07 -0.4587 -0.7300  -1.0480
  2000-01-08  0.6837 -0.5737  -1.2098

  In [93]: df ** 4
  Out[93]:
                    A       B           C
  2000-01-01   2.2653  0.3512  2.6948e+00
  2000-01-02   0.2804  0.0002  9.1796e-05
  2000-01-03   0.2344  0.0136  8.4838e-01
  2000-01-04   1.5199  0.1477  5.0286e-04
  2000-01-05   0.2876  0.2237  9.6924e-04
  2000-01-06   0.0265  0.0006  8.2769e-03
  2000-01-07  22.5795  3.5212  8.2903e-01
  2000-01-08   4.5774  9.2332  4.6683e-01

Boolean operators work as well:

  In [94]: df1 = pd.DataFrame({'a': [1, 0, 1], 'b': [0, 1, 1]}, dtype=bool)

  In [95]: df2 = pd.DataFrame({'a': [0, 1, 1], 'b': [1, 1, 0]}, dtype=bool)

  In [96]: df1 & df2
  Out[96]:
         a      b
  0  False  False
  1  False   True
  2   True  False

  In [97]: df1 | df2
  Out[97]:
        a     b
  0  True  True
  1  True  True
  2  True  True

  In [98]: df1 ^ df2
  Out[98]:
         a      b
  0   True   True
  1   True  False
  2  False   True

  In [99]: -df1
  Out[99]:
         a      b
  0  False   True
  1   True  False
  2  False  False

Transposing

To transpose, access the T attribute (also the transpose function), similar to an ndarray:

  # only show the first 5 rows
  In [100]: df[:5].T
  Out[100]:
     2000-01-01  2000-01-02  2000-01-03  2000-01-04  2000-01-05
  A     -1.2268     -0.7277      0.6958     -1.1103     -0.7323
  B      0.7698     -0.1213      0.3417     -0.6200      0.6877
  C     -1.2812     -0.0979      0.9597      0.1497      0.1764

DataFrame interoperability with NumPy functions

Elementwise NumPy ufuncs (log, exp, sqrt, …) and various other NumPy functions can be used with no issues on DataFrame, assuming the data within are numeric:

  In [101]: np.exp(df)
  Out[101]:
                   A       B       C
  2000-01-01  0.2932  2.1593  0.2777
  2000-01-02  0.4830  0.8858  0.9068
  2000-01-03  2.0053  1.4074  2.6110
  2000-01-04  0.3294  0.5380  1.1615
  2000-01-05  0.4808  1.9892  1.1930
  2000-01-06  1.4968  0.8565  1.3521
  2000-01-07  0.1131  0.2541  0.3851
  2000-01-08  4.3176  0.1750  0.4375

  In [102]: np.asarray(df)
  Out[102]:
  array([[-1.2268,  0.7698, -1.2812],
         [-0.7277, -0.1213, -0.0979],
         [ 0.6958,  0.3417,  0.9597],
         [-1.1103, -0.62  ,  0.1497],
         [-0.7323,  0.6877,  0.1764],
         [ 0.4033, -0.155 ,  0.3016],
         [-2.1799, -1.3698, -0.9542],
         [ 1.4627, -1.7432, -0.8266]])

The dot method on DataFrame implements matrix multiplication:

  In [103]: df.T.dot(df)
  Out[103]:
            A       B       C
  A  11.3419 -0.0598  3.0080
  B  -0.0598  6.5206  2.0833
  C   3.0080  2.0833  4.3105

Similarly, the dot method on Series implements dot product:

  In [104]: s1 = pd.Series(np.arange(5, 10))

  In [105]: s1.dot(s1)
  Out[105]: 255

DataFrame is not intended to be a drop-in replacement for ndarray as its indexing semantics are quite different in places from a matrix.
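
A brief sketch of one such difference (all names illustrative): plain [] indexing selects a row from an ndarray but a column from a DataFrame.

```python
import numpy as np
import pandas as pd

arr = np.arange(6).reshape(2, 3)
df = pd.DataFrame(arr, columns=['a', 'b', 'c'])

print(arr[0])            # ndarray: [] with an integer picks the first *row*
print(df['a'].tolist())  # DataFrame: [] with a label picks a *column*
```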

Console display

Very large DataFrames will be truncated to display them in the console. You can also get a summary using info(). (Here I am reading a CSV version of the baseball dataset from the plyr R package):

  In [106]: baseball = pd.read_csv('data/baseball.csv')

  In [107]: print(baseball)
         id     player  year  stint  ...  hbp   sh   sf  gidp
  0   88641  womacto01  2006      2  ...  0.0  3.0  0.0   0.0
  1   88643  schilcu01  2006      1  ...  0.0  0.0  0.0   0.0
  ..    ...        ...   ...    ...  ...  ...  ...  ...   ...
  98  89533   aloumo01  2007      1  ...  2.0  0.0  3.0  13.0
  99  89534  alomasa02  2007      1  ...  0.0  0.0  0.0   0.0

  [100 rows x 23 columns]

  In [108]: baseball.info()
  <class 'pandas.core.frame.DataFrame'>
  RangeIndex: 100 entries, 0 to 99
  Data columns (total 23 columns):
  id        100 non-null int64
  player    100 non-null object
  year      100 non-null int64
  stint     100 non-null int64
  team      100 non-null object
  lg        100 non-null object
  g         100 non-null int64
  ab        100 non-null int64
  r         100 non-null int64
  h         100 non-null int64
  X2b       100 non-null int64
  X3b       100 non-null int64
  hr        100 non-null int64
  rbi       100 non-null float64
  sb        100 non-null float64
  cs        100 non-null float64
  bb        100 non-null int64
  so        100 non-null float64
  ibb       100 non-null float64
  hbp       100 non-null float64
  sh        100 non-null float64
  sf        100 non-null float64
  gidp      100 non-null float64
  dtypes: float64(9), int64(11), object(3)
  memory usage: 18.0+ KB

However, using to_string will return a string representation of the DataFrame in tabular form, though it won’t always fit the console width:

  In [109]: print(baseball.iloc[-20:, :12].to_string())
         id     player  year  stint team  lg    g   ab   r    h  X2b  X3b
  80  89474  finlest01  2007      1  COL  NL   43   94   9   17    3    0
  81  89480  embreal01  2007      1  OAK  AL    4    0   0    0    0    0
  82  89481  edmonji01  2007      1  SLN  NL  117  365  39   92   15    2
  83  89482  easleda01  2007      1  NYN  NL   76  193  24   54    6    0
  84  89489  delgaca01  2007      1  NYN  NL  139  538  71  139   30    0
  85  89493  cormirh01  2007      1  CIN  NL    6    0   0    0    0    0
  86  89494  coninje01  2007      2  NYN  NL   21   41   2    8    2    0
  87  89495  coninje01  2007      1  CIN  NL   80  215  23   57   11    1
  88  89497  clemero02  2007      1  NYA  AL    2    2   0    1    0    0
  89  89498  claytro01  2007      2  BOS  AL    8    6   1    0    0    0
  90  89499  claytro01  2007      1  TOR  AL   69  189  23   48   14    0
  91  89501  cirilje01  2007      2  ARI  NL   28   40   6    8    4    0
  92  89502  cirilje01  2007      1  MIN  AL   50  153  18   40    9    2
  93  89521  bondsba01  2007      1  SFN  NL  126  340  75   94   14    0
  94  89523  biggicr01  2007      1  HOU  NL  141  517  68  130   31    3
  95  89525  benitar01  2007      2  FLO  NL   34    0   0    0    0    0
  96  89526  benitar01  2007      1  SFN  NL   19    0   0    0    0    0
  97  89530  ausmubr01  2007      1  HOU  NL  117  349  38   82   16    3
  98  89533   aloumo01  2007      1  NYN  NL   87  328  51  112   19    1
  99  89534  alomasa02  2007      1  NYN  NL    8   22   1    3    1    0

Wide DataFrames will be printed across multiple rows by default:

  In [110]: pd.DataFrame(np.random.randn(3, 12))
  Out[110]:
            0         1         2         3         4         5         6         7         8         9        10        11
  0 -0.345352  1.314232  0.690579  0.995761  2.396780  0.014871  3.357427 -0.317441 -1.236269  0.896171 -0.487602 -0.082240
  1 -2.182937  0.380396  0.084844  0.432390  1.519970 -0.493662  0.600178  0.274230  0.132885 -0.023688  2.410179  1.450520
  2  0.206053 -0.251905 -2.213588  1.063327  1.266143  0.299368 -0.863838  0.408204 -1.048089 -0.025747 -0.988387  0.094055

You can change how much to print on a single row by setting the display.width option:

  In [111]: pd.set_option('display.width', 40)  # default is 80

  In [112]: pd.DataFrame(np.random.randn(3, 12))
  Out[112]:
            0         1         2         3         4         5         6         7         8         9        10        11
  0  1.262731  1.289997  0.082423 -0.055758  0.536580 -0.489682  0.369374 -0.034571 -2.484478 -0.281461  0.030711  0.109121
  1  1.126203 -0.977349  1.474071 -0.064034 -1.282782  0.781836 -1.071357  0.441153  2.353925  0.583787  0.221471 -0.744471
  2  0.758527  1.729689 -0.964980 -0.845696 -1.340896  1.846883 -1.328865  1.682706 -1.717693  0.888782  0.228440  0.901805

You can adjust the max width of individual columns by setting display.max_colwidth:

  In [113]: datafile = {'filename': ['filename_01', 'filename_02'],
     .....:             'path': ["media/user_name/storage/folder_01/filename_01",
     .....:                      "media/user_name/storage/folder_02/filename_02"]}

  In [114]: pd.set_option('display.max_colwidth', 30)

  In [115]: pd.DataFrame(datafile)
  Out[115]:
        filename                            path
  0  filename_01  media/user_name/storage/fo...
  1  filename_02  media/user_name/storage/fo...

  In [116]: pd.set_option('display.max_colwidth', 100)

  In [117]: pd.DataFrame(datafile)
  Out[117]:
        filename                                           path
  0  filename_01  media/user_name/storage/folder_01/filename_01
  1  filename_02  media/user_name/storage/folder_02/filename_02

You can also disable this feature via the expand_frame_repr option. This will print the table in one block.
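
A sketch of toggling it and then restoring the default afterwards:

```python
import numpy as np
import pandas as pd

pd.set_option('expand_frame_repr', False)   # wide frames print as one block
print(pd.DataFrame(np.random.randn(3, 12)))

pd.reset_option('expand_frame_repr')        # back to the default (True)
```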

DataFrame column attribute access and IPython completion

If a DataFrame column label is a valid Python variable name, the column can be accessed like an attribute:

  In [118]: df = pd.DataFrame({'foo1': np.random.randn(5),
     .....:                    'foo2': np.random.randn(5)})

  In [119]: df
  Out[119]:
         foo1      foo2
  0  1.171216 -0.858447
  1  0.520260  0.306996
  2 -1.197071 -0.028665
  3 -1.066969  0.384316
  4 -0.303421  1.574159

  In [120]: df.foo1
  Out[120]:
  0    1.171216
  1    0.520260
  2   -1.197071
  3   -1.066969
  4   -0.303421
  Name: foo1, dtype: float64

The columns are also connected to the IPython completion mechanism so they can be tab-completed:

  In [5]: df.fo<TAB>
  df.foo1  df.foo2