Data types

The main types stored in pandas objects are float, int, bool, datetime64[ns] and datetime64[ns, tz], timedelta64[ns], category and object. In addition, these dtypes have item sizes, e.g. int64 and int32. See Series with TZ for more detail on datetime64[ns, tz] dtypes.
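
As a quick illustration (a minimal sketch, assuming the usual import pandas as pd; these objects are not part of the examples below), an explicit item size can be requested when constructing a Series, and a timezone-aware column carries the timezone in its dtype:

    # request a specific item size instead of the default int64
    s32 = pd.Series([1, 2, 3], dtype='int32')
    s32.dtype          # int32

    # a tz-aware Series has a datetime64[ns, tz] dtype
    stz = pd.Series(pd.date_range('2016-01-01', periods=3, tz='US/Eastern'))
    stz.dtype          # datetime64[ns, US/Eastern]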

A convenient dtypes attribute for DataFrame returns a Series with the data type of each column.

    In [344]: dft = pd.DataFrame(dict(A=np.random.rand(3),
       .....:                         B=1,
       .....:                         C='foo',
       .....:                         D=pd.Timestamp('20010102'),
       .....:                         E=pd.Series([1.0] * 3).astype('float32'),
       .....:                         F=False,
       .....:                         G=pd.Series([1] * 3, dtype='int8')))
       .....:

    In [345]: dft
    Out[345]:
              A  B    C          D    E      F  G
    0  0.809585  1  foo 2001-01-02  1.0  False  1
    1  0.128238  1  foo 2001-01-02  1.0  False  1
    2  0.775752  1  foo 2001-01-02  1.0  False  1

    In [346]: dft.dtypes
    Out[346]:
    A           float64
    B             int64
    C            object
    D    datetime64[ns]
    E           float32
    F              bool
    G              int8
    dtype: object

On a Series object, use the dtype attribute.

    In [347]: dft['A'].dtype
    Out[347]: dtype('float64')

If a pandas object contains data with multiple dtypes in a single column, the dtype of the column will be chosen to accommodate all of the data types (object is the most general).

    # these ints are coerced to floats
    In [348]: pd.Series([1, 2, 3, 4, 5, 6.])
    Out[348]:
    0    1.0
    1    2.0
    2    3.0
    3    4.0
    4    5.0
    5    6.0
    dtype: float64

    # string data forces an ``object`` dtype
    In [349]: pd.Series([1, 2, 3, 6., 'foo'])
    Out[349]:
    0      1
    1      2
    2      3
    3      6
    4    foo
    dtype: object

The number of columns of each type in a DataFrame can be found by calling get_dtype_counts().

    In [350]: dft.get_dtype_counts()
    Out[350]:
    float64           1
    float32           1
    int64             1
    int8              1
    datetime64[ns]    1
    bool              1
    object            1
    dtype: int64

Numeric dtypes will propagate and can coexist in DataFrames. If a dtype is passed (either directly via the dtype keyword, a passed ndarray, or a passed Series), then it will be preserved in DataFrame operations. Furthermore, different numeric dtypes will NOT be combined. The following example will give you a taste.

    In [351]: df1 = pd.DataFrame(np.random.randn(8, 1), columns=['A'], dtype='float32')

    In [352]: df1
    Out[352]:
              A
    0  0.890400
    1  0.283331
    2 -0.303613
    3 -1.192210
    4  0.065420
    5  0.455918
    6  2.008328
    7  0.188942

    In [353]: df1.dtypes
    Out[353]:
    A    float32
    dtype: object

    In [354]: df2 = pd.DataFrame(dict(A=pd.Series(np.random.randn(8), dtype='float16'),
       .....:                         B=pd.Series(np.random.randn(8)),
       .....:                         C=pd.Series(np.array(np.random.randn(8), dtype='uint8'))))
       .....:

    In [355]: df2
    Out[355]:
              A         B    C
    0 -0.454346  0.200071  255
    1 -0.916504 -0.557756  255
    2  0.640625 -0.141988    0
    3  2.675781 -0.174060    0
    4 -0.007866  0.258626    0
    5 -0.204224  0.941688    0
    6 -0.100098 -1.849045    0
    7 -0.402100 -0.949458    0

    In [356]: df2.dtypes
    Out[356]:
    A    float16
    B    float64
    C      uint8
    dtype: object

defaults

By default integer types are int64 and float types are float64, regardless of platform (32-bit or 64-bit). The following will all result in int64 dtypes.

    In [357]: pd.DataFrame([1, 2], columns=['a']).dtypes
    Out[357]:
    a    int64
    dtype: object

    In [358]: pd.DataFrame({'a': [1, 2]}).dtypes
    Out[358]:
    a    int64
    dtype: object

    In [359]: pd.DataFrame({'a': 1}, index=list(range(2))).dtypes
    Out[359]:
    a    int64
    dtype: object

Note that NumPy will choose platform-dependent types when creating arrays. The following WILL result in int32 on a 32-bit platform.

    In [360]: frame = pd.DataFrame(np.array([1, 2]))
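
To see what was chosen, inspect frame.dtypes. A minimal sketch of making the width explicit so it does not depend on the platform (frame64 is an illustrative name, not part of the example above):

    frame.dtypes          # platform-dependent: int32 on a 32-bit platform

    # passing a typed array (or the dtype keyword) fixes the width explicitly
    frame64 = pd.DataFrame(np.array([1, 2], dtype='int64'))
    frame64.dtypes        # int64 regardless of platform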

upcasting

Types can potentially be upcast when combined with other types, meaning they are promoted from the current type (e.g. int to float).

    In [361]: df3 = df1.reindex_like(df2).fillna(value=0.0) + df2

    In [362]: df3
    Out[362]:
              A         B      C
    0  0.436054  0.200071  255.0
    1 -0.633173 -0.557756  255.0
    2  0.337012 -0.141988    0.0
    3  1.483571 -0.174060    0.0
    4  0.057555  0.258626    0.0
    5  0.251695  0.941688    0.0
    6  1.908231 -1.849045    0.0
    7 -0.213158 -0.949458    0.0

    In [363]: df3.dtypes
    Out[363]:
    A    float32
    B    float64
    C    float64
    dtype: object

The values attribute on a DataFrame returns the lowest common denominator of the dtypes, meaning the dtype that can accommodate ALL of the types in the resulting homogeneously-dtyped NumPy array. This can force some upcasting.

    In [364]: df3.values.dtype
    Out[364]: dtype('float64')

astype

You can use the astype() method to explicitly convert dtypes from one to another. These will by default return a copy, even if the dtype was unchanged (pass copy=False to change this behavior). In addition, they will raise an exception if the astype operation is invalid.
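
A minimal sketch of these two behaviors (the Series values here are only illustrative and are not used elsewhere in this section):

    s = pd.Series([1, 2, 3], dtype='int64')

    # astype returns a new object by default, even when the dtype is unchanged
    s2 = s.astype('int64')                 # a copy of s
    s3 = s.astype('int64', copy=False)     # may reuse the underlying data

    # an invalid conversion raises instead of silently coercing
    try:
        pd.Series(['1', '2', 'three']).astype('int64')
    except ValueError:
        pass                               # 'three' is not a valid integer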

Upcasting is always according to the NumPy rules. If two different dtypes are involved in an operation, then the more general one will be used as the result of the operation.

    In [365]: df3
    Out[365]:
              A         B      C
    0  0.436054  0.200071  255.0
    1 -0.633173 -0.557756  255.0
    2  0.337012 -0.141988    0.0
    3  1.483571 -0.174060    0.0
    4  0.057555  0.258626    0.0
    5  0.251695  0.941688    0.0
    6  1.908231 -1.849045    0.0
    7 -0.213158 -0.949458    0.0

    In [366]: df3.dtypes
    Out[366]:
    A    float32
    B    float64
    C    float64
    dtype: object

    # conversion of dtypes
    In [367]: df3.astype('float32').dtypes
    Out[367]:
    A    float32
    B    float32
    C    float32
    dtype: object

Convert a subset of columns to a specified type using astype().

    In [368]: dft = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

    In [369]: dft[['a', 'b']] = dft[['a', 'b']].astype(np.uint8)

    In [370]: dft
    Out[370]:
       a  b  c
    0  1  4  7
    1  2  5  8
    2  3  6  9

    In [371]: dft.dtypes
    Out[371]:
    a    uint8
    b    uint8
    c    int64
    dtype: object

New in version 0.19.0.

Convert certain columns to a specific dtype by passing a dict to astype().

    In [372]: dft1 = pd.DataFrame({'a': [1, 0, 1], 'b': [4, 5, 6], 'c': [7, 8, 9]})

    In [373]: dft1 = dft1.astype({'a': np.bool, 'c': np.float64})

    In [374]: dft1
    Out[374]:
           a  b    c
    0   True  4  7.0
    1  False  5  8.0
    2   True  6  9.0

    In [375]: dft1.dtypes
    Out[375]:
    a       bool
    b      int64
    c    float64
    dtype: object

Note: When trying to convert a subset of columns to a specified type using astype() and loc(), upcasting occurs.

loc() tries to fit what we are assigning into the current dtypes, while [] will overwrite them, taking the dtype from the right-hand side. Therefore the following piece of code produces an unintended result.

    In [376]: dft = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

    In [377]: dft.loc[:, ['a', 'b']].astype(np.uint8).dtypes
    Out[377]:
    a    uint8
    b    uint8
    dtype: object

    In [378]: dft.loc[:, ['a', 'b']] = dft.loc[:, ['a', 'b']].astype(np.uint8)

    In [379]: dft.dtypes
    Out[379]:
    a    int64
    b    int64
    c    int64
    dtype: object

object conversion

pandas offers various functions to try to force conversion of types from the object dtype to other types. In cases where the data is already of the correct type, but stored in an object array, the DataFrame.infer_objects() and Series.infer_objects() methods can be used to soft convert to the correct type.

    In [380]: import datetime

    In [381]: df = pd.DataFrame([[1, 2],
       .....:                    ['a', 'b'],
       .....:                    [datetime.datetime(2016, 3, 2), datetime.datetime(2016, 3, 2)]])
       .....:

    In [382]: df = df.T

    In [383]: df
    Out[383]:
       0  1                    2
    0  1  a  2016-03-02 00:00:00
    1  2  b  2016-03-02 00:00:00

    In [384]: df.dtypes
    Out[384]:
    0    object
    1    object
    2    object
    dtype: object

Because the data was transposed, the original inference stored all columns as object, which infer_objects() will correct.

    In [385]: df.infer_objects().dtypes
    Out[385]:
    0             int64
    1            object
    2    datetime64[ns]
    dtype: object

The following functions are available for one-dimensional object arrays or scalars to perform hard conversion of objects to a specified type:

  • to_numeric() (conversion to numeric dtypes)

        In [386]: m = ['1.1', 2, 3]

        In [387]: pd.to_numeric(m)
        Out[387]: array([ 1.1,  2. ,  3. ])

  • to_datetime() (conversion to datetime objects)

        In [388]: import datetime

        In [389]: m = ['2016-07-09', datetime.datetime(2016, 3, 2)]

        In [390]: pd.to_datetime(m)
        Out[390]: DatetimeIndex(['2016-07-09', '2016-03-02'], dtype='datetime64[ns]', freq=None)

  • to_timedelta() (conversion to timedelta objects)

        In [391]: m = ['5us', pd.Timedelta('1day')]

        In [392]: pd.to_timedelta(m)
        Out[392]: TimedeltaIndex(['0 days 00:00:00.000005', '1 days 00:00:00'], dtype='timedelta64[ns]', freq=None)

To force a conversion, we can pass in an errors argument, which specifies how pandas should deal with elements that cannot be converted to the desired dtype or object. By default, errors='raise', meaning that any errors encountered will be raised during the conversion process. However, if errors='coerce', these errors will be ignored and pandas will convert problematic elements to pd.NaT (for datetime and timedelta) or np.nan (for numeric). This might be useful if you are reading in data which is mostly of the desired dtype (e.g. numeric, datetime), but occasionally has non-conforming elements intermixed that you want to represent as missing:

    In [393]: import datetime

    In [394]: m = ['apple', datetime.datetime(2016, 3, 2)]

    In [395]: pd.to_datetime(m, errors='coerce')
    Out[395]: DatetimeIndex(['NaT', '2016-03-02'], dtype='datetime64[ns]', freq=None)

    In [396]: m = ['apple', 2, 3]

    In [397]: pd.to_numeric(m, errors='coerce')
    Out[397]: array([ nan,   2.,   3.])

    In [398]: m = ['apple', pd.Timedelta('1day')]

    In [399]: pd.to_timedelta(m, errors='coerce')
    Out[399]: TimedeltaIndex([NaT, '1 days'], dtype='timedelta64[ns]', freq=None)

The errors parameter has a third option of errors='ignore', which will simply return the passed in data if it encounters any errors with the conversion to a desired data type:

    In [400]: import datetime

    In [401]: m = ['apple', datetime.datetime(2016, 3, 2)]

    In [402]: pd.to_datetime(m, errors='ignore')
    Out[402]: array(['apple', datetime.datetime(2016, 3, 2, 0, 0)], dtype=object)

    In [403]: m = ['apple', 2, 3]

    In [404]: pd.to_numeric(m, errors='ignore')
    Out[404]: array(['apple', 2, 3], dtype=object)

    In [405]: m = ['apple', pd.Timedelta('1day')]

    In [406]: pd.to_timedelta(m, errors='ignore')
    Out[406]: array(['apple', Timedelta('1 days 00:00:00')], dtype=object)

In addition to object conversion, to_numeric() provides another argument downcast, which gives the option of downcasting the newly (or already) numeric data to a smaller dtype, which can conserve memory:

    In [407]: m = ['1', 2, 3]

    In [408]: pd.to_numeric(m, downcast='integer')   # smallest signed int dtype
    Out[408]: array([1, 2, 3], dtype=int8)

    In [409]: pd.to_numeric(m, downcast='signed')    # same as 'integer'
    Out[409]: array([1, 2, 3], dtype=int8)

    In [410]: pd.to_numeric(m, downcast='unsigned')  # smallest unsigned int dtype
    Out[410]: array([1, 2, 3], dtype=uint8)

    In [411]: pd.to_numeric(m, downcast='float')     # smallest float dtype
    Out[411]: array([ 1.,  2.,  3.], dtype=float32)

As these methods apply only to one-dimensional arrays, lists, or scalars, they cannot be used directly on multi-dimensional objects such as DataFrames. However, with apply(), we can “apply” the function over each column efficiently:

    In [412]: import datetime

    In [413]: df = pd.DataFrame([['2016-07-09', datetime.datetime(2016, 3, 2)]] * 2, dtype='O')

    In [414]: df
    Out[414]:
                0                    1
    0  2016-07-09  2016-03-02 00:00:00
    1  2016-07-09  2016-03-02 00:00:00

    In [415]: df.apply(pd.to_datetime)
    Out[415]:
               0          1
    0 2016-07-09 2016-03-02
    1 2016-07-09 2016-03-02

    In [416]: df = pd.DataFrame([['1.1', 2, 3]] * 2, dtype='O')

    In [417]: df
    Out[417]:
         0  1  2
    0  1.1  2  3
    1  1.1  2  3

    In [418]: df.apply(pd.to_numeric)
    Out[418]:
         0  1  2
    0  1.1  2  3
    1  1.1  2  3

    In [419]: df = pd.DataFrame([['5us', pd.Timedelta('1day')]] * 2, dtype='O')

    In [420]: df
    Out[420]:
         0                1
    0  5us  1 days 00:00:00
    1  5us  1 days 00:00:00

    In [421]: df.apply(pd.to_timedelta)
    Out[421]:
                    0      1
    0 00:00:00.000005 1 days
    1 00:00:00.000005 1 days

gotchas

Performing selection operations on integer type data can easily upcast the data to floating point. The dtype of the input data will be preserved in cases where nans are not introduced. See also Support for integer NA.

    In [422]: dfi = df3.astype('int32')

    In [423]: dfi['E'] = 1

    In [424]: dfi
    Out[424]:
       A  B    C  E
    0  0  0  255  1
    1  0  0  255  1
    2  0  0    0  1
    3  1  0    0  1
    4  0  0    0  1
    5  0  0    0  1
    6  1 -1    0  1
    7  0  0    0  1

    In [425]: dfi.dtypes
    Out[425]:
    A    int32
    B    int32
    C    int32
    E    int64
    dtype: object

    In [426]: casted = dfi[dfi > 0]

    In [427]: casted
    Out[427]:
         A   B      C  E
    0  NaN NaN  255.0  1
    1  NaN NaN  255.0  1
    2  NaN NaN    NaN  1
    3  1.0 NaN    NaN  1
    4  NaN NaN    NaN  1
    5  NaN NaN    NaN  1
    6  1.0 NaN    NaN  1
    7  NaN NaN    NaN  1

    In [428]: casted.dtypes
    Out[428]:
    A    float64
    B    float64
    C    float64
    E      int64
    dtype: object

While float dtypes are unchanged.

    In [429]: dfa = df3.copy()

    In [430]: dfa['A'] = dfa['A'].astype('float32')

    In [431]: dfa.dtypes
    Out[431]:
    A    float32
    B    float64
    C    float64
    dtype: object

    In [432]: casted = dfa[df2 > 0]

    In [433]: casted
    Out[433]:
              A         B      C
    0       NaN  0.200071  255.0
    1       NaN       NaN  255.0
    2  0.337012       NaN    NaN
    3  1.483571       NaN    NaN
    4       NaN  0.258626    NaN
    5       NaN  0.941688    NaN
    6       NaN       NaN    NaN
    7       NaN       NaN    NaN

    In [434]: casted.dtypes
    Out[434]:
    A    float32
    B    float64
    C    float64
    dtype: object