Nullable integer data type

New in version 0.24.0.

Note

IntegerArray is currently experimental. Its API or implementation may change without warning.

In Working with missing data, we saw that pandas primarily uses NaN to represent missing data. Because NaN is a float, this forces an array of integers with any missing values to become floating point. In some cases, this may not matter much. But if your integer column is, say, an identifier, casting to float can be problematic. Some integers cannot even be represented as floating point numbers.
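
To make the last point concrete, here is a minimal sketch of an identifier being silently altered by the float fallback. The name `big_id` is hypothetical, and the value 2**53 + 1 is chosen because it is the smallest positive integer that float64 cannot represent exactly.

```python
import numpy as np
import pandas as pd

# Hypothetical identifier: 2**53 + 1 has no exact float64 representation.
big_id = 2**53 + 1

# A single missing value makes pandas fall back to float64 ...
ids = pd.Series([big_id, np.nan])
print(ids.dtype)

# ... and the identifier no longer matches once it is read back as an integer.
round_tripped = int(ids.iloc[0])
print(big_id, round_tripped, round_tripped == big_id)
```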

Pandas can represent integer data with possibly missing values using arrays.IntegerArray. This is an extension type implemented within pandas. It is not the default dtype for integers and will not be inferred; you must explicitly pass the dtype into array() or Series:

    In [1]: arr = pd.array([1, 2, np.nan], dtype=pd.Int64Dtype())

    In [2]: arr
    Out[2]:
    <IntegerArray>
    [1, 2, NaN]
    Length: 3, dtype: Int64

Or use the string alias "Int64" (note the capital "I", which differentiates it from NumPy’s 'int64' dtype):

    In [3]: pd.array([1, 2, np.nan], dtype="Int64")
    Out[3]:
    <IntegerArray>
    [1, 2, NaN]
    Length: 3, dtype: Int64
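
Because the two aliases differ only in capitalization, it can help to check which dtype you actually got. The following is a small sketch using `pd.api.types.is_extension_array_dtype`; the variable names `nullable` and `plain` are illustrative only.

```python
import pandas as pd

nullable = pd.Series([1, 2], dtype="Int64")  # pandas nullable extension dtype
plain = pd.Series([1, 2], dtype="int64")     # NumPy integer dtype

print(nullable.dtype, plain.dtype)

# Only the capitalized alias yields an extension dtype.
print(pd.api.types.is_extension_array_dtype(nullable.dtype))
print(pd.api.types.is_extension_array_dtype(plain.dtype))
```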

This array can be stored in a DataFrame or Series like any NumPy array.

    In [4]: pd.Series(arr)
    Out[4]:
    0      1
    1      2
    2    NaN
    dtype: Int64

You can also pass the list-like object to the Series constructor with the dtype.

    In [5]: s = pd.Series([1, 2, np.nan], dtype="Int64")

    In [6]: s
    Out[6]:
    0      1
    1      2
    2    NaN
    dtype: Int64

By default (if you don’t specify dtype), NumPy is used, and you’ll end up with a float64 dtype Series:

    In [7]: pd.Series([1, 2, np.nan])
    Out[7]:
    0    1.0
    1    2.0
    2    NaN
    dtype: float64
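
If a Series has already been inferred as float64, one way to opt in afterwards is an explicit cast. This sketch assumes every non-missing value is a whole number; `floats` and `nullable` are illustrative names.

```python
import numpy as np
import pandas as pd

floats = pd.Series([1, 2, np.nan])   # inferred as float64
nullable = floats.astype("Int64")    # explicit opt-in to the nullable dtype

print(floats.dtype, nullable.dtype)
print(nullable.isna().tolist())      # the missing entry is preserved
```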

Operations involving an integer array behave similarly to those on NumPy arrays. Missing values are propagated, and the data is coerced to another dtype if needed.

    # arithmetic
    In [8]: s + 1
    Out[8]:
    0      2
    1      3
    2    NaN
    dtype: Int64

    # comparison
    In [9]: s == 1
    Out[9]:
    0     True
    1    False
    2    False
    dtype: bool

    # indexing
    In [10]: s.iloc[1:3]
    Out[10]:
    1      2
    2    NaN
    dtype: Int64

    # operate with other dtypes
    In [11]: s + s.iloc[1:3].astype('Int8')
    Out[11]:
    0    NaN
    1      4
    2    NaN
    dtype: Int64

    # coerce when needed
    In [12]: s + 0.01
    Out[12]:
    0    1.01
    1    2.01
    2     NaN
    dtype: float64
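
As a short, hedged sketch of how propagation can be checked programmatically rather than by reading the printed output, the snippet below inspects the result of the arithmetic example with `isna()` and `dropna()`. The name `shifted` is illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan], dtype="Int64")
shifted = s + 1

print(shifted.dtype)             # integer arithmetic keeps the nullable dtype
print(shifted.isna().tolist())   # the missing position is carried through
print(shifted.dropna().tolist()) # the remaining values are still integers
```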

These dtypes can operate as part of a DataFrame.

    In [13]: df = pd.DataFrame({'A': s, 'B': [1, 1, 3], 'C': list('aab')})

    In [14]: df
    Out[14]:
         A  B  C
    0    1  1  a
    1    2  1  a
    2  NaN  3  b

    In [15]: df.dtypes
    Out[15]:
    A     Int64
    B     int64
    C    object
    dtype: object

These dtypes can be merged, reshaped, and cast.

    In [16]: pd.concat([df[['A']], df[['B', 'C']]], axis=1).dtypes
    Out[16]:
    A     Int64
    B     int64
    C    object
    dtype: object

    In [17]: df['A'].astype(float)
    Out[17]:
    0    1.0
    1    2.0
    2    NaN
    Name: A, dtype: float64
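
Casting to float works because float64 has a NaN. Casting to NumPy's 'int64' is different, since that dtype has no slot for a missing value. The sketch below assumes it is acceptable to fill the gap with a sentinel first; the choice of 0 is purely illustrative and not a recommendation from the text above.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan], dtype="Int64")

# A direct cast to NumPy int64 is expected to fail while a value is missing.
try:
    s.astype("int64")
    cast_failed = False
except (ValueError, TypeError):
    cast_failed = True
print(cast_failed)

# Assumption: 0 is a safe sentinel for this data. Fill, then cast.
filled = s.fillna(0).astype("int64")
print(filled.tolist(), filled.dtype)
```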

Reduction and groupby operations such as ‘sum’ work as well.

    In [18]: df.sum()
    Out[18]:
    A      3
    B      5
    C    aab
    dtype: object

    In [19]: df.groupby('B').A.sum()
    Out[19]:
    B
    1    3
    3    0
    Name: A, dtype: Int64
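
The group with B == 3 sums to 0 because its only value is missing and missing values are skipped. A self-contained sketch of the same reduction, using a smaller frame than the one above (column C is omitted for brevity):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": pd.Series([1, 2, np.nan], dtype="Int64"),
    "B": [1, 1, 3],
})

total = df["A"].sum()                   # the missing value is skipped
per_group = df.groupby("B")["A"].sum()  # an all-missing group sums to 0

print(total)
print(per_group.tolist())
```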