Internals

This section provides a look into some of pandas’ internals. It’s primarily intended for developers of pandas itself.

Indexing

In pandas there are a few objects implemented which can serve as valid containers for the axis labels:

  • Index: the generic “ordered set” object, an ndarray of object dtype assuming nothing about its contents. The labels must be hashable (and likely immutable) and unique. Populates a dict of label to location in Cython to do O(1) lookups.
  • Int64Index: a version of Index highly optimized for 64-bit integer data, such as time stamps
  • Float64Index: a version of Index highly optimized for 64-bit float data
  • MultiIndex: the standard hierarchical index object
  • DatetimeIndex: an Index object with Timestamp boxed elements (the underlying implementation stores the int64 values)
  • TimedeltaIndex: an Index object with Timedelta boxed elements (the underlying implementation stores the int64 values)
  • PeriodIndex: An Index object with Period elements
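
The label-to-location mapping described for Index can be illustrated with a small pure-Python sketch. ToyIndex is a hypothetical class invented for this illustration; it is not pandas code (the real mapping is built in Cython), but it shows why hashable, unique labels allow O(1) lookups.

```python
class ToyIndex:
    """Toy stand-in for an Index: ordered, unique, hashable labels.

    Hypothetical illustration only; not how pandas is implemented.
    """

    def __init__(self, labels):
        self._labels = list(labels)
        # Build the label -> position mapping once, up front.
        self._mapping = {}
        for position, label in enumerate(self._labels):
            if label in self._mapping:
                raise ValueError(f"duplicate label: {label!r}")
            self._mapping[label] = position

    def get_loc(self, label):
        # Average-case O(1) dict lookup; raises KeyError for a missing label.
        return self._mapping[label]


idx = ToyIndex(["a", "b", "c"])
print(idx.get_loc("b"))  # 1
```

An unhashable label (a list, for instance) fails when the dict is built, which is one reason the labels must be hashable.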

There are functions that make the creation of a regular index easy:

  • date_range: fixed frequency date range generated from a time rule or DateOffset. An ndarray of Python datetime objects
  • period_range: fixed frequency date range generated from a time rule or DateOffset. An ndarray of Period objects, representing timespans

The motivation for having an Index class in the first place was to enable different implementations of indexing. This means that it’s possible for you, the user, to implement a custom Index subclass that may be better suited to a particular application than the ones provided in pandas.

From an internal implementation point of view, the relevant methods that an Index must define are one or more of the following (depending on how incompatible the new object internals are with the Index functions):

  • get_loc: returns an “indexer” (an integer, or in some cases a slice object) for a label
  • slice_locs: returns the “range” to slice between two labels
  • get_indexer: computes the indexing vector for reindexing / data alignment purposes. See the source / docstrings for more on this
  • get_indexer_non_unique: computes the indexing vector for reindexing / data alignment purposes when the index is non-unique. See the source / docstrings for more on this
  • reindex: does any pre-conversion of the input index, then calls get_indexer
  • union, intersection: computes the union or intersection of two Index objects
  • insert: Inserts a new label into an Index, yielding a new object
  • delete: Delete a label, yielding a new object
  • drop: Deletes a set of labels
  • take: Analogous to ndarray.take
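
To make the roles of get_indexer and slice_locs more concrete, here is a minimal pure-Python sketch of their semantics on a plain list of labels. The two functions below are hypothetical stand-ins written for this illustration, assuming unique labels; they follow the convention that -1 marks a target label that is not found, and that the slice range includes both end labels.

```python
def get_indexer(index_labels, target_labels):
    """For each target label, return its position in index_labels, or -1."""
    mapping = {label: pos for pos, label in enumerate(index_labels)}
    return [mapping.get(label, -1) for label in target_labels]


def slice_locs(index_labels, start, end):
    """Return (i, j) so that index_labels[i:j] runs from start through end."""
    # Assumes both labels are present; list.index raises ValueError otherwise.
    i = index_labels.index(start)
    j = index_labels.index(end) + 1
    return i, j


labels = ["a", "b", "c", "d"]

indexer = get_indexer(labels, ["c", "a", "z"])
print(indexer)  # [2, 0, -1]

i, j = slice_locs(labels, "b", "c")
print((i, j))       # (1, 3)
print(labels[i:j])  # ['b', 'c']
```

The indexing vector is what makes reindexing cheap: once it is computed, the data can be rearranged with a single take-style operation, filling in missing values wherever the vector holds -1.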

MultiIndex

Internally, the MultiIndex consists of a few things: the levels, the integer codes (until version 0.24 named labels), and the level names:

    In [1]: index = pd.MultiIndex.from_product([range(3), ['one', 'two']],
       ...:                                    names=['first', 'second'])
       ...:

    In [2]: index
    Out[2]:
    MultiIndex([(0, 'one'),
                (0, 'two'),
                (1, 'one'),
                (1, 'two'),
                (2, 'one'),
                (2, 'two')],
               names=['first', 'second'])

    In [3]: index.levels
    Out[3]: FrozenList([[0, 1, 2], ['one', 'two']])

    In [4]: index.codes
    Out[4]: FrozenList([[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])

    In [5]: index.names
    Out[5]: FrozenList(['first', 'second'])

You can probably guess that the codes determine which unique element is identified with that location at each layer of the index. It’s important to note that sortedness is determined solely from the integer codes and does not check (or care) whether the levels themselves are sorted. Fortunately, the constructors from_tuples and from_arrays ensure that this is true, but if you compute the levels and codes yourself, please be careful.
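
The relationship between levels and codes can be sketched in pure Python using the same values as the session above. The helper names materialize and codes_are_sorted are hypothetical, introduced only for this illustration; they are not pandas functions. The second half shows why hand-built levels need care: the codes look sorted even though the labels they produce are not.

```python
levels = [[0, 1, 2], ["one", "two"]]
codes = [[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]]


def materialize(levels, codes):
    """Rebuild the label tuples: codes[k][i] selects an entry of levels[k]."""
    return [
        tuple(level[code] for level, code in zip(levels, row))
        for row in zip(*codes)
    ]


def codes_are_sorted(codes):
    """Check lexicographic sortedness using only the integer codes."""
    rows = list(zip(*codes))
    return all(a <= b for a, b in zip(rows, rows[1:]))


print(materialize(levels, codes)[:3])  # [(0, 'one'), (0, 'two'), (1, 'one')]
print(codes_are_sorted(codes))         # True

# Same codes, but the second level is stored in unsorted order.
unsorted_levels = [[0, 1, 2], ["two", "one"]]
tuples = materialize(unsorted_levels, codes)
print(codes_are_sorted(codes))   # True: the check never looks at the levels
print(tuples == sorted(tuples))  # False: the actual labels are out of order
```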

Values

pandas extends NumPy’s type system with custom types, like Categorical or datetimes with a timezone, so we have multiple notions of “values”. For 1-D containers (Index classes and Series) we have the following convention:

  • cls._ndarray_values is always a NumPy ndarray. Ideally, _ndarray_values is cheap to compute. For example, for a Categorical, this returns the codes, not the array of objects.
  • cls._values is the “best possible” array. This could be an ndarray, an ExtensionArray, or an Index subclass (note: we’re in the process of removing the index subclasses here so that it’s always an ndarray or ExtensionArray).

So, for example, Series[category]._values is a Categorical, while Series[category]._ndarray_values is the underlying codes.
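
The difference between “the codes” and “the array of objects” can be shown with a toy encoding in pure Python. The encode function is a hypothetical helper written for this illustration and is not part of pandas; it sorts the categories only to keep the output deterministic. In this analogy the integer codes play the role of _ndarray_values, while the categories together with the codes play the role of the Categorical held in _values.

```python
def encode(values):
    """Split values into unique categories plus one integer code per value."""
    categories = sorted(set(values))
    position = {category: i for i, category in enumerate(categories)}
    codes = [position[value] for value in values]
    return categories, codes


values = ["b", "a", "b", "c"]
categories, codes = encode(values)
print(categories)  # ['a', 'b', 'c']
print(codes)       # [1, 0, 1, 2]

# The original objects can be recovered from the two pieces when needed.
decoded = [categories[code] for code in codes]
print(decoded == values)  # True
```

Handing out the codes is cheap because they already exist as plain integers, whereas producing the array of objects requires the decoding step.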

Subclassing pandas data structures

This section has been moved to Subclassing pandas data structures.