Developer

This section will focus on downstream applications of pandas.

Storing pandas DataFrame objects in Apache Parquet format

The Apache Parquet format provides key-value metadata at the file and column level, stored in the footer of the Parquet file:

    5: optional list<KeyValue> key_value_metadata

where KeyValue is

    struct KeyValue {
      1: required string key
      2: optional string value
    }

So that a pandas.DataFrame can be faithfully reconstructed, we store a pandas metadata key in the FileMetaData, with the value stored as:

    {'index_columns': ['__index_level_0__', '__index_level_1__', ...],
     'column_indexes': [<ci0>, <ci1>, ..., <ciN>],
     'columns': [<c0>, <c1>, ...],
     'pandas_version': $VERSION}

Here, <c0>/<ci0> and so forth are dictionaries containing the metadata for each column, including the index columns. Each dictionary has the following JSON form:

    {'name': column_name,
     'field_name': parquet_column_name,
     'pandas_type': pandas_type,
     'numpy_type': numpy_type,
     'metadata': metadata}
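As a rough illustration of how a reader might consume this layout, the sketch below assumes the file-level key/value pairs have already been read into a plain Python dict mapping bytes to bytes, with the JSON document stored under the key b'pandas'. The helper name load_pandas_metadata and the sample footer contents are hypothetical and only serve the example.

```python
import json


def load_pandas_metadata(key_value_metadata):
    """Decode the JSON document stored under the 'pandas' key, if present."""
    raw = key_value_metadata.get(b"pandas")
    if raw is None:
        # The file was not written with pandas metadata.
        return None
    return json.loads(raw.decode("utf-8"))


# Hypothetical footer contents: one data column plus a default index.
sample_footer = {
    b"pandas": json.dumps({
        "index_columns": ["__index_level_0__"],
        "column_indexes": [],
        "columns": [
            {"name": "c0", "field_name": "c0", "pandas_type": "int8",
             "numpy_type": "int8", "metadata": None},
            {"name": None, "field_name": "__index_level_0__",
             "pandas_type": "int64", "numpy_type": "int64", "metadata": None},
        ],
        "pandas_version": "0.20.0",
    }).encode("utf-8")
}

pandas_metadata = load_pandas_metadata(sample_footer)
column_names = [c["field_name"] for c in pandas_metadata["columns"]]
```

Returning None when the key is absent lets a caller fall back to treating the file as plain Parquet without pandas-specific reconstruction.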

Note

Every index column is stored with a name matching the pattern __index_level_\d+__, and its corresponding column information can be found with the following code snippet.

Following this naming convention isn’t strictly necessary, but is strongly suggested for compatibility with Arrow.

Here’s an example of how the index metadata is structured in pyarrow:

    # assuming there's at least 3 levels in the index
    index_columns = metadata['index_columns']  # noqa: F821
    columns = metadata['columns']  # noqa: F821
    ith_index = 2
    assert index_columns[ith_index] == '__index_level_2__'
    ith_index_info = columns[-len(index_columns):][ith_index]
    ith_index_level_name = ith_index_info['name']
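The lookup above depends on a metadata object defined elsewhere. A self-contained variant, using a made-up three-level index (the level names 'year', 'month' and 'day' and the helper names are hypothetical), might look like this. It relies on the same convention as the snippet: the index descriptors are the trailing entries of 'columns'.

```python
def _descriptor(name, field_name, pandas_type):
    """Build a column descriptor in the JSON form described above."""
    return {"name": name, "field_name": field_name,
            "pandas_type": pandas_type, "numpy_type": pandas_type,
            "metadata": None}


# Hypothetical metadata: one data column followed by three index levels.
metadata = {
    "index_columns": ["__index_level_0__", "__index_level_1__",
                      "__index_level_2__"],
    "columns": [
        _descriptor("value", "value", "float64"),
        _descriptor("year", "__index_level_0__", "int64"),
        _descriptor("month", "__index_level_1__", "int64"),
        _descriptor("day", "__index_level_2__", "int64"),
    ],
}


def index_level_info(metadata, i):
    """Return the descriptor of the i-th index level."""
    index_columns = metadata["index_columns"]
    if not index_columns:
        # Guard: columns[-0:] would return every column, not an empty list.
        raise IndexError("metadata describes no index columns")
    return metadata["columns"][-len(index_columns):][i]


ith_index_info = index_level_info(metadata, 2)
ith_index_level_name = ith_index_info["name"]
```

The explicit guard matters because slicing with -0 silently selects the whole list, which would return a data column instead of failing.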

pandas_type is the logical type of the column, and is one of:

  • Boolean: 'bool'
  • Integers: 'int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64'
  • Floats: 'float16', 'float32', 'float64'
  • Date and Time Types: 'datetime', 'datetimetz', 'timedelta'
  • String: 'unicode', 'bytes'
  • Categorical: 'categorical'
  • Other Python objects: 'object'
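A consumer of this metadata may want to reject descriptors whose pandas_type falls outside the list above. The following is a minimal sketch of such a check. The names PANDAS_TYPES and is_valid_pandas_type are hypothetical, and the set simply transcribes the values listed in this section.

```python
# Logical types enumerated in this section, grouped as in the list above.
PANDAS_TYPES = frozenset({
    "bool",
    "int8", "int16", "int32", "int64",
    "uint8", "uint16", "uint32", "uint64",
    "float16", "float32", "float64",
    "datetime", "datetimetz", "timedelta",
    "unicode", "bytes",
    "categorical",
    "object",
})


def is_valid_pandas_type(pandas_type):
    """Return True if the value is one of the documented logical types."""
    return pandas_type in PANDAS_TYPES
```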

The numpy_type is the physical storage type of the column, which is the result of str(dtype) for the underlying NumPy array that holds the data. So for datetimetz this is datetime64[ns], and for categorical it may be any of the supported integer categorical types.

The metadata field is None except for:

  • datetimetz: {'timezone': zone, 'unit': 'ns'}, e.g. {'timezone': 'America/New_York', 'unit': 'ns'}. The 'unit' is optional, and if omitted it is assumed to be nanoseconds.
  • categorical: {'num_categories': K, 'ordered': is_ordered, 'type': $TYPE}
      • Here 'type' is optional, and can be a nested pandas type specification (but not categorical).
  • unicode: {'encoding': encoding}
      • The encoding is optional, and if not present is UTF-8.
  • object: {'encoding': encoding}. Objects can be serialized and stored in BYTE_ARRAY Parquet columns. The encoding can be one of:
      • 'pickle'
      • 'msgpack'
      • 'bson'
      • 'json'
  • timedelta: {'unit': 'ns'}. The 'unit' is optional, and if omitted it is assumed to be nanoseconds. This metadata is optional altogether.

For types other than these, the 'metadata' key can be omitted. Implementations can assume None if the key is not present.
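The defaulting rules above can be sketched as a small helper that fills in the optional values. The function name resolve_type_metadata is hypothetical, and the behavior for cases this section does not spell out (for example, an object column with no metadata) is an assumption of the sketch.

```python
def resolve_type_metadata(column):
    """Return the type-specific metadata with documented defaults applied."""
    pandas_type = column["pandas_type"]
    # A missing key and an explicit None are treated the same way.
    metadata = column.get("metadata") or {}

    if pandas_type == "datetimetz":
        return {"timezone": metadata["timezone"],
                "unit": metadata.get("unit", "ns")}
    if pandas_type == "timedelta":
        # The whole metadata entry is optional for timedelta.
        return {"unit": metadata.get("unit", "ns")}
    if pandas_type == "unicode":
        return {"encoding": metadata.get("encoding", "UTF-8")}
    if pandas_type == "categorical":
        resolved = {"num_categories": metadata["num_categories"],
                    "ordered": metadata["ordered"]}
        if "type" in metadata:  # optional nested type specification
            resolved["type"] = metadata["type"]
        return resolved
    if pandas_type == "object":
        # Assumption: without an encoding there is nothing to resolve.
        return dict(metadata) or None
    return None
```

For instance, a datetimetz descriptor carrying only a timezone resolves to a dictionary with 'unit' set to 'ns', while an int64 descriptor resolves to None.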

As an example of fully-formed metadata:

    {'index_columns': ['__index_level_0__'],
     'column_indexes': [
         {'name': None,
          'field_name': 'None',
          'pandas_type': 'unicode',
          'numpy_type': 'object',
          'metadata': {'encoding': 'UTF-8'}}
     ],
     'columns': [
         {'name': 'c0',
          'field_name': 'c0',
          'pandas_type': 'int8',
          'numpy_type': 'int8',
          'metadata': None},
         {'name': 'c1',
          'field_name': 'c1',
          'pandas_type': 'bytes',
          'numpy_type': 'object',
          'metadata': None},
         {'name': 'c2',
          'field_name': 'c2',
          'pandas_type': 'categorical',
          'numpy_type': 'int16',
          'metadata': {'num_categories': 1000, 'ordered': False}},
         {'name': 'c3',
          'field_name': 'c3',
          'pandas_type': 'datetimetz',
          'numpy_type': 'datetime64[ns]',
          'metadata': {'timezone': 'America/Los_Angeles'}},
         {'name': 'c4',
          'field_name': 'c4',
          'pandas_type': 'object',
          'numpy_type': 'object',
          'metadata': {'encoding': 'pickle'}},
         {'name': None,
          'field_name': '__index_level_0__',
          'pandas_type': 'int64',
          'numpy_type': 'int64',
          'metadata': None}
     ],
     'pandas_version': '0.20.0'}
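To show how such a document might be consumed, the sketch below separates data columns from index columns by matching each descriptor's 'field_name' against 'index_columns'. It uses an abridged copy of the example (only c0, c3 and the index), and the helper name split_columns is hypothetical.

```python
def split_columns(metadata):
    """Partition column descriptors into (data_columns, index_columns)."""
    index_fields = set(metadata["index_columns"])
    data, index = [], []
    for column in metadata["columns"]:
        # Index descriptors are recognized by their Parquet field name.
        target = index if column["field_name"] in index_fields else data
        target.append(column)
    return data, index


# Abridged version of the fully-formed example above.
example = {
    "index_columns": ["__index_level_0__"],
    "columns": [
        {"name": "c0", "field_name": "c0", "pandas_type": "int8",
         "numpy_type": "int8", "metadata": None},
        {"name": "c3", "field_name": "c3", "pandas_type": "datetimetz",
         "numpy_type": "datetime64[ns]",
         "metadata": {"timezone": "America/Los_Angeles"}},
        {"name": None, "field_name": "__index_level_0__",
         "pandas_type": "int64", "numpy_type": "int64", "metadata": None},
    ],
    "pandas_version": "0.20.0",
}

data_columns, index_columns = split_columns(example)
data_names = [c["name"] for c in data_columns]
index_fields = [c["field_name"] for c in index_columns]
```

Matching on 'field_name' rather than 'name' is deliberate here: an unnamed index level has 'name' set to None, so only the field name identifies it reliably.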