Development roadmap

Authors: Stephan Hoyer, Joe Hamman and xarray developers

Date: July 24, 2018

Xarray is an open source Python library for labeled multidimensional arrays and datasets.

Our philosophy

Why has xarray been successful? In our opinion:

Xarray does a great job of solving specific use-cases for multidimensional data analysis:
- The dominant use-case for xarray is the analysis of gridded datasets in the geosciences, e.g., as part of the Pangeo project.
- Xarray is also used more broadly in the physical sciences, where we’ve found the needs for analyzing multidimensional datasets are remarkably consistent (e.g., see SunPy and PlasmaPy).
- Finally, xarray is used in a variety of other domains, including finance, probabilistic programming and genomics.
Xarray is also a domain-agnostic solution:
- We focus on providing a flexible set of functionality related to labeled multidimensional arrays, rather than solving particular problems.
- This facilitates collaboration between users with different needs, and helps us attract a broad community of contributors.
- Importantly, this retains flexibility for use cases that don’t fit particularly well into existing frameworks.
Xarray integrates well with other libraries in the scientific Python stack.
- We leverage first-class external libraries for core features of xarray (e.g., NumPy for ndarrays, pandas for indexing, dask for parallel computing).
- We expose our internal abstractions to users (e.g., apply_ufunc()), which facilitates extending xarray in various ways.
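As an illustration of the last point, the following is a minimal sketch of how apply_ufunc() can wrap a plain NumPy function so that it operates on labeled data. The function name standardize, the dimension names and the sample values are made up for this example.

```python
import numpy as np
import xarray as xr


def standardize(values):
    """Plain NumPy function: rescale along the last axis to zero mean, unit variance."""
    mean = values.mean(axis=-1, keepdims=True)
    std = values.std(axis=-1, keepdims=True)
    return (values - mean) / std


# Illustrative data: two stations, three time steps each.
da = xr.DataArray(
    np.array([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]),
    dims=("station", "time"),
    coords={"station": ["a", "b"]},
)

# apply_ufunc moves the core dimension "time" to the last axis before calling
# the function, then re-attaches dimension names and coordinates to the result.
result = xr.apply_ufunc(
    standardize,
    da,
    input_core_dims=[["time"]],
    output_core_dims=[["time"]],
)
```

The wrapped function never sees labels; xarray handles the bookkeeping around it, which is what makes this abstraction useful for extensions.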

Together, these features have made xarray a first-class choice for labeled multidimensional arrays in Python.

We want to double down on xarray’s strengths by making it an even more flexible and powerful tool for multidimensional data analysis. We want to continue to engage xarray’s core geoscience users, and also to reach out to new domains to learn from other successful data models, like those of yt or the OLAP cube.

Specific needs

The user community has voiced a number of specific needs related to how xarray interfaces with domain-specific problems. Xarray may not solve all of these issues directly, but these areas provide opportunities for xarray to provide better, more extensible interfaces. Some examples of these common needs are:

Non-regular grids (e.g., staggered and unstructured meshes).
Physical units.
Lazily computed arrays (e.g., for coordinate systems).
New file-formats.

Technical vision

We think the right approach to extending xarray’s user community and the usefulness of the project is to focus on improving key interfaces that can be used externally to meet domain-specific needs.

We can generalize the community’s needs into three main categories:

More flexible grids/indexing.
More flexible arrays/computing.
More flexible storage backends.

Each of these is detailed further in the subsections below.

Flexible indexes

Xarray currently keeps track of indexes associated with coordinates by storing them in the form of a pandas.Index in special xarray.IndexVariable objects.

The limitations of this model became clear with the addition of pandas.MultiIndex support in xarray 0.9, where a single index corresponds to multiple xarray variables. MultiIndex support is highly useful, but xarray now has numerous special cases to check for MultiIndex levels.
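To make the one-index-to-many-variables situation concrete, here is a small sketch using xarray's stack() method. The array contents and the names x, y and z are illustrative only, and the exact representation of the level coordinates may vary between xarray versions.

```python
import numpy as np
import pandas as pd
import xarray as xr

da = xr.DataArray(
    np.arange(6).reshape(2, 3),
    dims=("x", "y"),
    coords={"x": ["a", "b"], "y": [0, 1, 2]},
)

# Each dimension coordinate is backed by a single pandas.Index.
x_index = da.indexes["x"]

# Stacking combines both coordinates into one pandas.MultiIndex on "z".
stacked = da.stack(z=("x", "y"))
z_index = stacked.indexes["z"]

# The levels "x" and "y" are still selectable by name, even though a
# single index object now stands behind both of them.
subset = stacked.sel(x="a")
```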

A cleaner model would be to elevate indexes to an explicit part of xarray’s data model, e.g., as attributes on the Dataset and DataArray classes. Indexes would need to be propagated along with coordinates in xarray operations, but would no longer need to have a one-to-one correspondence with coordinate variables. Instead, an index should be able to refer to multiple (possibly multidimensional) coordinates that define it. See GH1603 for full details.

Specific tasks:

Add an indexes attribute to xarray.Dataset and xarray.DataArray, as dictionaries that map from coordinate names to xarray index objects.
Use the new index interface to write wrappers for pandas.Index, pandas.MultiIndex and scipy.spatial.KDTree.
Expose the interface externally to allow third-party libraries to implement custom indexing routines, e.g., for geospatial look-ups on the surface of the Earth.
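The tasks above could look roughly like the following sketch. Everything in it is hypothetical: the Index base class, its query() method and the NearestPointIndex class are invented for illustration and are not xarray's API. A brute-force NumPy search stands in for scipy.spatial.KDTree to keep the example dependency-light.

```python
from abc import ABC, abstractmethod

import numpy as np


class Index(ABC):
    """Hypothetical index interface; any real xarray API may differ."""

    # Names of the coordinates that together define this index.
    coord_names: tuple

    @abstractmethod
    def query(self, labels):
        """Translate a dict of coordinate labels into an integer position."""


class NearestPointIndex(Index):
    """Hypothetical nearest-neighbor look-up over several coordinates at once."""

    def __init__(self, coords):
        # coords maps coordinate name -> 1-D sequence, all of equal length.
        self.coord_names = tuple(coords)
        columns = [np.asarray(coords[name], dtype=float) for name in self.coord_names]
        self._points = np.column_stack(columns)

    def query(self, labels):
        target = np.array([labels[name] for name in self.coord_names], dtype=float)
        # Brute-force Euclidean search (not geodesic distance); a
        # scipy.spatial.KDTree could replace this for large point sets.
        distances = np.linalg.norm(self._points - target, axis=1)
        return int(np.argmin(distances))


lat = [10.0, 10.0, 20.0, 20.0]
lon = [100.0, 110.0, 100.0, 110.0]
index = NearestPointIndex({"lat": lat, "lon": lon})

# Dictionary from coordinate name to index object: both names share one
# index, so there is no one-to-one tie between coordinates and indexes.
indexes = {name: index for name in index.coord_names}

position = indexes["lat"].query({"lat": 19.0, "lon": 108.5})
```

The point of the design is that the index, not the individual coordinate, owns the look-up logic, so a third-party library could supply its own subclass.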

In addition to the new features it directly enables, this cleanup will allow xarray to more easily implement some long-awaited features that build upon indexing, such as groupby operations with multiple variables.

Flexible arrays

Xarray currently supports wrapping multidimensional arrays defined by NumPy, dask and, to a limited extent, pandas. It would be nice to have interfaces that allow xarray to wrap alternative N-D array implementations, e.g.:

Arrays holding physical units.
Lazily computed arrays.
Other ndarray objects, e.g., sparse, xnd, xtensor.

Our strategy has been to pursue upstream improvements in NumPy (see NEP-22) for supporting a complete duck-typing interface with NumPy’s higher-level array API. Improvements in NumPy’s support for custom data types would also be highly useful for xarray users.

By pursuing these improvements in NumPy we hope to extend the benefits to the full scientific Python community, and avoid tight coupling between xarray and specific third-party libraries (e.g., for implementing units). This will allow xarray to maintain its domain-agnostic strengths.

We expect that we may eventually add some minimal interfaces in xarray for features that we delegate to external array libraries (e.g., for getting units and changing units). If we do add these features, we expect them to be thin wrappers, with core functionality implemented by third-party libraries.
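The duck-typing idea behind this strategy can be sketched with NumPy's existing __array_ufunc__ protocol. The UnitArray class below is a hypothetical toy, not a real units library, and it supports only addition and subtraction of arrays that carry the same unit.

```python
import numpy as np


class UnitArray:
    """Hypothetical minimal array-with-units wrapper (illustration only)."""

    def __init__(self, values, unit):
        self.values = np.asarray(values)
        self.unit = unit

    # Attributes that array consumers commonly inspect on duck arrays.
    @property
    def shape(self):
        return self.values.shape

    @property
    def dtype(self):
        return self.values.dtype

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Only plain calls of add/subtract between UnitArrays are handled;
        # everything else is declined so NumPy raises a TypeError.
        supported = ufunc in (np.add, np.subtract) and method == "__call__"
        if not supported or kwargs or not all(isinstance(x, UnitArray) for x in inputs):
            return NotImplemented
        units = {x.unit for x in inputs}
        if len(units) != 1:
            raise ValueError(f"incompatible units: {sorted(units)}")
        raw = ufunc(*(x.values for x in inputs))
        return UnitArray(raw, units.pop())


# NumPy dispatches np.add to UnitArray.__array_ufunc__, so the unit survives.
total = np.add(UnitArray([1.0, 2.0], "m"), UnitArray([3.0, 4.0], "m"))
```

A container such as xarray would only need to rely on the protocol, not on this particular class, which is how tight coupling to any one units library is avoided.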

Flexible storage

The xarray backends module has grown in size and complexity. Much of this growth has been “organic” and mostly to support incremental additions to the supported backends. This has left us with a fragile internal API that is difficult for even experienced xarray developers to use. Moreover, the lack of a public-facing API for building xarray backends means that users cannot easily build backend interfaces for xarray in third-party libraries.

The idea of refactoring the backends API and exposing it to users was originally proposed in GH1970. The idea would be to develop a well-tested and generic backend base class and associated utilities for external use. Specific tasks for this development would include:

Exposing an abstract backend for writing new storage systems.
Exposing utilities for features like automatic closing of files, LRU-caching and explicit/lazy indexing.
Possibly moving some infrequently used backends to third-party packages.
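As a rough sketch of what such a base class and its utilities might look like, the example below uses only the standard library. All names (AbstractBackend, LazyVariable, InMemoryBackend) are hypothetical and do not correspond to xarray's actual backend API; LRU-caching is left out for brevity.

```python
from abc import ABC, abstractmethod


class AbstractBackend(ABC):
    """Hypothetical base class for storage backends (not xarray's API)."""

    @abstractmethod
    def variable_names(self):
        """Return the names of the variables in the store."""

    @abstractmethod
    def read(self, name, key):
        """Return the values of variable `name` selected by `key`."""

    @abstractmethod
    def close(self):
        """Release any underlying resources."""

    # Context-manager support gives every backend automatic closing.
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()


class LazyVariable:
    """Defers reading from the backend until the variable is indexed."""

    def __init__(self, backend, name):
        self.backend = backend
        self.name = name

    def __getitem__(self, key):
        return self.backend.read(self.name, key)


class InMemoryBackend(AbstractBackend):
    """Toy backend storing variables in a dict, used here for demonstration."""

    def __init__(self, data):
        self._data = dict(data)
        self.closed = False
        self.reads = 0  # counts how often data was actually fetched

    def variable_names(self):
        return list(self._data)

    def read(self, name, key):
        if self.closed:
            raise RuntimeError("backend is closed")
        self.reads += 1
        return self._data[name][key]

    def close(self):
        self.closed = True


with InMemoryBackend({"t": [10, 11, 12, 13]}) as store:
    var = LazyVariable(store, "t")  # nothing is read yet
    reads_before = store.reads
    first_two = var[:2]  # indexing triggers exactly one read
```

A third-party package would subclass the base class and inherit the closing behavior, while lazy indexing stays in a shared utility.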

Engaging more users

Like many open-source projects, the documentation of xarray has grown together with the library’s features. While we think that the xarray documentation is comprehensive already, we acknowledge that the adoption of xarray might be slowed down because of the substantial time investment required to learn its working principles. In particular, non-computer scientists or users less familiar with the pydata ecosystem might find it difficult to learn xarray and realize how xarray can help them in their daily work.

In order to lower this adoption barrier, we propose to:

Develop entry-level tutorials for users with different backgrounds. For example, we would like to develop tutorials for users with or without previous knowledge of pandas, numpy, netCDF, etc. These tutorials may be built as part of xarray’s documentation or included in a separate repository to enable interactive use (e.g., mybinder.org).
Document typical user workflows in a dedicated website, following the example of dask-stories.
Write a basic glossary that defines terms that might not be familiar to all (e.g., “lazy”, “labeled”, “serialization”, “indexing”, “backend”).

Administrative

Current core developers

Stephan Hoyer
Ryan Abernathey
Joe Hamman
Benoit Bovy
Fabien Maussion
Keisuke Fujii
Maximilian Roos

NumFOCUS

On July 16, 2018, Joe and Stephan submitted xarray’s fiscal sponsorship application to NumFOCUS.