Reading and writing files

xarray supports direct serialization and IO to several file formats, from simple Pickle files to the more flexible netCDF format (recommended).

Pickle

The simplest way to serialize an xarray object is to use Python’s built-in pickle module:

    In [1]: import pickle

    In [2]: ds = xr.Dataset({'foo': (('x', 'y'), np.random.rand(4, 5))},
       ...:                 coords={'x': [10, 20, 30, 40],
       ...:                         'y': pd.date_range('2000-01-01', periods=5),
       ...:                         'z': ('x', list('abcd'))})
       ...:

    # use the highest protocol (-1) because it is way faster than the default
    # text based pickle format
    In [3]: pkl = pickle.dumps(ds, protocol=-1)

    In [4]: pickle.loads(pkl)
    Out[4]:
    <xarray.Dataset>
    Dimensions:  (x: 4, y: 5)
    Coordinates:
      * x        (x) int64 10 20 30 40
      * y        (y) datetime64[ns] 2000-01-01 2000-01-02 ... 2000-01-04 2000-01-05
        z        (x) <U1 'a' 'b' 'c' 'd'
    Data variables:
        foo      (x, y) float64 0.127 0.9667 0.2605 0.8972 ... 0.7768 0.5948 0.1376

Pickling is important because it doesn’t require any external libraries and lets you use xarray objects with Python modules like multiprocessing or Dask. However, pickling is not recommended for long-term storage.

Restoring a pickle requires that the internal structure of the types for the pickled data remain unchanged. Because the internal design of xarray is still being refined, we make no guarantees (at this point) that objects pickled with this version of xarray will work in future versions.

Note

When pickling an object opened from a NetCDF file, the pickle file will contain a reference to the file on disk. If you want to store the actual array values, load it into memory first with load() or compute().
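As a minimal sketch (using the saved_on_disk.nc file written in the netCDF section below):

    import pickle
    import xarray as xr

    ds_disk = xr.open_dataset('saved_on_disk.nc')
    # .load() pulls the array values into memory, so the pickle is self-contained
    pkl = pickle.dumps(ds_disk.load(), protocol=-1)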

Dictionary

We can convert a Dataset (or a DataArray) to a dict using to_dict():

    In [5]: d = ds.to_dict()

    In [6]: d
    Out[6]:
    {'coords': {'x': {'dims': ('x',), 'attrs': {}, 'data': [10, 20, 30, 40]},
      'y': {'dims': ('y',),
       'attrs': {},
       'data': [datetime.datetime(2000, 1, 1, 0, 0),
        datetime.datetime(2000, 1, 2, 0, 0),
        datetime.datetime(2000, 1, 3, 0, 0),
        datetime.datetime(2000, 1, 4, 0, 0),
        datetime.datetime(2000, 1, 5, 0, 0)]},
      'z': {'dims': ('x',), 'attrs': {}, 'data': ['a', 'b', 'c', 'd']}},
     'attrs': {},
     'dims': {'x': 4, 'y': 5},
     'data_vars': {'foo': {'dims': ('x', 'y'),
       'attrs': {},
       'data': [[0.12696983303810094,
         0.966717838482003,
         0.26047600586578334,
         0.8972365243645735,
         0.37674971618967135],
        [0.33622174433445307,
         0.45137647047539964,
         0.8402550832613813,
         0.12310214428849964,
         0.5430262020470384],
        [0.37301222522143085,
         0.4479968246859435,
         0.12944067971751294,
         0.8598787065799693,
         0.8203883631195572],
        [0.35205353914802473,
         0.2288873043216132,
         0.7767837505077176,
         0.5947835894851238,
         0.1375535565632705]]}}}

We can create a new xarray object from a dict using from_dict():

    In [7]: ds_dict = xr.Dataset.from_dict(d)

    In [8]: ds_dict
    Out[8]:
    <xarray.Dataset>
    Dimensions:  (x: 4, y: 5)
    Coordinates:
      * x        (x) int64 10 20 30 40
      * y        (y) datetime64[ns] 2000-01-01 2000-01-02 ... 2000-01-04 2000-01-05
        z        (x) <U1 'a' 'b' 'c' 'd'
    Data variables:
        foo      (x, y) float64 0.127 0.9667 0.2605 0.8972 ... 0.7768 0.5948 0.1376

Dictionary support allows for flexible use of xarray objects. It doesn’t require external libraries and dicts can easily be pickled, or converted to json, or geojson. All the values are converted to lists, so dicts might be quite large.
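As a minimal sketch of the JSON case (the filename ds.json is arbitrary; default=str is needed because datetime coordinates are not JSON-serializable):

    import json

    with open('ds.json', 'w') as f:
        json.dump(ds.to_dict(), f, default=str)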

To export just the dataset schema, without the data itself, use the data=False option:

    In [9]: ds.to_dict(data=False)
    Out[9]:
    {'coords': {'x': {'dims': ('x',),
       'attrs': {},
       'dtype': 'int64',
       'shape': (4,)},
      'y': {'dims': ('y',), 'attrs': {}, 'dtype': 'datetime64[ns]', 'shape': (5,)},
      'z': {'dims': ('x',), 'attrs': {}, 'dtype': '<U1', 'shape': (4,)}},
     'attrs': {},
     'dims': {'x': 4, 'y': 5},
     'data_vars': {'foo': {'dims': ('x', 'y'),
       'attrs': {},
       'dtype': 'float64',
       'shape': (4, 5)}}}

This can be useful for generating indices of dataset contents to expose to search indices or other automated data discovery tools.

netCDF

The recommended way to store xarray data structures is netCDF, which is a binary file format for self-described datasets that originated in the geosciences. xarray is based on the netCDF data model, so netCDF files on disk directly correspond to Dataset objects.

NetCDF is supported on almost all platforms, and parsers exist for the vast majority of scientific programming languages. Recent versions of netCDF are based on the even more widely used HDF5 file-format.

Tip

If you aren’t familiar with this data format, the netCDF FAQ is a good place to start.

Reading and writing netCDF files with xarray requires scipy or the netCDF4-Python library to be installed (the latter is required to read/write netCDF V4 files and use the compression options described below).

We can save a Dataset to disk using the Dataset.to_netcdf method:

    In [10]: ds.to_netcdf('saved_on_disk.nc')

By default, the file is saved as netCDF4 (assuming netCDF4-Python is installed). You can control the format and engine used to write the file with the format and engine arguments.
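For example, a minimal sketch choosing an explicit format and engine (the filename is arbitrary):

    # write a netCDF4-classic file using the netCDF4-Python backend
    ds.to_netcdf('saved_on_disk_classic.nc', format='NETCDF4_CLASSIC', engine='netcdf4')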

We can load netCDF files to create a new Dataset using open_dataset():

    In [11]: ds_disk = xr.open_dataset('saved_on_disk.nc')

    In [12]: ds_disk
    Out[12]:
    <xarray.Dataset>
    Dimensions:  (x: 4, y: 5)
    Coordinates:
      * x        (x) int64 10 20 30 40
      * y        (y) datetime64[ns] 2000-01-01 2000-01-02 ... 2000-01-04 2000-01-05
        z        (x) object ...
    Data variables:
        foo      (x, y) float64 ...

Similarly, a DataArray can be saved to disk using the DataArray.to_netcdf method, and loaded from disk using the open_dataarray() function. As netCDF files correspond to Dataset objects, these functions internally convert the DataArray to a Dataset before saving, and then convert back when loading, ensuring that the DataArray that is loaded is always exactly the same as the one that was saved.

A dataset can also be loaded or written to a specific group within a netCDF file. To load from a group, pass a group keyword argument to the open_dataset function. The group can be specified as a path-like string, e.g., to access subgroup ‘bar’ within group ‘foo’ pass ‘/foo/bar’ as the group argument. When writing multiple groups in one file, pass mode='a' to to_netcdf to ensure that each call does not delete the file.
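As a minimal sketch, using hypothetical group names and an arbitrary filename:

    # write two groups into one file; mode='a' on the second call keeps the first group
    ds.to_netcdf('groups.nc', group='/foo/bar', mode='w')
    ds.to_netcdf('groups.nc', group='/foo/baz', mode='a')

    # read a single group back
    ds_bar = xr.open_dataset('groups.nc', group='/foo/bar')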

Data is always loaded lazily from netCDF files. You can manipulate, slice and subset Dataset and DataArray objects, and no array values are loaded into memory until you try to perform some sort of actual computation. For an example of how these lazy arrays work, see the OPeNDAP section below.

It is important to note that when you modify values of a Dataset, even one linked to files on disk, only the in-memory copy you are manipulating in xarray is modified: the original file on disk is never touched.

Tip

xarray’s lazy loading of remote or on-disk datasets is often but not always desirable. Before performing computationally intense operations, it is often a good idea to load a Dataset (or DataArray) entirely into memory by invoking the load() method.

Datasets have a close() method to close the associated netCDF file. However, it’s often cleaner to use a with statement:

    # this automatically closes the dataset after use
    In [13]: with xr.open_dataset('saved_on_disk.nc') as ds:
       ....:     print(ds.keys())
       ....:
    KeysView(<xarray.Dataset>
    Dimensions:  (x: 4, y: 5)
    Coordinates:
      * x        (x) int64 10 20 30 40
      * y        (y) datetime64[ns] 2000-01-01 2000-01-02 ... 2000-01-04 2000-01-05
        z        (x) object ...
    Data variables:
        foo      (x, y) float64 ...)

Although xarray provides reasonable support for incremental reads of files on disk, it does not support incremental writes, which can be a useful strategy for dealing with datasets too big to fit into memory. Instead, xarray integrates with dask.array (see Parallel computing with Dask), which provides a fully featured engine for streaming computation.

It is possible to append or overwrite netCDF variables using the mode='a' argument. When using this option, all variables in the dataset will be written to the original netCDF file, regardless of whether they already exist in the original dataset.
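For example, a minimal sketch appending a hypothetical variable bar to the file written above:

    ds_extra = xr.Dataset({'bar': ('x', np.arange(4))}, coords={'x': [10, 20, 30, 40]})
    ds_extra.to_netcdf('saved_on_disk.nc', mode='a')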

Reading encoded data

NetCDF files follow some conventions for encoding datetime arrays (as numbers with a “units” attribute) and for packing and unpacking data (as described by the “scale_factor” and “add_offset” attributes). If the argument decode_cf=True (default) is given to open_dataset, xarray will attempt to automatically decode the values in the netCDF objects according to CF conventions. Sometimes this will fail, for example, if a variable has an invalid “units” or “calendar” attribute. For these cases, you can turn this decoding off manually.
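As a minimal sketch of the two levels of opting out:

    # turn off all CF decoding
    ds_raw = xr.open_dataset('saved_on_disk.nc', decode_cf=False)

    # or turn off only datetime decoding, keeping the rest of the CF conventions
    ds_no_times = xr.open_dataset('saved_on_disk.nc', decode_times=False)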

You can view this encoding information (among others) in the DataArray.encoding and Dataset.encoding attributes:

    In [14]: ds_disk['y'].encoding
    Out[14]:
    {'zlib': False,
     'shuffle': False,
     'complevel': 0,
     'fletcher32': False,
     'contiguous': True,
     'chunksizes': None,
     'source': 'saved_on_disk.nc',
     'original_shape': (5,),
     'dtype': dtype('int64'),
     'units': 'days since 2000-01-01 00:00:00',
     'calendar': 'proleptic_gregorian'}

    In [15]: ds_disk.encoding
    Out[15]:
    {'unlimited_dims': set(),
     'source': 'saved_on_disk.nc'}

Note that all operations that manipulate variables other than indexing will remove encoding information.

Writing encoded data

Conversely, you can customize how xarray writes netCDF files on disk by providing explicit encodings for each dataset variable. The encoding argument takes a dictionary with variable names as keys and variable specific encodings as values. These encodings are saved as attributes on the netCDF variables on disk, which allows xarray to faithfully read encoded data back into memory.

It is important to note that using encodings is entirely optional: if you do not supply any of these encoding options, xarray will write data to disk using a default encoding, or the options in the encoding attribute, if set. This works perfectly fine in most cases, but encoding can be useful for additional control, especially for enabling compression.

In the file on disk, these encodings are saved as attributes on each variable, which allow xarray and other CF-compliant tools for working with netCDF files to correctly read the data.

Scaling and type conversions

These encoding options work on any version of the netCDF file format:

  • dtype: Any valid NumPy dtype or string convertible to a dtype, e.g., 'int16' or 'float32'. This controls the type of the data written on disk.

  • _FillValue: Values of NaN in xarray variables are remapped to this value when saved on disk. This is important when converting floating point with missing values to integers on disk, because NaN is not a valid value for integer dtypes. As a default, variables with float types are attributed a _FillValue of NaN in the output file, unless explicitly disabled with an encoding {'_FillValue': None}.

  • scale_factor and add_offset: Used to convert from encoded data on disk to the decoded data in memory, according to the formula decoded = scale_factor * encoded + add_offset.

These parameters can be fruitfully combined to compress discretized data on disk. For example, to save the variable foo with a precision of 0.1 in 16-bit integers while converting NaN to -9999, we would use encoding={'foo': {'dtype': 'int16', 'scale_factor': 0.1, '_FillValue': -9999}}. Compression and decompression with such discretization is extremely fast.
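A minimal sketch of that call (the output filename is arbitrary):

    ds.to_netcdf('packed.nc',
                 encoding={'foo': {'dtype': 'int16',
                                   'scale_factor': 0.1,
                                   '_FillValue': -9999}})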

String encoding

xarray can write unicode strings to netCDF files in two ways:

  • As variable length strings. This is only supported on netCDF4 (HDF5) files.

  • By encoding strings into bytes, and writing encoded bytes as a character array. The default encoding is UTF-8.

By default, we use variable length strings for compatible files and fall back to using encoded character arrays. Character arrays can be selected even for netCDF4 files by setting the dtype field in encoding to S1 (corresponding to NumPy’s single-character bytes dtype).
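For example, a minimal sketch forcing a character array for the string coordinate z defined above (the filename is arbitrary):

    ds.to_netcdf('char_strings.nc', encoding={'z': {'dtype': 'S1'}})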

If character arrays are used:

  • The string encoding that was used is stored on disk in the _Encoding attribute, which matches an ad-hoc convention adopted by the netCDF4-Python library. At the time of this writing (October 2017), a standard convention for indicating string encoding for character arrays in netCDF files was still under discussion. Technically, you can use any string encoding recognized by Python if you feel the need to deviate from UTF-8, by setting the _Encoding field in encoding. But we don’t recommend it.

  • The character dimension name can be specified by the char_dim_name field of a variable’s encoding. If this is not specified, the default name for the character dimension is 'string%s' % data.shape[-1]. When decoding character arrays from existing files, the char_dim_name is added to the variable’s encoding so that it is preserved on a round trip, but the field can be edited by the user.

Warning

Missing values in bytes or unicode string arrays (represented by NaN in xarray) are currently written to disk as empty strings ''. This means missing values will not be restored when data is loaded from disk. This behavior is likely to change in the future (GH1647). Unfortunately, explicitly setting a _FillValue for string arrays to handle missing values doesn’t work yet either, though we also hope to fix this in the future.

Chunk based compression

zlib, complevel, fletcher32, contiguous and chunksizes can be used for enabling netCDF4/HDF5’s chunk based compression, as described in the documentation for createVariable for netCDF4-Python. This only works for netCDF4 files and thus requires using format='netCDF4' and either engine='netcdf4' or engine='h5netcdf'.
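For example, a minimal sketch enabling zlib compression for foo (the filename is arbitrary):

    ds.to_netcdf('compressed.nc', format='NETCDF4', engine='netcdf4',
                 encoding={'foo': {'zlib': True, 'complevel': 4}})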

Chunk based gzip compression can yield impressive space savings, especially for sparse data, but it comes with significant performance overhead. HDF5 libraries can only read complete chunks back into memory, and maximum decompression speed is in the range of 50-100 MB/s. Worse, HDF5’s compression and decompression currently cannot be parallelized with dask. For these reasons, we recommend trying discretization based compression (described above) first.

Time units

The units and calendar attributes control how xarray serializes datetime64 and timedelta64 arrays to datasets on disk as numeric values. The units encoding should be a string like 'days since 1900-01-01' for datetime64 data or a string like 'days' for timedelta64 data. calendar should be one of the calendar types supported by netCDF4-python: ‘standard’, ‘gregorian’, ‘proleptic_gregorian’, ‘noleap’, ‘365_day’, ‘360_day’, ‘julian’, ‘all_leap’, ‘366_day’.

By default, xarray uses the ‘proleptic_gregorian’ calendar and units of the smallest time difference between values, with a reference time of the first time value.
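For example, a minimal sketch overriding these defaults for the datetime coordinate y of the dataset above (the filename is arbitrary):

    ds.to_netcdf('with_time_units.nc',
                 encoding={'y': {'units': 'days since 2000-01-01',
                                 'calendar': 'proleptic_gregorian'}})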

Iris

The Iris tool allows easy reading of common meteorological and climate model formats (including GRIB and UK MetOffice PP files) into Cube objects which are in many ways very similar to DataArray objects, while enforcing a CF-compliant data model. If iris is installed, xarray can convert a DataArray into a Cube using to_iris():

    In [16]: da = xr.DataArray(np.random.rand(4, 5), dims=['x', 'y'],
       ....:                   coords=dict(x=[10, 20, 30, 40],
       ....:                               y=pd.date_range('2000-01-01', periods=5)))
       ....:

    In [17]: cube = da.to_iris()

    In [18]: cube
    Out[18]: <iris 'Cube' of unknown / (unknown) (x: 4; y: 5)>

Conversely, we can create a new DataArray object from a Cube using from_iris():

    In [19]: da_cube = xr.DataArray.from_iris(cube)

    In [20]: da_cube
    Out[20]:
    <xarray.DataArray (x: 4, y: 5)>
    array([[0.8529  , 0.235507, 0.146227, 0.589869, 0.574012],
           [0.06127 , 0.590426, 0.24535 , 0.340445, 0.984729],
           [0.91954 , 0.037772, 0.861549, 0.753569, 0.405179],
           [0.343526, 0.170917, 0.394659, 0.641666, 0.274592]])
    Coordinates:
      * x        (x) int64 10 20 30 40
      * y        (y) datetime64[ns] 2000-01-01 2000-01-02 ... 2000-01-04 2000-01-05

OPeNDAP

xarray includes support for OPeNDAP (via the netCDF4 library or Pydap), which lets us access large datasets over HTTP.

For example, we can open a connection to GBs of weather data produced by the PRISM project, and hosted by IRI at Columbia:

    In [21]: remote_data = xr.open_dataset(
       ....:     'http://iridl.ldeo.columbia.edu/SOURCES/.OSU/.PRISM/.monthly/dods',
       ....:     decode_times=False)
       ....:

    In [22]: remote_data
    Out[22]:
    <xarray.Dataset>
    Dimensions:  (T: 1422, X: 1405, Y: 621)
    Coordinates:
      * X        (X) float32 -125.0 -124.958 -124.917 -124.875 -124.833 -124.792 -124.75 ...
      * T        (T) float32 -779.5 -778.5 -777.5 -776.5 -775.5 -774.5 -773.5 -772.5 -771.5 ...
      * Y        (Y) float32 49.9167 49.875 49.8333 49.7917 49.75 49.7083 49.6667 49.625 ...
    Data variables:
        ppt      (T, Y, X) float64 ...
        tdmean   (T, Y, X) float64 ...
        tmax     (T, Y, X) float64 ...
        tmin     (T, Y, X) float64 ...
    Attributes:
        Conventions:  IRIDL
        expires:      1375315200

Note

Like many real-world datasets, this dataset does not entirely follow CF conventions. Unexpected formats will usually cause xarray’s automatic decoding to fail. The way to work around this is either to set decode_cf=False in open_dataset to turn off all use of CF conventions, or to disable only the troublesome parser. In this case, we set decode_times=False because the time axis here provides the calendar attribute in a format that xarray does not expect (the integer 360 instead of a string like '360_day').

We can select and slice this data any number of times, and nothing is loaded over the network until we look at particular values:

    In [23]: tmax = remote_data['tmax'][:500, ::3, ::3]

    In [24]: tmax
    Out[24]:
    <xarray.DataArray 'tmax' (T: 500, Y: 207, X: 469)>
    [48541500 values with dtype=float64]
    Coordinates:
      * Y        (Y) float32 49.9167 49.7917 49.6667 49.5417 49.4167 49.2917 ...
      * X        (X) float32 -125.0 -124.875 -124.75 -124.625 -124.5 -124.375 ...
      * T        (T) float32 -779.5 -778.5 -777.5 -776.5 -775.5 -774.5 -773.5 ...
    Attributes:
        pointwidth:     120
        standard_name:  air_temperature
        units:          Celsius_scale
        expires:        1443657600

    # the data is downloaded automatically when we make the plot
    In [25]: tmax[0].plot()

[figure: opendap-prism-tmax.png, the plot produced by tmax[0].plot()]

Some servers require authentication before we can access the data. For this purpose we can explicitly create a PydapDataStore and pass in a Requests session object. For example, for HTTP Basic authentication:

    import xarray as xr
    import requests

    session = requests.Session()
    session.auth = ('username', 'password')

    store = xr.backends.PydapDataStore.open('http://example.com/data',
                                            session=session)
    ds = xr.open_dataset(store)

Pydap’s cas module has functions that generate custom sessions for servers that use CAS single sign-on. For example, to connect to servers that require NASA’s URS authentication:

    import xarray as xr
    from pydap.cas.urs import setup_session

    ds_url = 'https://gpm1.gesdisc.eosdis.nasa.gov/opendap/hyrax/example.nc'

    session = setup_session('username', 'password', check_url=ds_url)
    store = xr.backends.PydapDataStore.open(ds_url, session=session)

    ds = xr.open_dataset(store)

Rasterio

GeoTIFFs and other gridded raster datasets can be opened using rasterio, if rasterio is installed. Here is an example of how to use open_rasterio() to read one of rasterio’s test files:

    In [26]: rio = xr.open_rasterio('RGB.byte.tif')

    In [27]: rio
    Out[27]:
    <xarray.DataArray (band: 3, y: 718, x: 791)>
    [1703814 values with dtype=uint8]
    Coordinates:
      * band     (band) int64 1 2 3
      * y        (y) float64 2.827e+06 2.826e+06 2.826e+06 2.826e+06 2.826e+06 ...
      * x        (x) float64 1.021e+05 1.024e+05 1.027e+05 1.03e+05 1.033e+05 ...
    Attributes:
        res:        (300.0379266750948, 300.041782729805)
        transform:  (300.0379266750948, 0.0, 101985.0, 0.0, -300.041782729805, 28...
        is_tiled:   0
        crs:        +init=epsg:32618

The x and y coordinates are generated out of the file’s metadata (bounds, width, height), and they can be understood as cartesian coordinates defined in the file’s projection provided by the crs attribute. crs is a PROJ4 string which can be parsed by e.g. pyproj or rasterio. See Parsing rasterio’s geocoordinates for an example of how to convert these to longitudes and latitudes.

Warning

This feature has been added in xarray v0.9.6 and should still be considered experimental. Please report any bug you may find on xarray’s github repository.

Zarr

Zarr is a Python package providing an implementation of chunked, compressed, N-dimensional arrays. Zarr has the ability to store arrays in a range of ways, including in memory, in files, and in cloud-based object storage such as Amazon S3 and Google Cloud Storage. Xarray’s Zarr backend allows xarray to leverage these capabilities.

Warning

Zarr support is still an experimental feature. Please report any bugs or unexpected behavior via github issues.

Xarray can’t open just any zarr dataset, because xarray requires special metadata (attributes) describing the dataset dimensions and coordinates. At this time, xarray can only open zarr datasets that have been written by xarray. To write a dataset with zarr, we use the Dataset.to_zarr method. To write to a local directory, we pass a path to a directory:

    In [28]: ds = xr.Dataset({'foo': (('x', 'y'), np.random.rand(4, 5))},
       ....:                 coords={'x': [10, 20, 30, 40],
       ....:                         'y': pd.date_range('2000-01-01', periods=5),
       ....:                         'z': ('x', list('abcd'))})
       ....:

    In [29]: ds.to_zarr('path/to/directory.zarr')
    Out[29]: <xarray.backends.zarr.ZarrStore at 0x7f33f3d022e8>

(The suffix .zarr is optional; it is just a reminder that a zarr store lives there.) If the directory does not exist, it will be created. If a zarr store is already present at that path, an error will be raised, preventing it from being overwritten. To override this behavior and overwrite an existing store, add mode='w' when invoking to_zarr.

It is also possible to append to an existing store. For that, add mode='a' and set append_dim to the name of the dimension along which to append.

    In [30]: ds1 = xr.Dataset({'foo': (('x', 'y', 't'), np.random.rand(4, 5, 2))},
       ....:                  coords={'x': [10, 20, 30, 40],
       ....:                          'y': [1, 2, 3, 4, 5],
       ....:                          't': pd.date_range('2001-01-01', periods=2)})
       ....:

    In [31]: ds1.to_zarr('path/to/directory.zarr')
    Out[31]: <xarray.backends.zarr.ZarrStore at 0x7f33f36b7f60>

    In [32]: ds2 = xr.Dataset({'foo': (('x', 'y', 't'), np.random.rand(4, 5, 2))},
       ....:                  coords={'x': [10, 20, 30, 40],
       ....:                          'y': [1, 2, 3, 4, 5],
       ....:                          't': pd.date_range('2001-01-03', periods=2)})
       ....:

    In [33]: ds2.to_zarr('path/to/directory.zarr', mode='a', append_dim='t')
    Out[33]: <xarray.backends.zarr.ZarrStore at 0x7f33f3d71438>

To store variable length strings, use dtype=object.
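As a minimal sketch (the variable name and store path are hypothetical):

    ds_str = xr.Dataset({'name': ('x', np.array(['a', 'bb', 'ccc'], dtype=object))})
    ds_str.to_zarr('strings.zarr')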

To read back a zarr dataset that has been created this way, we use the open_zarr() method:

    In [34]: ds_zarr = xr.open_zarr('path/to/directory.zarr')

    In [35]: ds_zarr
    Out[35]:
    <xarray.Dataset>
    Dimensions:  (t: 4, x: 4, y: 5)
    Coordinates:
      * t        (t) datetime64[ns] 2001-01-01 2001-01-02 2001-01-03 2001-01-04
      * x        (x) int64 10 20 30 40
      * y        (y) int64 1 2 3 4 5
    Data variables:
        foo      (x, y, t) float64 dask.array<shape=(4, 5, 4), chunksize=(4, 5, 2)>

Cloud Storage Buckets

It is possible to read and write xarray datasets directly from / to cloud storage buckets using zarr. This example uses the gcsfs package to provide a MutableMapping interface to Google Cloud Storage, which we can then pass to xarray:

    import gcsfs
    fs = gcsfs.GCSFileSystem(project='<project-name>', token=None)
    gcsmap = gcsfs.mapping.GCSMap('<bucket-name>', gcs=fs, check=True, create=False)
    # write to the bucket
    ds.to_zarr(store=gcsmap)
    # read it back
    ds_gcs = xr.open_zarr(gcsmap)

Zarr Compressors and Filters

There are many different options for compression and filtering possible with zarr. These are described in the zarr documentation. These options can be passed to the to_zarr method as variable encoding. For example:

    In [36]: import zarr

    In [37]: compressor = zarr.Blosc(cname='zstd', clevel=3, shuffle=2)

    In [38]: ds.to_zarr('foo.zarr', encoding={'foo': {'compressor': compressor}})
    Out[38]: <xarray.backends.zarr.ZarrStore at 0x7f33f366e208>

Note

Not all native zarr compression and filtering options have been tested with xarray.

Consolidated Metadata

Xarray needs to read all of the zarr metadata when it opens a dataset. In some storage media, such as cloud object storage (e.g. Amazon S3), this can introduce significant overhead, because two separate HTTP calls to the object store must be made for each variable in the dataset. With version 2.3, zarr will support a feature called consolidated metadata, which allows all metadata for the entire dataset to be stored with a single key (by default called .zmetadata). This can drastically speed up opening the store. (For more information on this feature, consult the zarr docs.)

If you have zarr version 2.3 or greater, xarray can write and read stores with consolidated metadata. To write consolidated metadata, pass the consolidated=True option to the Dataset.to_zarr method:

    ds.to_zarr('foo.zarr', consolidated=True)

To read a consolidated store, pass the consolidated=True option to open_zarr():

    ds = xr.open_zarr('foo.zarr', consolidated=True)

Xarray can’t perform consolidation on pre-existing zarr datasets. This should be done directly from zarr, as described in the zarr docs.

GRIB format via cfgrib

xarray supports reading GRIB files via the ECMWF cfgrib python driver and ecCodes C-library, if they are installed. To open a GRIB file, supply engine='cfgrib' to open_dataset():

    In [39]: ds_grib = xr.open_dataset('example.grib', engine='cfgrib')

We recommend installing ecCodes via conda:

    conda install -c conda-forge eccodes
    pip install cfgrib

Formats supported by PyNIO

xarray can also read GRIB, HDF4 and other file formats supported by PyNIO, if PyNIO is installed. To use PyNIO to read such files, supply engine='pynio' to open_dataset().

We recommend installing PyNIO via conda:

    conda install -c conda-forge pynio

Formats supported by PseudoNetCDF

xarray can also read CAMx, BPCH, ARL PACKED BIT, and many other file formats supported by PseudoNetCDF, if PseudoNetCDF is installed. PseudoNetCDF can also provide Climate Forecasting Conventions to CMAQ files. In addition, PseudoNetCDF can automatically register custom readers that subclass PseudoNetCDF.PseudoNetCDFFile. PseudoNetCDF can identify readers heuristically, or format can be specified via a key in backend_kwargs.

To use PseudoNetCDF to read such files, supply engine='pseudonetcdf' to open_dataset().

Add backend_kwargs={'format': '<format name>'} where <format name> options are listed on the PseudoNetCDF page.
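For example, a minimal sketch (the filename is hypothetical; 'uamiv' is one of the format names listed on the PseudoNetCDF page):

    camx = xr.open_dataset('outputs.uamiv', engine='pseudonetcdf',
                           backend_kwargs={'format': 'uamiv'})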

CSV and other formats supported by Pandas

For more options (tabular formats and CSV files in particular), consider exporting your objects to pandas and using its broad range of IO tools. For CSV files, one might also consider xarray_extras.
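As a minimal sketch of a CSV round trip through pandas (the filename is arbitrary):

    import pandas as pd

    # to_dataframe() returns a DataFrame indexed by the dataset dimensions
    ds.to_dataframe().to_csv('saved_on_disk.csv')

    # read it back and rebuild an xarray object from the re-indexed DataFrame
    df = pd.read_csv('saved_on_disk.csv', index_col=['x', 'y'], parse_dates=['y'])
    ds_csv = xr.Dataset.from_dataframe(df)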

Combining multiple files

NetCDF files are often encountered in collections, e.g., with different files corresponding to different model runs. xarray can straightforwardly combine such files into a single Dataset by making use of concat(), merge(), combine_nested() and combine_by_coords(). For details on the difference between these functions see Combining data.

Note

Xarray includes support for manipulating datasets that don’t fit into memory with dask. If you have dask installed, you can open multiple files simultaneously using open_mfdataset():

    xr.open_mfdataset('my/files/*.nc')

This function automatically concatenates and merges multiple files into a single xarray dataset. It is the recommended way to open multiple files with xarray. For more details, see Combining along multiple dimensions, Reading and writing data and a blog post by Stephan Hoyer.

For example, here’s how we could approximate MFDataset from the netCDF4 library:

    from glob import glob
    import xarray as xr

    def read_netcdfs(files, dim):
        # glob expands paths with * to a list of files, like the unix shell
        paths = sorted(glob(files))
        datasets = [xr.open_dataset(p) for p in paths]
        combined = xr.concat(datasets, dim)
        return combined

    combined = read_netcdfs('/all/my/files/*.nc', dim='time')

This function will work in many cases, but it’s not very robust. First, it never closes files, which means it will fail once you need to load more than a few thousand files. Second, it assumes that you want all the data from each file and that it can all fit into memory. In many situations, you only need a small subset or an aggregated summary of the data from each file.

Here’s a slightly more sophisticated example of how to remedy these deficiencies:

    def read_netcdfs(files, dim, transform_func=None):
        def process_one_path(path):
            # use a context manager, to ensure the file gets closed after use
            with xr.open_dataset(path) as ds:
                # transform_func should do some sort of selection or
                # aggregation
                if transform_func is not None:
                    ds = transform_func(ds)
                # load all data from the transformed dataset, to ensure we can
                # use it after closing each original file
                ds.load()
                return ds

        paths = sorted(glob(files))
        datasets = [process_one_path(p) for p in paths]
        combined = xr.concat(datasets, dim)
        return combined

    # here we suppose we only care about the combined mean of each file;
    # you might also use indexing operations like .sel to subset datasets
    combined = read_netcdfs('/all/my/files/*.nc', dim='time',
                            transform_func=lambda ds: ds.mean())

This pattern works well and is very robust. We’ve used similar code to process tens of thousands of files constituting 100s of GB of data.