Array

Dask Array implements a subset of the NumPy ndarray interface using blockedalgorithms, cutting up the large array into many small arrays. This lets uscompute on arrays larger than memory using all of our cores. We coordinatethese blocked algorithms using Dask graphs.

Examples

Visit https://examples.dask.org/array.html to see and run examples usingDask Array.

Design

Dask arrays coordinate many numpy arrays

Dask arrays coordinate many NumPy arrays arranged into a grid. TheseNumPy arrays may live on disk or on other machines.

Common Uses

Dask Array is used in fields like atmospheric and oceanographic science, largescale imaging, genomics, numerical algorithms for optimization or statistics,and more.

Scope

Dask arrays support most of the NumPy interface like the following:

  • Arithmetic and scalar mathematics: +, *, exp, log, …
  • Reductions along axes: sum(), mean(), std(), sum(axis=0), …
  • Tensor contractions / dot products / matrix multiply: tensordot
  • Axis reordering / transpose: transpose
  • Slicing: x[:100, 500:100:-2]
  • Fancy indexing along single axes with lists or NumPy arrays: x[:, [10, 1, 5]]
  • Array protocols like array and array_ufunc
  • Some linear algebra: svd, qr, solve, solve_triangular, lstsq

However, Dask Array does not implement the entire NumPy interface. Users expecting thiswill be disappointed. Notably, Dask Array lacks the following features:

  • Much of np.linalg has not been implemented.This has been done by a number of excellent BLAS/LAPACK implementations,and is the focus of numerous ongoing academic research projects
  • Arrays with unknown shapes do not support all operations
  • Operations like sort which are notoriouslydifficult to do in parallel, and are of somewhat diminished value on verylarge data (you rarely actually need a full sort).Often we include parallel-friendly alternatives like topk
  • Dask Array doesn’t implement operations like tolist that would be veryinefficient for larger datasets. Likewise, it is very inefficient to iterateover a Dask array with for loops
  • Dask development is driven by immediate need, hence many lesser usedfunctions have not been implemented. Community contributions are encouraged

See the dask.array API for a more extensive list offunctionality.

Execution

By default, Dask Array uses the threaded scheduler in order to avoid datatransfer costs, and because NumPy releases the GIL well. It is also quiteeffective on a cluster using the dask.distributed scheduler.