Dask.distributed

Dask.distributed is a lightweight library for distributed computing in Python. It extends both the concurrent.futures and dask APIs to moderate-sized clusters.

See the quickstart to get started.

Motivation

Distributed serves to complement the existing PyData analysis stack. In particular it meets the following needs:

  • Low latency: Each task suffers about 1ms of overhead. A small computation and network roundtrip can complete in less than 10ms.
  • Peer-to-peer data sharing: Workers communicate with each other to share data. This removes central bottlenecks for data transfer.
  • Complex Scheduling: Supports complex workflows (not just map/filter/reduce) which are necessary for sophisticated algorithms used in nd-arrays, machine learning, image processing, and statistics.
  • Pure Python: Built in Python using well-known technologies. This eases installation, improves efficiency (for Python users), and simplifies debugging.
  • Data Locality: Scheduling algorithms cleverly execute computations where data lives. This minimizes network traffic and improves efficiency.
  • Familiar APIs: Compatible with the concurrent.futures API in the Python standard library. Compatible with the dask API for parallel algorithms.
  • Easy Setup: As a pure Python package, distributed is pip installable and easy to set up on your own cluster.

Architecture

Dask.distributed is a centrally managed, distributed, dynamic task scheduler. The central dask-scheduler process coordinates the actions of several dask-worker processes spread across multiple machines and the concurrent requests of several clients.
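As a rough sketch of how these pieces fit together (the address 192.168.1.100:8786 below is a placeholder for your own scheduler's address), one dask-scheduler process and one or more dask-worker processes are started on the cluster, and a client connects to the scheduler from a Python session:

    # On the scheduler machine:   $ dask-scheduler
    # On each worker machine:     $ dask-worker 192.168.1.100:8786

    from dask.distributed import Client

    client = Client("192.168.1.100:8786")  # connect this Python session to the scheduler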

The scheduler is asynchronous and event driven, simultaneously responding to requests for computation from multiple clients and tracking the progress of multiple workers. This event-driven and asynchronous nature makes it flexible enough to handle a variety of workloads from multiple users concurrently while also handling a fluid worker population with failures and additions. Workers communicate amongst each other for bulk data transfer over TCP.

Internally the scheduler tracks all work as a constantly changing directed acyclic graph of tasks. A task is a Python function operating on Python objects, which can be the results of other tasks. This graph of tasks grows as users submit more computations, fills out as workers complete tasks, and shrinks as users leave or become disinterested in previous results.
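As an illustrative sketch of how this graph grows (the inc and add functions and the scheduler address below are assumptions for the example), futures returned by earlier submissions can be passed as arguments to later ones, and the scheduler records the dependencies between them:

    from dask.distributed import Client

    client = Client("192.168.1.100:8786")  # placeholder scheduler address

    def inc(x):
        return x + 1

    def add(x, y):
        return x + y

    a = client.submit(inc, 1)     # one task
    b = client.submit(inc, 2)     # an independent task
    c = client.submit(add, a, b)  # depends on a and b, extending the graph
    c.result()                    # blocks until the whole chain completes; returns 5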

Users interact by connecting a local Python session to the scheduler and submitting work, either through individual calls to the simple interface client.submit(function, *args, **kwargs) or by using the large data collections and parallel algorithms of the parent dask library. The collections in the dask library, like dask.array and dask.dataframe, provide easy access to sophisticated algorithms and familiar APIs like NumPy and Pandas, while the simple client.submit interface gives users custom control when they want to break out of canned “big data” abstractions and submit fully custom workloads.
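A brief sketch of the two entry points side by side (assuming dask.array is installed and that the scheduler address is, again, a placeholder); both routes share the same client and cluster:

    import dask.array as da
    from dask.distributed import Client

    client = Client("192.168.1.100:8786")  # placeholder scheduler address

    # Collection interface: a NumPy-like array whose chunks are computed on the cluster
    x = da.random.random((10000, 10000), chunks=(1000, 1000))
    mean = x.mean().compute()

    # Simple interface: submit an arbitrary Python function directly
    future = client.submit(sum, [1, 2, 3])
    total = future.result()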

Contents

Getting Started

Build Understanding

Additional Features

Developer Documentation