Resilience

Software fails, hardware fails, network connections fail, user code fails. This document describes how dask.distributed responds in the face of these failures and other known bugs.

User code failures

When a function raises an error, that error is kept and transmitted to the client on request. Any attempt to gather that result or any dependent result will raise that exception.

  >>> from operator import add  # `add` is assumed to be operator.add; `client` is an existing Client

  >>> def div(a, b):
  ...     return a / b

  >>> x = client.submit(div, 1, 0)
  >>> x.result()
  ZeroDivisionError: division by zero

  >>> y = client.submit(add, x, 10)
  >>> y.result()  # same error as above
  ZeroDivisionError: division by zero

This does not affect the smooth operation of the scheduler or worker in any way.
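If you would rather inspect a failure than have it re-raised, the future also carries the remote exception and traceback. A minimal sketch, reusing the x future from the example above:

  >>> import traceback

  >>> exc = x.exception()     # the stored ZeroDivisionError, returned rather than raised
  >>> tb = x.traceback()      # the remote traceback object
  >>> traceback.print_tb(tb)  # render it locally for debugging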

Closed Network Connections

If the connection to a remote worker unexpectedly closes and the local process appropriately raises an IOError, then the scheduler will reroute all pending computations to other workers.

If the lost worker was the only worker to hold vital results necessary for future computations, then those results will be recomputed by surviving workers. The scheduler maintains a full history of how each result was produced and so is able to reproduce those same computations on other workers.

This has a few failure cases:

  • If results depend on impure functions, then you may get a different (although still entirely accurate) result.
  • If the worker failed due to a bad function, for example a function that causes a segmentation fault, then that bad function will repeatedly be called on other workers. This function will be marked as “bad” after it kills a fixed number of workers (defaults to three).
  • Data sent out directly to the workers via a call to scatter() (instead of being created from a Dask task graph via other Dask functions) is not kept in the scheduler, as it is often quite large, and so the loss of this data is irreparable. You may wish to call Client.replicate on the data with a suitable replication factor to ensure that it remains long-lived, or else back the data up in some resilient store, like a file system. A sketch of this pattern follows the list below.
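For example, here is a minimal sketch of keeping scattered data alive, assuming a running cluster at a hypothetical scheduler address; the data and the replication factor n=3 are illustrative:

  >>> from distributed import Client

  >>> client = Client("tcp://scheduler-address:8786")    # hypothetical address

  >>> [data] = client.scatter([list(range(1_000_000))])  # the scheduler keeps no copy of this
  >>> client.replicate([data], n=3)                      # keep copies on three workers

  >>> # Results built from task graphs can be recomputed after a worker dies;
  >>> # scattered data cannot, so replication (or a resilient store) protects it
  >>> client.submit(len, data).result()
  1000000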

Hardware Failures

It is not clear under which circumstances the local process will know that the remote worker has closed the connection. If the socket does not close cleanly, then the system will wait for a timeout, roughly three seconds, before marking the worker as failed and resuming smooth operation.
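The relevant timeouts are configurable. A minimal sketch, assuming the standard distributed.comm.timeouts configuration keys; the value shown is illustrative rather than a recommendation:

  >>> import dask

  >>> dask.config.get("distributed.comm.timeouts")                   # inspect current timeouts
  >>> dask.config.set({"distributed.comm.timeouts.connect": "10s"})  # illustrative override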

Scheduler Failure

The process containing the scheduler might die. There is currently no persistence mechanism to record and recover the scheduler state.

The workers and clients will all reconnect to the scheduler after it comes back online, but records of ongoing computations will be lost.

Restart and Nanny Processes

The client provides a mechanism to restart all of the workers in the cluster. This is convenient if, during the course of experimentation, you find your workers in an inconvenient state that makes them unresponsive. The Client.restart method kills all workers, flushes all scheduler state, and then brings all workers back online, resulting in a clean cluster. This requires the nanny process (which is started by default).
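A minimal sketch of a restart on a local cluster, which starts nanny processes by default:

  >>> from distributed import Client

  >>> client = Client()   # local cluster; each worker runs under a nanny process
  >>> client.restart()    # kill all workers, flush scheduler state, bring workers back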