Limitations

Dask.distributed has limitations. Understanding these can help you to reliably create efficient distributed computations.

Performance

  • The central scheduler spends a few hundred microseconds on every task. For optimal performance, task durations should be greater than 10-100ms.
  • Dask can not parallelize within individual tasks. Individual tasks should be a comfortable size so as not to overwhelm any particular worker.
  • Dask assigns tasks to workers heuristically. It usually makes the right decision, but non-optimal situations do occur.
  • The workers are just Python processes, and inherit all capabilities and limitations of Python. They do not bound or limit themselves in any way. In production you may wish to run dask-workers within containers.
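Because the scheduler pays a fixed per-task cost, very small units of work are best grouped into fewer, larger tasks. The sketch below illustrates this batching idea in plain Python; the `client.submit` line is shown only as a comment, and `process_item` is a hypothetical stand-in for a cheap operation:

```python
# Sketch: amortizing per-task scheduler overhead by batching.
# Rather than submitting one task per tiny item (each paying scheduler
# overhead), group items into chunks so each task runs long enough
# (>= 10-100ms) for the overhead to be negligible.

def process_item(x):
    return x * 2  # hypothetical stand-in for a very cheap operation

def process_batch(items):
    # One task handles many items, amortizing the scheduler overhead.
    return [process_item(x) for x in items]

def chunk(seq, size):
    return [seq[i:i + size] for i in range(0, len(seq), size)]

data = list(range(10_000))
batches = chunk(data, 1_000)  # 10 tasks instead of 10,000
# With a Dask client this would be something like:
#   futures = [client.submit(process_batch, b) for b in batches]
results = [process_batch(b) for b in batches]  # run locally here
flat = [y for batch in results for y in batch]
```

The same trade-off applies to collections: choosing a larger chunk or partition size reduces scheduler load at the cost of coarser parallelism.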

Assumptions on Functions and Data

Dask assumes the following about your functions and your data:

  • All functions must be serializable either with pickle or cloudpickle. This is usually the case except in fairly exotic situations. The following should work:

        from cloudpickle import dumps, loads
        loads(dumps(my_object))
  • All data must be serializable either with pickle, cloudpickle, or using Dask’s custom serialization system.

  • Dask may run your functions multiple times, such as if a worker holding an intermediate result dies. Any side effects should be idempotent.
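Idempotent here means that running the task twice leaves the system in the same state as running it once. A minimal sketch of the idea, using a plain dictionary as a hypothetical result store (a real task might write to a database or file system instead):

```python
# Sketch: making a side effect idempotent.
# If Dask re-runs a task after a worker dies, writing the result under a
# key derived deterministically from the inputs means the rerun simply
# overwrites the same entry rather than duplicating work.
store = {}  # hypothetical stand-in for an external result store

def save_result(key, value):
    store[key] = value  # overwriting is safe; appending would not be

def task(x):
    result = x * x
    save_result(f"square-{x}", result)  # deterministic key, not a counter
    return result

# Simulate Dask running the same task three times:
for _ in range(3):
    task(3)
# store holds exactly one entry for x == 3
```

A non-idempotent version, such as appending each result to a list or incrementing a counter, would produce duplicated or inflated state when a task is retried.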

Security

As a distributed computing framework, Dask enables the remote execution of arbitrary code. You should only host dask-workers within networks that you trust. This is standard among distributed computing frameworks, but is worth repeating.