Publish Datasets

A published dataset is a named reference to a Dask collection or list of futures that has been published to the cluster. It is available for any client to see and persists beyond the scope of an individual session.

Publishing datasets is useful in the following cases:

  • You want to share computations with colleagues
  • You want to persist results on the cluster between interactive sessions

Motivating Example

In this example we load a dask.dataframe from S3, manipulate it, and then publish the result.

Connect and Load

   from dask.distributed import Client
   client = Client('scheduler-address:8786')

   import dask.dataframe as dd
   df = dd.read_csv('s3://my-bucket/*.csv')
   df2 = df[df.balance < 0]
   df2 = client.persist(df2)

   >>> df2.head()
         name  balance
   0    Alice     -100
   1      Bob     -200
   2  Charlie     -300
   3   Dennis     -400
   4    Edith     -500

Publish

To share this collection with a colleague, we publish it under the name 'negative_accounts':

   client.publish_dataset(negative_accounts=df2)
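
The same call also works for lists of futures and other Dask collections, as mentioned in the introduction. A minimal sketch; the function inc and the dataset name 'incremented' are illustrative placeholders, not part of the example above:

   # Publish a list of futures under a chosen name so other clients can fetch it.
   # 'inc' and 'incremented' are hypothetical.
   def inc(x):
       return x + 1

   futures = client.map(inc, range(100))        # futures living on the cluster
   client.publish_dataset(incremented=futures)  # retrievable with client.get_dataset('incremented')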

Load published dataset from different client

Now any other client can connect to the scheduler and retrieve this published dataset.

   >>> from dask.distributed import Client
   >>> client = Client('scheduler-address:8786')

   >>> client.list_datasets()
   ['negative_accounts']

   >>> df = client.get_dataset('negative_accounts')
   >>> df.head()
         name  balance
   0    Alice     -100
   1      Bob     -200
   2  Charlie     -300
   3   Dennis     -400
   4    Edith     -500

This allows users to easily share results. It also allows for the persistence of important and commonly used datasets beyond a single session. Published datasets continue to reside in distributed memory even after all clients requesting them have disconnected.

Dictionary interface

Alternatively you can use the .datasets mapping on the client to publish, list, get, and delete global datasets.

   >>> client.datasets['negative_accounts'] = df

   >>> list(client.datasets)
   ['negative_accounts']
   >>> df = client.datasets['negative_accounts']

This mapping is globally shared among all clients connected to the same scheduler.
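
Because client.datasets behaves like a mutable mapping, deleting a key should also unpublish the dataset, mirroring Client.unpublish_dataset. A short sketch continuing the session above:

   >>> # Deleting the key unpublishes the dataset
   >>> del client.datasets['negative_accounts']
   >>> list(client.datasets)
   []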

Notes

Published collections are not automatically persisted. If you publish an un-persisted collection then others will still be able to get the collection from the scheduler, but operations on that collection will start from scratch. This allows you to publish views on data that do not permanently take up cluster memory, but it can be surprising if you expect "publishing" to automatically make a computed dataset rapidly available.
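
If you do want the published dataset to be immediately available in memory, persist it before (or after) publishing, as in the motivating example above:

   # Persist so the computed partitions stay in cluster memory,
   # then publish the named reference for other clients to pick up.
   df2 = client.persist(df2)
   client.publish_dataset(negative_accounts=df2)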

Any client can publish or unpublish a dataset.

Publishing too many large datasets can quickly consume a cluster’s RAM.
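
When a published dataset is no longer needed, unpublish it so the scheduler drops its reference and the memory can be reclaimed once no client still holds the underlying data:

   # Remove the named reference; the data can be released once nothing else refers to it.
   client.unpublish_dataset('negative_accounts')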

API

Client.publish_dataset(self, *args, **kwargs)     Publish named datasets to scheduler
Client.list_datasets(self, **kwargs)              List named datasets available on the scheduler
Client.get_dataset(self, name, **kwargs)          Get named dataset from the scheduler
Client.unpublish_dataset(self, name, **kwargs)    Remove named datasets from scheduler