Joblib

Many Scikit-Learn algorithms are written for parallel execution using Joblib, which natively provides thread-based and process-based parallelism. Joblib is what backs the n_jobs= parameter in normal use of Scikit-Learn.
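
For context, this is the familiar single-machine form of that parallelism: passing n_jobs=-1 asks Joblib to use all local cores. A minimal sketch (RandomForestClassifier is just one of many estimators that expose n_jobs):

    from sklearn.datasets import load_digits
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_digits(return_X_y=True)

    # n_jobs=-1 tells Joblib to parallelize across all local cores
    clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
    clf.fit(X, y)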

Dask can scale these Joblib-backed algorithms out to a cluster of machines by providing an alternative Joblib backend.

To use the Dask backend to Joblib, create a Client and wrap your code with joblib.parallel_backend('dask'):

    from dask.distributed import Client
    import joblib

    client = Client(processes=False)             # create local cluster
    # client = Client("scheduler-address:8786")  # or connect to remote cluster

    with joblib.parallel_backend('dask'):
        ...  # your scikit-learn code goes here
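
If the training data is sizeable, shipping it to the workers with every task can dominate the runtime. The Dask backend also supports pre-scattering large arrays to the workers via a scatter= keyword; treat this keyword as an assumption to verify against your Dask version. A sketch:

    import numpy as np
    import joblib
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    # assumes a dask.distributed Client has already been created (see above)
    X = np.random.random((100_000, 20))
    y = np.random.randint(0, 2, 100_000)

    search = GridSearchCV(LogisticRegression(), {'C': [0.1, 1.0, 10.0]}, cv=3)

    # scatter= ships the listed arrays to the workers once up front, instead
    # of serializing them with every task (verify against your Dask version)
    with joblib.parallel_backend('dask', scatter=[X, y]):
        search.fit(X, y)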

As an example, you might distribute a randomized, cross-validated parameter search as follows:

    import numpy as np
    from dask.distributed import Client

    import joblib
    from sklearn.datasets import load_digits
    from sklearn.model_selection import RandomizedSearchCV
    from sklearn.svm import SVC

    client = Client(processes=False)  # create local cluster

    digits = load_digits()

    param_space = {
        'C': np.logspace(-6, 6, 13),
        'gamma': np.logspace(-8, 8, 17),
        'tol': np.logspace(-4, -1, 4),
        'class_weight': [None, 'balanced'],
    }

    model = SVC(kernel='rbf')
    search = RandomizedSearchCV(model, param_space, cv=3, n_iter=50, verbose=10)

    with joblib.parallel_backend('dask'):
        search.fit(digits.data, digits.target)
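
Once the fit completes, the results are available through the usual scikit-learn attributes:

    # inspect the best hyperparameters and cross-validated score
    print(search.best_params_)
    print(search.best_score_)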

Note that the Dask joblib backend is useful for scaling out CPU-bound workloads; that is, workloads with datasets that fit in RAM but involve many individual operations that can be done in parallel. To scale out to RAM-bound workloads (larger-than-memory datasets), use one of the following alternatives: