Dask-ML

Dask-ML provides scalable machine learning in Python using Dask alongside popular machine learning libraries like Scikit-Learn.

You can try Dask-ML on a small cloud instance via mybinder.org.

```python
import dask.dataframe as dd
df = dd.read_parquet('...')
data = df[['age', 'income', 'married']]
labels = df['outcome']

from dask_ml.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(data, labels)
```

What does this offer?

The overarching goal of Dask-ML is to enable scalable machine learning. See the navigation pane to the left for details on specific features.

How does this work?

Modern machine learning algorithms employ a wide variety of techniques. Scaling these requires a similarly wide variety of different approaches. Generally, solutions fall into the following three categories:

Parallelize Scikit-Learn Directly

Scikit-Learn already provides parallel computing on a single machine with Joblib. Dask extends this parallelism to many machines in a cluster. This works well for modest data sizes and large computations, such as random forests, hyper-parameter optimization, and more.

```python
from dask.distributed import Client
import joblib
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

client = Client()  # Connect to a Dask cluster

with joblib.parallel_backend('dask'):
    # Your normal scikit-learn code here, e.g. an illustrative grid search:
    X, y = make_classification(n_samples=1000, random_state=0)
    GridSearchCV(SVC(), {'C': [0.1, 1.0, 10.0]}).fit(X, y)
```

See Dask-ML Joblib documentation for more information.

Note that this is an active collaboration with the Scikit-Learn development team. This functionality is progressing quickly but is in a state of rapid change.

Reimplement Scalable Algorithms with Dask Array

Some machine learning algorithms are easy to write down as NumPy algorithms. In these cases we can replace NumPy arrays with Dask arrays to achieve scalable algorithms easily. This approach is employed for linear models, pre-processing, and clustering.

```python
from sklearn.pipeline import make_pipeline
from dask_ml.preprocessing import Categorizer, DummyEncoder
from dask_ml.linear_model import LogisticRegression

# Chain the preprocessing steps with the estimator in a pipeline
pipe = make_pipeline(Categorizer(), DummyEncoder(), LogisticRegression())
pipe.fit(data, labels)
```

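The substitution described above can be sketched with plain `dask.array`, independent of Dask-ML itself: the same NumPy-style code runs blockwise on a Dask array and produces the same result.

```python
import numpy as np
import dask.array as da

# Deterministic data so the NumPy and Dask results match exactly.
x = np.arange(12.0).reshape(3, 4)
dx = da.from_array(x, chunks=(1, 4))  # split into three row blocks

np_mean = x.mean(axis=0)             # eager, in-memory
da_mean = dx.mean(axis=0).compute()  # lazy, computed block by block
```

Because the Dask array API mirrors NumPy's, an algorithm written against one often works against the other with no changes.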
Partner with Other Distributed Libraries

Other machine learning libraries like XGBoost and TensorFlow already have distributed solutions that work quite well. Dask-ML makes no attempt to re-implement these systems. Instead, Dask-ML makes it easy to use normal Dask workflows to prepare and set up data, then it deploys XGBoost or TensorFlow alongside Dask and hands the data over.

```python
from dask_ml.xgboost import XGBRegressor

est = XGBRegressor(...)
est.fit(train, train_labels)
```

See Dask-ML + XGBoost or Dask-ML + TensorFlow documentation for more information.

Scikit-Learn API

In all cases Dask-ML endeavors to provide a single unified interface around the familiar NumPy, Pandas, and Scikit-Learn APIs. Users familiar with Scikit-Learn should feel at home with Dask-ML.
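As a sketch of that shared interface, the Scikit-Learn estimator contract looks like this (plain Scikit-Learn shown on a tiny illustrative dataset; the corresponding Dask-ML classes mirror these method names while accepting Dask arrays and dataframes):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny illustrative dataset.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

# The shared contract: fit, then predict / score.
clf = LogisticRegression().fit(X, y)
preds = clf.predict(X)
acc = clf.score(X, y)
```

Because the method names and semantics match, swapping `sklearn.linear_model` for `dask_ml.linear_model` typically requires no other code changes.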