Comparison to Spark

Apache Spark is a popular distributed computing tool for tabular datasets that is growing to become a dominant name in Big Data analysis today. Dask has several elements that appear to intersect this space and we are often asked, “How does Dask compare with Spark?”

Answering such comparison questions in an unbiased and informed way is hard, particularly when the differences can be somewhat technical. This document tries to do this; we welcome any corrections.

Summary

Generally Dask is smaller and lighter weight than Spark. This means that it has fewer features and, instead, is used in conjunction with other libraries, particularly those in the numeric Python ecosystem. It couples with libraries like Pandas or Scikit-Learn to achieve high-level functionality.

Language

  • Spark is written in Scala with some support for Python and R. It interoperates well with other JVM code.
  • Dask is written in Python and only really supports Python. It interoperates well with C/C++/Fortran/LLVM or other natively compiled code linked through Python.

Ecosystem

  • Spark is an all-in-one project that has inspired its own ecosystem. It integrates well with many other Apache projects.
  • Dask is a component of the larger Python ecosystem. It couples with and enhances other libraries like NumPy, Pandas, and Scikit-Learn.

Age and Trust

  • Spark is older (since 2010) and has become a dominant and well-trusted tool in the Big Data enterprise world.
  • Dask is younger (since 2014) and is an extension of the well-trusted NumPy/Pandas/Scikit-Learn/Jupyter stack.

Scope

  • Spark is more focused on traditional business intelligence operations like SQL and lightweight machine learning.
  • Dask is applied more generally, both to business intelligence applications and to a number of scientific and custom situations.

Internal Design

  • Spark’s internal model is higher level, providing good high-level optimizations on uniformly applied computations, but lacking flexibility for more complex algorithms or ad-hoc systems. It is fundamentally an extension of the Map-Shuffle-Reduce paradigm.
  • Dask’s internal model is lower level, and so lacks high-level optimizations, but is able to implement more sophisticated algorithms and build more complex bespoke systems. It is fundamentally based on generic task scheduling; a small sketch of this model follows this list.
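
As a rough illustration of the task-scheduling model, dask.delayed lets you build an arbitrary graph of plain Python function calls; the functions below are placeholders, not a recipe for any particular workload:

    import dask

    @dask.delayed
    def inc(x):
        return x + 1

    @dask.delayed
    def double(x):
        return 2 * x

    @dask.delayed
    def add(x, y):
        return x + y

    # Build an irregular graph of tasks that does not fit a clean
    # map-shuffle-reduce pattern; nothing runs until .compute()
    data = [1, 2, 3, 4]
    results = [add(inc(x), double(x)) for x in data]
    total = dask.delayed(sum)(results)

    print(total.compute())  # the scheduler walks the task graph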

Scale

  • Spark scales from a single node to thousand-node clusters.
  • Dask scales from a single node to thousand-node clusters.

APIs

DataFrames

  • Spark DataFrame has its own API and memory model. It also implements a large subset of the SQL language. Spark includes a high-level query optimizer for complex queries.
  • Dask DataFrame reuses the Pandas API and memory model. It implements neither SQL nor a query optimizer. It is able to do random access, efficient time series operations, and other Pandas-style indexed operations, sketched after this list.
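
For example, a Dask DataFrame is used much like a Pandas DataFrame; the file pattern and column names below are hypothetical:

    import dask.dataframe as dd

    # One logical DataFrame backed by many Pandas partitions
    df = dd.read_csv("data/2024-*.csv", parse_dates=["timestamp"])

    # Familiar Pandas-style indexed and time-series operations, evaluated lazily
    daily = df.set_index("timestamp")["value"].resample("1D").mean()

    print(daily.compute())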

Machine Learning

  • Spark MLLib is a cohesive project with support for common operations that are easy to implement with Spark’s Map-Shuffle-Reduce style system. People considering MLLib might also want to consider other JVM-based machine learning libraries like H2O, which may have better performance.
  • Dask relies on and interoperates with existing libraries like Scikit-Learn and XGBoost. These can be more familiar or higher performance, but generally result in a less-cohesive whole. See the dask-ml project for integrations, and the sketch after this list.
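
One common integration pattern, sketched below under the assumption of a local cluster, is to train an ordinary Scikit-Learn estimator while routing its internal parallelism to Dask workers through Joblib:

    import joblib
    from dask.distributed import Client
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    client = Client()  # assumption: a local cluster; point at a real one in practice

    X, y = make_classification(n_samples=10_000, n_features=20)
    clf = RandomForestClassifier(n_estimators=200)

    # Scikit-Learn's internal parallelism is farmed out to Dask workers
    with joblib.parallel_backend("dask"):
        clf.fit(X, y)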

Arrays

  • Spark does not include support for multi-dimensional arrays natively (this would be challenging given their computation model), although some support for two-dimensional matrices may be found in MLLib. People may also want to look at the Thunder project, which combines Apache Spark with NumPy arrays.
  • Dask fully supports the NumPy model for scalable multi-dimensional arrays; a small sketch follows this list.
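
A minimal sketch of the dask.array interface, which mirrors NumPy while splitting the data into chunks:

    import dask.array as da

    # A large 2-D array split into 1000 x 1000 NumPy chunks
    x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))

    # NumPy-style expressions build a lazy task graph over the chunks
    y = (x + x.T).mean(axis=0)

    print(y.compute())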

Streaming

  • Spark’s support for streaming data is first-class and integrates well into their other APIs. It follows a mini-batch approach. This provides decent performance on large uniform streaming operations.
  • Dask provides a real-time futures interface that is lower-level than Spark streaming. This enables more creative and complex use cases, but requires more work than Spark streaming; a small sketch follows this list.
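
A minimal sketch of the futures interface; the process function and the local Client are placeholders:

    from dask.distributed import Client, as_completed

    client = Client()  # assumption: a local scheduler and workers

    def process(record):
        # placeholder per-record work
        return record * 2

    # Submit work as it arrives and react to results as they finish
    futures = [client.submit(process, record) for record in range(100)]
    for future in as_completed(futures):
        result = future.result()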

Graphs / complex networks

  • Spark provides GraphX, a library for graph processing.
  • Dask provides no such library.

Custom parallelism

  • Spark generally expects users to compose computations out of their high-level primitives (map, reduce, groupby, join, …). It is also possible to extend Spark through subclassing RDDs, although this is rarely done.
  • Dask allows you to specify arbitrary task graphs for more complex and custom systems that are not part of the standard set of collections; a small sketch follows this list.
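
At the lowest level a Dask graph is just a dictionary of keys mapping to tasks; the sketch below runs a hand-written graph on the threaded scheduler, with placeholder functions standing in for real work:

    from dask.threaded import get

    def load(name):
        return "data from " + name

    def clean(data):
        return data.upper()

    def combine(*parts):
        return "\n".join(parts)

    # Keys map either to values or to (function, arg, ...) tuples
    dsk = {
        "a": (load, "file-a"),
        "b": (load, "file-b"),
        "clean-a": (clean, "a"),
        "clean-b": (clean, "b"),
        "result": (combine, "clean-a", "clean-b"),
    }

    print(get(dsk, "result"))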

Reasons you might choose Spark

  • You prefer Scala or the SQL language
  • You have mostly JVM infrastructure and legacy systems
  • You want an established and trusted solution for business
  • You are mostly doing business analytics with some lightweight machine learning
  • You want an all-in-one solution

Reasons you might choose Dask

  • You prefer Python or native code, or have large legacy code bases that you do not want to entirely rewrite
  • Your use case is complex or does not cleanly fit the Spark computing model
  • You want a lighter-weight transition from local computing to cluster computing
  • You want to interoperate with other technologies and don’t mind installing multiple packages

Reasons to choose both

It is easy to use both Dask and Spark on the same data and on the same cluster.

They can both read and write common formats, like CSV, JSON, ORC, and Parquet, making it easy to hand results off between Dask and Spark workflows.
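
For example, a Dask job can read Parquet written by Spark and write results back for Spark to consume; the paths and column names here are hypothetical:

    import dask.dataframe as dd

    # Read a Parquet dataset produced by a Spark job
    events = dd.read_parquet("s3://bucket/events/")

    # Summarize with Dask and hand the result back as Parquet
    summary = events.groupby("user_id").agg({"amount": "sum"})
    summary.to_parquet("s3://bucket/summaries/")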

They can both deploy on the same clusters. Most clusters are designed to support many different distributed systems at the same time, using resource managers like Kubernetes and YARN. If you already have a cluster on which you run Spark workloads, it’s likely easy to also run Dask workloads on your current infrastructure and vice versa.

In particular, if you are coming from a traditional Hadoop/Spark cluster (such as those sold by Cloudera/Hortonworks), you are likely using the YARN resource manager. You can deploy Dask on these systems using the Dask Yarn project, as well as other projects, like JupyterHub on Hadoop.
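
A rough sketch of starting Dask on such a cluster with the Dask Yarn project; the packaged environment name and resource numbers below are assumptions:

    from dask_yarn import YarnCluster
    from dask.distributed import Client

    # Assumes a relocatable conda environment has been packaged for the workers
    cluster = YarnCluster(environment="environment.tar.gz",
                          worker_vcores=2,
                          worker_memory="4GiB")
    cluster.scale(10)          # ask YARN for ten workers
    client = Client(cluster)   # connect as usual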

Developer-Facing Differences

Graph Granularity

Both Spark and Dask represent computations with directed acyclic graphs. These graphs, however, represent computations at very different granularities.

One operation on a Spark RDD might add a node like Map and Filter to the graph. These are high-level operations that convey meaning and will eventually be turned into many little tasks to execute on individual workers. This many-little-tasks state is only available internally to the Spark scheduler.

Dask graphs skip this high-level representation and go directly to the many-little-tasks stage. As such, one map operation on a Dask collection will immediately generate and add possibly thousands of tiny tasks to the Dask graph.
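
You can see this granularity directly by counting the tasks behind a single expression; exact counts vary with the Dask version and the chunking you choose:

    import dask.array as da

    x = da.ones((10_000, 10_000), chunks=(1_000, 1_000))  # 10 x 10 = 100 chunks
    y = (x + 1).sum()

    # Even this one-liner already expands into hundreds of small tasks
    print(len(dict(y.__dask_graph__())))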

This difference in the scale of the underlying graph has implications for the kinds of analysis and optimizations one can do and also for the generality that one exposes to users. Dask is unable to perform some optimizations that Spark can because Dask schedulers do not have a top-down picture of the computation they were asked to perform. However, Dask is able to easily represent far more complex algorithms and expose the creation of these algorithms to normal users.

Conclusion

  • Spark is mature and all-inclusive. If you want a single project that does everything and you’re already on Big Data hardware, then Spark is a safe bet, especially if your use cases are typical ETL + SQL and you’re already using Scala.
  • Dask is lighter weight and is easier to integrate into existing code and hardware. If your problems vary beyond typical ETL + SQL and you want to add flexible parallelism to existing solutions, then Dask may be a good fit, especially if you are already using Python and associated libraries like NumPy and Pandas.

If you are looking to manage a terabyte or less of tabular CSV or JSON data, then you should forget both Spark and Dask and use Postgres or MongoDB.