Institutional FAQ

Question: Is Dask appropriate for adoption within a larger institutional context?

Answer: Yes. Dask is used within the world’s largest banks, national labs, retailers, technology companies, and government agencies. It is used in highly secure environments. It is used in conservative institutions as well as fast-moving ones.

This page contains Frequently Asked Questions and concerns from institutions when they first investigate Dask.

For Management

Briefly, what problem does Dask solve for us?

Dask is a general purpose parallel programming solution. As such it is used in many different ways.

However, the most common problem that Dask solves is connecting Python analysts to distributed hardware, particularly for data science and machine learning workloads. The institutions for whom Dask has the greatest impact are those who have a large body of Python users who are accustomed to libraries like NumPy, Pandas, Jupyter, Scikit-Learn, and others, but want to scale those workloads across a cluster. Often they also have distributed computing resources that are going underused.

Dask removes both technological and cultural barriers to connect Python users to computing resources in a way that is native to both the users and IT.

“Help me scale my notebook onto the cluster” is a common pain point for institutions today, and it is a common entry point for Dask usage.

Is Dask mature? Why should we trust it?

Yes. While Dask itself is relatively new (it began in 2015) it is built by the NumPy, Pandas, Jupyter, and Scikit-Learn developer community, which is well trusted. Dask is a relatively thin wrapper on top of these libraries and, as a result, the project can be relatively small and simple. It doesn’t reinvent a whole new system.

Additionally, this tight integration with the broader technology stack gives substantial benefits long term. For example:

  • Because Pandas maintainers also maintain Dask, when Pandas issues a new release, Dask issues a release at the same time to ensure continuity and compatibility.
  • Because Scikit-Learn maintainers maintain and use Dask when they train on large clusters, you can be assured that Dask-ML focuses on pragmatic and important solutions like XGBoost integration and hyper-parameter selection, and that the integration between the two feels natural for novice and expert users alike.
  • Because Jupyter maintainers also maintain Dask, powerful Jupyter technologies like JupyterHub and JupyterLab are designed with Dask’s needs in mind, and new features are pushed quickly to provide a first-class and modern user experience.

Additionally, Dask is maintained both by a broad community of maintainers and by substantial institutional support (several full-time employees each) from both Anaconda, the company behind the leading data science distribution, and NVIDIA, the leading hardware manufacturer of GPUs. Despite this large corporate support, Dask remains a community-governed project, and is fiscally sponsored by NumFOCUS, the same 501c3 that fiscally sponsors NumPy, Pandas, Jupyter, and many others.

Who else uses Dask?

Dask is used by individual researchers in practically every field today. It has millions of downloads per month and is integrated into many PyData software packages.

On an institutional level Dask is used by analytics and research groups in a similarly broad set of domains, across both energetic startups and large, conservative household names. A web search shows articles by Capital One, Barclays, Walmart, NASA, Los Alamos National Laboratories, and hundreds of other similar institutions.

How does Dask compare with Apache Spark?

This question has longer and more technical coverage here.

Dask and Apache Spark are similar in that they both …

  • Promise easy parallelism for data science Python users
  • Provide Dataframe and ML APIs for ETL, data science, and machine learning
  • Scale out to similar scales, around 1-1000 machines

Dask differs from Apache Spark in a few ways:

  • Dask is more Python native; Spark is Scala/JVM native with Python bindings.

Python users may find Dask more comfortable, but Dask is only useful for Python users, while Spark can also be used from JVM languages.

  • Dask is one component in the broader Python ecosystem alongside libraries like Numpy, Pandas, and Scikit-Learn, while Spark is an all-in-one system that re-invents much of the Python world in a single package.

This means that it’s often easier to compose Dask with new problem domains, but also that you need to install multiple things (like Dask and Pandas or Dask and Numpy) rather than just having everything in an all-in-one solution.

  • Apache Spark focuses strongly on traditional business intelligence workloads, like ETL, SQL queries, and then some lightweight machine learning, while Dask is more general purpose.

This means that Dask is much more flexible and can handle other problem domains like multi-dimensional arrays, GIS, advanced machine learning, and custom systems, but that it is less focused on and less tuned for typical SQL-style computations.

If you mostly want to focus on SQL queries then Spark is probably a better bet. If you want to support a wide variety of custom workloads then Dask might be more natural.

For IT

How would I set up Dask on institutional hardware?

You already have cluster resources. Dask can run on them today without significant change.

Most institutional clusters today have a resource manager. This is typically managed by IT, with some mild permissions given to users to launch jobs. Dask works with all major resource managers today, including those on Hadoop, HPC, Kubernetes, and Cloud clusters.

  • Hadoop/Spark: If you have a Hadoop/Spark cluster, such as one purchased through Cloudera/Hortonworks/MapR, then you will likely want to deploy Dask with YARN, the resource manager that deploys services like Hadoop, Spark, Hive, and others.

To help with this, you’ll likely want to use Dask-Yarn.

  • HPC: If you have an HPC machine that runs resource managers like SGE, SLURM, PBS, LSF, Torque, Condor, or other job batch queuing systems, then users can launch Dask on these systems today using either:

    • Dask-Jobqueue, which uses typical qsub/sbatch/bsub or other submission tools in interactive settings (a minimal sketch follows this list)
    • Dask-MPI, which uses MPI for deployment in batch settings

  • Kubernetes/Cloud: Newer clusters may employ Kubernetes for deployment. This is particularly common today on major cloud providers, all of which provide hosted Kubernetes as a service. People today use Dask on Kubernetes using either of the following:

    • Helm: an easy way to stand up a long-running Dask cluster and Jupyter notebook
    • Dask-Kubernetes: for native Kubernetes integration for fast-moving or ephemeral deployments. For more information see Kubernetes.
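
As a concrete illustration, a user-level deployment on an HPC cluster running SLURM with Dask-Jobqueue might look like the following minimal sketch; the queue name, core counts, memory, and walltime are illustrative assumptions, not recommendations:

    from dask.distributed import Client
    from dask_jobqueue import SLURMCluster

    # Describe what a single SLURM job should look like (values are illustrative)
    cluster = SLURMCluster(
        queue="regular",
        cores=24,
        memory="120 GB",
        walltime="02:00:00",
    )

    # Ask SLURM for ten such jobs; each job runs Dask workers
    cluster.scale(jobs=10)

    # Connect a client and use Dask as usual
    client = Client(cluster)

Equivalent cluster classes exist for the other systems mentioned above (for example PBSCluster or SGECluster in Dask-Jobqueue, and YarnCluster in Dask-Yarn), so the pattern stays the same across resource managers.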

Is Dask secure?

Dask is deployed today within highly secure institutions, including major financial, healthcare, and government agencies.

That being said, it’s worth noting that, by its very nature, Dask enables the execution of arbitrary user code on a large set of machines. Care should be taken to isolate, authenticate, and govern access to these machines. Fortunately, your institution likely already does this and uses standard technologies like SSL/TLS, Kerberos, and other systems with which Dask can integrate.

Do I need to purchase a new cluster?

No. It is easy to run Dask today on most clusters. If you have a pre-existing HPC or Spark/Hadoop cluster then that will be fine to start running Dask.

You can start using Dask without any capital expenditure.

How do I manage users?

Dask doesn’t manage users; you likely have existing systems that do this well. In a large institutional setting we assume that you already have a resource manager like Yarn (Hadoop), Kubernetes, or PBS/SLURM/SGE/LSF/…, each of which has excellent user management capabilities, which are likely preferred by your IT department anyway.

Dask is designed to operate with user-level permissions, which means that your data science users should be able to ask those systems mentioned above for resources and have their processes tracked accordingly.

However, there are institutions where analyst-level users aren’t given direct access to the cluster. This is particularly common in Cloudera/Hortonworks Hadoop/Spark deployments. In these cases some level of explicit indirection may be required. For this, we recommend the Dask Gateway project, which uses IT-level permissions to properly route authenticated users into secure resources.
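
For illustration only, here is a minimal sketch of how an analyst might connect through a Dask Gateway that IT has already deployed; the gateway address is a placeholder:

    from dask_gateway import Gateway

    # Connect to the institution's gateway endpoint (address is hypothetical)
    gateway = Gateway("https://gateway.example.com")

    # Ask the gateway for a cluster; workers launch with IT-managed permissions
    cluster = gateway.new_cluster()
    cluster.scale(10)

    # Get a client and submit work as usual
    client = cluster.get_client()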

How do I manage software environments?

This depends on your cluster resource manager:

  • Most HPC users use their network file system
  • Hadoop/Spark/Yarn users package their environment into a tarball and ship it around with HDFS (Dask-Yarn integrates with Conda Pack for this capability)
  • Kubernetes or Cloud users use Docker images

In each case Dask integrates with existing processes and technologies that are well understood and familiar to the institution.
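
As a rough sketch of the Hadoop/YARN case, an analyst might package their conda environment with Conda-Pack and point Dask-Yarn at the resulting tarball; the worker sizes below are illustrative assumptions:

    import conda_pack
    from dask.distributed import Client
    from dask_yarn import YarnCluster

    # Package the current conda environment into a relocatable tarball
    conda_pack.pack(output="environment.tar.gz")

    # Ask YARN for containers that unpack and use that environment
    cluster = YarnCluster(
        environment="environment.tar.gz",
        worker_vcores=2,
        worker_memory="4GiB",
    )
    cluster.scale(8)

    client = Client(cluster)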

How does Dask communicate data between machines?

Dask usually communicates over TCP, using msgpack for small administrative messages, and its own protocol for efficiently passing around large data. The scheduler and each worker host their own TCP server, making Dask a distributed peer-to-peer network that uses point-to-point communication. We do not use Spark-style shuffle systems. We do not use MPI-style collectives. Everything is direct point-to-point.

For high performance networks you can use either TCP-over-Infiniband for about 1 GB/s bandwidth, or UCX (experimental) for full-speed communication.
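
In day-to-day use none of this needs configuring; users simply point a client at the scheduler’s address, with the protocol encoded in the address. The host and port below are placeholders:

    from dask.distributed import Client

    # Connect to a running scheduler over plain TCP
    client = Client("tcp://scheduler.example.com:8786")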

Are deployments long running, or ephemeral?

We see both, but ephemeral deployments are more common.

Most Dask use today is about enabling data science or data engineering users to scale their interactive workloads across the cluster. These are typically either interactive sessions with Jupyter, or batch scripts that run at a pre-defined time. In both cases, the user asks the resource manager for a bunch of machines, does some work, and then gives up those machines.

Some institutions also use Dask in an always-on fashion, either handling real-time traffic in a scalable way, or responding to a broad set of interactive users with large datasets that it keeps resident in memory.

For Technical Leads

Will Dask “just work” on our existing code?

No, you will need to make modifications, but these modifications are usually small.

The vast majority of lines of business logic within your institution will not have to change, assuming that they are in Python and use tooling like Numpy, Pandas, and Scikit-Learn.
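
As a hedged illustration of what a small modification typically looks like, a Pandas script often changes little more than its import and adds an explicit compute step; the file paths and column names below are placeholders:

    # Before: Pandas on a single machine
    # import pandas as pd
    # df = pd.read_csv("data/2024-01.csv")
    # result = df.groupby("account")["amount"].sum()

    # After: the same logic with Dask DataFrame
    import dask.dataframe as dd

    df = dd.read_csv("data/2024-*.csv")             # lazily reads many files in parallel
    result = df.groupby("account")["amount"].sum()  # same Pandas-style API
    print(result.compute())                         # explicit compute triggers execution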

How well does Dask scale? What are Dask’s limitations?

The largest Dask deployments that we see today are on around 1000 multi-core machines, perhaps 20,000 cores in total, but these are rare. Most institutional-level problems (1-100 TB) are well solved by deployments of 10-50 nodes.

Technically, the back-of-the-envelope number to keep in mind is that each task (an individual Python function call) in Dask has an overhead of around 200 microseconds. So if these tasks take 1 second each, then Dask can saturate around 5000 cores before scheduling overhead dominates costs. As workloads reach this limit they are encouraged to use larger chunk sizes to compensate. The vast majority of institutional users, though, do not reach this limit. For more information you may want to peruse our best practices.

Is Dask resilient? What happens when a machine goes down?

Yes, Dask is resilient to the failure of worker nodes. It knows how it came to any result, and can replay the necessary work on other machines if one goes down.

If Dask’s centralized scheduler goes down then you would need to resubmit the computation. This is a fairly standard level of resiliency today, shared with other tooling like Apache Spark, Flink, and others.

The resource managers that host Dask, like Yarn or Kubernetes, typically provide long-term 24/7 resilience for always-on operation.

Is the API exactly the same as NumPy/Pandas/Scikit-Learn?

No, but it’s very close. That being said, your data scientists will still have to learn some things.

What we find is that the Numpy/Pandas/Scikit-Learn APIs aren’t the challenge when institutions adopt Dask. When API inconsistencies do exist, even modestly skilled programmers are able to understand why and work around them without much pain.

Instead, the challenge is building intuition around parallel performance. We’ve all built up a mental model for what is fast and slow on a single machine. This model changes when we factor in network communication and parallel algorithms, and the performance that we get for familiar operations can be surprising.

Our main solution to build this intuition, other than accumulated experience, is Dask’s Diagnostic Dashboard. The dashboard delivers a ton of visual feedback to users as they are running their computation to help them understand what is going on. This both helps them identify and resolve immediate bottlenecks, and also builds up that parallel performance intuition surprisingly quickly.
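
The dashboard comes up automatically alongside the scheduler, and every client exposes its address, which users can open in a browser:

    from dask.distributed import Client

    client = Client()             # start or connect to a cluster
    print(client.dashboard_link)  # e.g. http://127.0.0.1:8787/status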

How much performance tuning does Dask require?

Some other systems are notoriously hard to tune for optimal performance. What is Dask’s story here? How many knobs are there that we need to be aware of?

Like the rest of the Python software ecosystem, Dask puts a lot of effort into having sane defaults. Dask workers automatically detect available memory and cores, and choose sensible defaults that are decent in most situations. Dask algorithms similarly provide decent choices by default, and informative warnings when tricky situations arise, so that, in common cases, things should be fine.

The most common knobs to tune include the following:

  • The thread/process mixture to deal with GIL-holding computations (which are rare in Numpy/Pandas/Scikit-Learn workflows)
  • Partition size, such as whether you should have 100 MB chunks or 1 GB chunks
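
Both knobs are exposed as ordinary keyword arguments. A minimal sketch, with values that are illustrative rather than recommendations:

    import dask.dataframe as dd
    from dask.distributed import Client

    # Thread/process mixture: more processes help GIL-holding pure-Python code,
    # more threads help Numpy/Pandas code that releases the GIL
    client = Client(n_workers=4, threads_per_worker=2)

    # Partition size: control how much data each task handles
    df = dd.read_csv("data/*.csv", blocksize="100MB")
    df = df.repartition(npartitions=50)  # coarsen or refine partitions later if needed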

That being said, almost no institution’s needs are met entirely by the common case, and given the variety of problems that people throw at Dask, exceptional problems are commonplace. In these cases we recommend watching the dashboard during execution to see what is going on. It can commonly inform you what’s going wrong, so that you can make changes to your system.

What data formats does Dask support?

Because Dask builds on NumPy and Pandas, it supports most formats that they support, which is most formats. That being said, not all formats are well suited for parallel access. In general, people using the following formats are usually pretty happy:

  • Tabular: Parquet, ORC, CSV, Line Delimited JSON, Avro, text
  • Arrays: HDF5, NetCDF, Zarr, GRIB

More generally, if you have a Python function that turns a chunk of your stored data into a Pandas dataframe or Numpy array, then Dask can probably call that function many times without much effort.

For groups looking for advice on which formats to use, we recommend Parquet for tables and Zarr or HDF5 for arrays.
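
A minimal sketch of reading these recommended formats; the paths are placeholders:

    import dask.array as da
    import dask.dataframe as dd

    # Tabular data: a directory of Parquet files becomes one Dask DataFrame
    df = dd.read_parquet("data/transactions/")

    # Array data: a Zarr store becomes one chunked Dask array
    x = da.from_zarr("data/temperature.zarr")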

Does Dask have a SQL interface?

No. Dask provides no SQL support. Dask dataframe looks like and uses Pandas for these sorts of operations. It would be great to see someone build a SQL interface on top of Pandas, which Dask could then use, but this is out of scope for the core Dask project itself.

As with Pandas, though, we do support a dask.dataframe.read_sql_table function for efficiently pulling data out of SQL databases for dataframe computations.
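
A hedged sketch of pulling a table in parallel; the connection string, table name, and index column are placeholders. Dask partitions the reads along the index column:

    import dask.dataframe as dd

    df = dd.read_sql_table(
        "transactions",                         # table name
        "postgresql://user:pass@host/dbname",   # SQLAlchemy connection string
        index_col="id",                         # ordered column used to partition the reads
        npartitions=16,
    )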