Frequently Asked Questions

More questions can be found on StackOverflow at http://stackoverflow.com/search?tab=votes&q=dask%20distributed

How do I use external modules?

Use client.upload_file. For more detail, see the API docs and the StackOverflow question “Can I use functions imported from .py files in Dask/Distributed?” This function supports both standalone files and setuptools’s .egg files for larger modules.

Too many open file descriptors?

Your operating system imposes a limit on how many open files or open network connections any user can have at once. Depending on the scale of your cluster, the dask-scheduler may run into this limit.

By default most Linux distributions set this limit at 1024 open files/connections and OS-X at 128 or 256. Each worker adds a few open connections to a running scheduler (somewhere between one and ten, depending on how contentious things get).

If you are on a managed cluster you can usually ask whoever manages your cluster to increase this limit. If you have root access and know what you are doing, you can change the limits on Linux by editing /etc/security/limits.conf. Instructions are here under the heading “User Level FD Limits”: http://www.cyberciti.biz/faq/linux-increase-the-maximum-number-of-open-files/
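To see where you stand, check the current per-process limit; the limits.conf entries shown in the comments are illustrative values for a hypothetical user, not recommendations:

```shell
# Show the per-process soft limit on open file descriptors
ulimit -n

# Example entries for /etc/security/limits.conf (requires root):
#   daskuser  soft  nofile  16384
#   daskuser  hard  nofile  32768
```

After editing limits.conf you typically need to log out and back in for the new limits to take effect.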

Error when running dask-worker about OMP_NUM_THREADS

For more problems with OMP_NUM_THREADS, see http://stackoverflow.com/questions/39422092/error-with-omp-num-threads-when-using-dask-distributed

Does Dask handle Data Locality?

Yes, both data locality in memory and data locality on disk.

Often it’s much cheaper to move computations to where data lives. If one of your tasks creates a large array and a future task computes the sum of that array, you want to be sure that the sum runs on the same worker that has the array in the first place; otherwise you’ll wait for a long while as the data moves between workers. Needless communication can easily dominate costs if we’re sloppy.

The Dask Scheduler tracks the location and size of every intermediate value produced by every worker and uses this information when assigning future tasks to workers. Dask tries to make computations more efficient by minimizing data movement.

Sometimes your data is on a hard drive or other remote storage that isn’t controlled by Dask. In this case the scheduler is unaware of exactly where your data lives, so you have to do a bit more work. You can tell Dask to preferentially run a task on a particular worker or set of workers.
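One way to do this is the workers= keyword of client.submit, which restricts where a task may run. The sketch below uses a throwaway local cluster and simply picks the first worker; in practice you would pass the address of the machine that holds the data:

```python
from dask.distributed import Client

client = Client()  # local cluster for demonstration
# Pick one worker's address; in practice this would be the machine
# that actually holds the data
worker_addr = list(client.scheduler_info()["workers"])[0]

# Ask the scheduler to run this task only on that worker
result = client.submit(sum, [1, 2, 3], workers=[worker_addr]).result()
print(result)
client.close()
```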

For example Dask developers use this ability to build in data locality when we communicate with data-local storage systems like the Hadoop File System. When users use high-level functions like dask.dataframe.read_csv('hdfs:///path/to/files.*.csv'), Dask talks to the HDFS name node, finds the locations of all of the blocks of data, and sends that information to the scheduler so that it can make smarter decisions and improve load times for users.

PermissionError [Errno 13] Permission Denied: /root/.dask

This error can be seen when starting distributed through the standard process control tool supervisor and running as a non-root user. This is caused by supervisor not passing the shell environment variables through to the subprocess; head to this section of the supervisor documentation to see how to pass the $HOME and $USER variables through.
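A sketch of the relevant supervisor configuration; the program name, scheduler address, and paths are hypothetical:

```ini
; In supervisord.conf: pass HOME and USER through to the subprocess
[program:dask-worker]
command=dask-worker tcp://scheduler:8786
environment=HOME="/home/dask",USER="dask"
```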