You can run this notebook in a live session on Binder or view it on GitHub.

Arrays

Dask Array provides a parallel, larger-than-memory, n-dimensional array using blocked algorithms. Simply put: distributed NumPy.

  • Parallel: Uses all of the cores on your computer

  • Larger-than-memory: Lets you work on datasets that are larger than your available memory by breaking up your array into many small pieces, operating on those pieces in an order that minimizes the memory footprint of your computation, and effectively streaming data from disk.

  • Blocked Algorithms: Perform large computations by performing many smaller computations

In this notebook, we’ll build some understanding by implementing some blocked algorithms from scratch. We’ll then use Dask Array to analyze large datasets, in parallel, using a familiar NumPy-like API.

Related Documentation

Create data

[1]:
%run prep.py -d random

Setup

[2]:
from dask.distributed import Client

client = Client(n_workers=4, processes=False)

Blocked Algorithms

A blocked algorithm executes on a large dataset by breaking it up into many small blocks.

For example, consider taking the sum of a billion numbers. We might instead break up the array into 1,000 chunks, each of size 1,000,000, take the sum of each chunk, and then take the sum of the intermediate sums.

We achieve the intended result (one sum of one billion numbers) by computing many smaller results (one thousand sums of one million numbers each, followed by another sum of a thousand numbers).
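As a toy illustration (not part of the original notebook), here is the same idea applied to a small in-memory array:

    import numpy as np

    data = np.arange(12)               # small stand-in for "one billion numbers"
    block_sums = [data[i:i + 4].sum()  # sum each block of four numbers...
                  for i in range(0, len(data), 4)]
    total = sum(block_sums)            # ...then sum the block sums
    assert total == data.sum()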

We do exactly this with Python and NumPy in the following example:

[3]:
# Load data with h5py
# this creates a pointer to the data, but does not actually load it
import h5py
import os

f = h5py.File(os.path.join('data', 'random.hdf5'), mode='r')
dset = f['/x']

Compute sum using blocked algorithm

Before using dask, let’s consider the concept of blocked algorithms. We can compute the sum of a large number of elements by loading them chunk-by-chunk, and keeping a running total.

Here we compute the sum of this large array on disk by

  • Computing the sum of each 1,000,000 sized chunk of the array

  • Computing the sum of the 1,000 intermediate sums

Note that both the loading and the summing here happen sequentially in the notebook kernel.

[4]:
# Compute sum of large array, one million numbers at a time
sums = []
for i in range(0, 1000000000, 1000000):
    chunk = dset[i: i + 1000000]  # pull out numpy array
    sums.append(chunk.sum())

total = sum(sums)
print(total)
4999958.6875

Exercise: Compute the mean using a blocked algorithm

Now that we’ve seen the simple example above, try a slightly more complicated problem: compute the mean of the array, assuming for a moment that we don’t already know how many elements are in the data. You can do this by changing the code above with the following alterations:

  • Compute the sum of each block

  • Compute the length of each block

  • Compute the sum of the 1,000 intermediate sums and the sum of the 1,000 intermediate lengths and divide one by the other

This approach is overkill for our case but does nicely generalize if we don’t know the size of the array or individual blocks beforehand.

[5]:
# Compute the mean of the array
sums = []
lengths = []
for i in range(0, 1000000000, 1000000):
    chunk = dset[i: i + 1000000]  # pull out numpy array
    sums.append(chunk.sum())
    lengths.append(len(chunk))

total = sum(sums)
length = sum(lengths)
print(total / length)
0.9999917375

dask.array contains these algorithms

Dask.array is a NumPy-like library that does these kinds of tricks to operate on large datasets that don’t fit into memory. It extends beyond the linear problems discussed above to full N-Dimensional algorithms and a decent subset of the NumPy interface.
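As a quick illustration (not from the original notebook), the two interfaces line up closely on a tiny array; the visible differences are the chunks argument and the trailing .compute():

    import numpy as np
    import dask.array as da

    a_np = np.arange(10)
    a_da = da.arange(10, chunks=5)  # same values, split into two chunks

    a_np.sum()             # 45, computed immediately
    a_da.sum().compute()   # 45, assembled from per-chunk partial sums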

Create dask.array object

You can create a dask.array Array object with the da.from_array function. This function accepts

  • data: Any object that supports NumPy slicing, like dset

  • chunks: A chunk size to tell us how to block up our array, like (1000000,)

[6]:
import dask.array as da

x = da.from_array(dset, chunks=(1000000,))
x
[6]:
         Array         Chunk
Bytes    20.00 MB      4.00 MB
Shape    (5000000,)    (1000000,)
Count    6 Tasks       5 Chunks
Type     float32       numpy.ndarray


Manipulate dask.array object as you would a numpy array

Now that we have an Array, we can perform standard NumPy-style computations like arithmetic, mathematics, slicing, reductions, and so on.

The interface is familiar, but the actual work is different. dask_array.sum() does not do the same thing as numpy_array.sum().

What’s the difference?

dask_array.sum() builds an expression of the computation. It does not do the computation yet. numpy_array.sum() computes the sum immediately.

Why the difference?

Dask arrays are split into chunks. Each chunk must have computations run on that chunk explicitly. If the desired answer comes from a small slice of the entire dataset, running the computation over all data would be wasteful of CPU and memory.
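For example (a small sketch using the x defined above), asking for only a few elements pulls just the chunk that contains them from disk:

    x[:5].compute()  # only the first 1,000,000-element chunk is read, not all five chunks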

[7]:
result = x.sum()
result
[7]:
         Array       Chunk
Bytes    4 B         4 B
Shape    ()          ()
Count    14 Tasks    1 Chunks
Type     float32     numpy.ndarray

Compute result

Dask.array objects are lazily evaluated. Operations like .sum build up a graph of blocked tasks to execute.

We ask for the final result with a call to .compute(). This triggers the actual computation.

[8]:
result.compute()
[8]:
4999958.5

Exercise: Compute the mean

And the variance, std, etc. This should be a small change to the example above.

Look at what other operations you can do with the Jupyter notebook’s tab-completion.

[ ]:
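One possible solution (a sketch; your version may differ):

    x.mean().compute()  # should agree with the blocked-algorithm mean computed earlier
    x.std().compute()
    x.var().compute()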

Does this match your result from before?

Performance and Parallelism

In our first examples we used for loops to walk through the array one block at a time. For simple operations like sum this is optimal. However for complex operations we may want to traverse through the array differently. In particular we may want the following:

  • Use multiple cores in parallel

  • Chain operations on a single block before moving on to the next one

Dask.array translates your array operations into a graph of inter-related tasks with data dependencies between them. Dask then executes this graph in parallel with multiple threads. We’ll discuss more about this in the next section.
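You can inspect such a graph yourself; for example, assuming the optional graphviz dependency is installed, .visualize() renders it:

    result.visualize()  # draw the graph of blocked tasks behind the lazy sum above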

Example

  • Construct a 20000x20000 array of normally distributed random values broken up into 1000x1000 sized chunks

  • Take the mean along one axis

  • Take every 100th element

[9]:
import numpy as np
import dask.array as da

x = da.random.normal(10, 0.1, size=(20000, 20000),  # 400 million element array
                     chunks=(1000, 1000))            # Cut into 1000x1000 sized chunks
y = x.mean(axis=0)[::100]                            # Perform NumPy-style operations
[10]:
x.nbytes / 1e9  # Gigabytes of the input processed lazily
[10]:
3.2
[11]:
%%time
y.compute()  # Time to compute the result
CPU times: user 23 s, sys: 83.7 ms, total: 23 s
Wall time: 11.7 s
[11]:
array([10.00203753,  9.9993978 ,  9.9980022 , 10.00052401, 10.0015649 ,
       10.00000116, 10.00031982,  9.99915236, 10.00015254, 10.00055251,
        9.9997609 ,  9.99903925,  9.99987194, 10.00009037, 10.00103502,
        9.99996215,  9.99967873, 10.00148008,  9.99996552, 10.00042681,
        9.99963818,  9.99966498,  9.99906043, 10.00070462, 10.00012912,
       10.00170384,  9.99952734,  9.99982517, 10.00014967,  9.99897811,
       10.00040895, 10.00049255, 10.00033684,  9.9996503 , 10.00017956,
        9.99882847, 10.00117098, 10.00095035, 10.00055761, 10.00024769,
        9.99977291, 10.00064692, 10.00060743, 10.00036827, 10.00040261,
        9.99930687,  9.9997344 , 10.00041444, 10.00024799,  9.99996307,
        9.99922705,  9.99965806,  9.9999219 ,  9.99920876,  9.99995217,
       10.00051788,  9.99966946, 10.00009827,  9.99892939,  9.99959811,
       10.0003105 ,  9.99955315, 10.00107629,  9.99994425,  9.9999754 ,
       10.00052155, 10.00037634, 10.00122567, 10.00013269,  9.99981215,
       10.00022511,  9.999799  , 10.00035762,  9.99940273,  9.9992383 ,
        9.99909219, 10.00033829, 10.00033968, 10.00035576,  9.99923438,
        9.99997572,  9.99932008, 10.00053097,  9.99930428, 10.00013967,
       10.0001819 , 10.00181981, 10.00085837, 10.00055144,  9.99985314,
        9.99936738, 10.00032964,  9.99990733,  9.99889819, 10.00016327,
       10.00052869,  9.99848241, 10.0020514 ,  9.99921771,  9.99998431,
        9.99939612,  9.99953411, 10.00016584,  9.9999745 , 10.00094935,
       10.00025267, 10.00022229,  9.99930273,  9.99925037,  9.99875239,
        9.99965334,  9.99901503, 10.00066181,  9.99967289,  9.9997693 ,
        9.9993434 , 10.00025992,  9.99940273, 10.00039467,  9.99963882,
        9.99916121,  9.9989048 ,  9.99925212,  9.9999126 , 10.0001267 ,
        9.99962573,  9.99955934,  9.99907926, 10.00083547, 10.00025017,
       10.00059017,  9.99985526,  9.99945142,  9.99907718,  9.99955577,
        9.9997332 , 10.00103734,  9.99979971,  9.99981409,  9.99939172,
       10.00065277,  9.99984971,  9.99988099,  9.99933878,  9.99987674,
       10.00117044, 10.00011794, 10.00200667, 10.0006237 , 10.00016385,
        9.99962223,  9.99998937,  9.99984696,  9.99921894,  9.99894293,
        9.99914216,  9.99909344,  9.99884074,  9.99983064, 10.00026088,
        9.99819632, 10.00093004, 10.0001569 , 10.00038558,  9.99988989,
       10.00039085,  9.99958346, 10.00056266,  9.99936783,  9.99971825,
        9.99947252,  9.99989208, 10.00048755,  9.99916958,  9.99925446,
        9.99943453,  9.99992949, 10.00018561, 10.00038067,  9.99971971,
        9.99927749,  9.99926869,  9.99906961,  9.99969765,  9.99976634,
       10.0006786 ,  9.99970671,  9.99959582,  9.99951612, 10.00029113,
        9.99997426,  9.99954466,  9.99999882,  9.999663  , 10.00008455,
       10.00117634,  9.99956758,  9.99990892,  9.99947216, 10.0012262 ])

Performance comparison

The following experiment was performed on a heavy personal laptop. Your performance may vary. If you attempt the NumPy version then please ensure that you have more than 4GB of main memory.

NumPy: 19s, Needs gigabytes of memory

import numpy as np

%%time
x = np.random.normal(10, 0.1, size=(20000, 20000))
y = x.mean(axis=0)[::100]
y

CPU times: user 19.6 s, sys: 160 ms, total: 19.8 s
Wall time: 19.7 s

Dask Array: 4s, Needs megabytes of memory

import dask.array as da

%%time
x = da.random.normal(10, 0.1, size=(20000, 20000), chunks=(1000, 1000))
y = x.mean(axis=0)[::100]
y.compute()

CPU times: user 29.4 s, sys: 1.07 s, total: 30.5 s
Wall time: 4.01 s

Discussion

Notice that the Dask array computation ran in 4 seconds, but used 29.4 seconds of user CPU time. The numpy computation ran in 19.7 seconds and used 19.6 seconds of user CPU time.

Dask finished faster, but used more total CPU time because it was able to transparently parallelize the computation across the chunks.

Questions

  • What happens if the dask chunks=(20000,20000)?

    • Will the computation run in 4 seconds?

    • How much memory will be used?

  • What happens if the dask chunks=(25,25)?

    • What happens to CPU and memory?
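A sketch for exploring these questions yourself (illustrative only; the (25, 25) case creates hundreds of thousands of tiny chunks, so expect per-task overhead to dominate):

    import dask.array as da

    def build(chunks):
        # same computation as above, with a different chunk shape
        x = da.random.normal(10, 0.1, size=(20000, 20000), chunks=chunks)
        return x.mean(axis=0)[::100]

    y = build((20000, 20000))  # one giant chunk: no parallelism, all 3.2 GB at once
    # %time y.compute()

    y = build((25, 25))        # 640,000 tiny chunks: scheduling overhead dominates
    # %time y.compute()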

Exercise: Meteorological data

There is 2GB of somewhat artificial weather data in HDF5 files in data/weather-big/*.hdf5. We’ll use the h5py library to interact with this data and dask.array to compute on it.

Our goal is to visualize the average temperature on the surface of the Earth for this month. This will require a mean over all of this data. We’ll do this in the following steps:

  • Create h5py.Dataset objects for each of the days of data on disk (dsets)

  • Wrap these with da.from_array calls

  • Stack these datasets along time with a call to da.stack

  • Compute the mean along the newly stacked time axis with the .mean() method

  • Visualize the result with matplotlib.pyplot.imshow

[12]:
%run prep.py -d weather
[13]:
import h5py
from glob import glob
import os

filenames = sorted(glob(os.path.join('data', 'weather-big', '*.hdf5')))
dsets = [h5py.File(filename, mode='r')['/t2m'] for filename in filenames]
dsets[0]
[13]:
<HDF5 dataset "t2m": shape (180, 360), type "<f8">
[ ]:
[14]:
dsets[0][:5, :5]  # Slicing into h5py.Dataset object gives a numpy array
[14]:
array([[84.75, 84.75, 84.75, 84.75, 84.75],
       [83.  , 83.  , 83.  , 83.  , 83.  ],
       [84.5 , 84.  , 84.  , 84.  , 84.  ],
       [81.25, 81.25, 81.25, 81.25, 81.25],
       [77.75, 77.75, 77.75, 77.75, 77.75]])
[15]:
%matplotlib inline
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(16, 8))
plt.imshow(dsets[0][::4, ::4], cmap='RdBu_r');

(figure: surface temperature for the first day, downsampled by a factor of 4)

Integrate with dask.array

Make a list of dask.array objects out of your list of h5py.Dataset objects using the da.from_array function with a chunk size of (500, 500).

[ ]:
[16]:
arrays = [da.from_array(dset, chunks=(500, 500)) for dset in dsets]
arrays
[16]:
[dask.array<array, shape=(180, 360), dtype=float64, chunksize=(180, 360), chunktype=numpy.ndarray>,
 dask.array<array, shape=(180, 360), dtype=float64, chunksize=(180, 360), chunktype=numpy.ndarray>,
 dask.array<array, shape=(180, 360), dtype=float64, chunksize=(180, 360), chunktype=numpy.ndarray>,
 dask.array<array, shape=(180, 360), dtype=float64, chunksize=(180, 360), chunktype=numpy.ndarray>,
 dask.array<array, shape=(180, 360), dtype=float64, chunksize=(180, 360), chunktype=numpy.ndarray>,
 dask.array<array, shape=(180, 360), dtype=float64, chunksize=(180, 360), chunktype=numpy.ndarray>,
 dask.array<array, shape=(180, 360), dtype=float64, chunksize=(180, 360), chunktype=numpy.ndarray>,
 dask.array<array, shape=(180, 360), dtype=float64, chunksize=(180, 360), chunktype=numpy.ndarray>,
 dask.array<array, shape=(180, 360), dtype=float64, chunksize=(180, 360), chunktype=numpy.ndarray>,
 dask.array<array, shape=(180, 360), dtype=float64, chunksize=(180, 360), chunktype=numpy.ndarray>,
 dask.array<array, shape=(180, 360), dtype=float64, chunksize=(180, 360), chunktype=numpy.ndarray>,
 dask.array<array, shape=(180, 360), dtype=float64, chunksize=(180, 360), chunktype=numpy.ndarray>,
 dask.array<array, shape=(180, 360), dtype=float64, chunksize=(180, 360), chunktype=numpy.ndarray>,
 dask.array<array, shape=(180, 360), dtype=float64, chunksize=(180, 360), chunktype=numpy.ndarray>,
 dask.array<array, shape=(180, 360), dtype=float64, chunksize=(180, 360), chunktype=numpy.ndarray>,
 dask.array<array, shape=(180, 360), dtype=float64, chunksize=(180, 360), chunktype=numpy.ndarray>,
 dask.array<array, shape=(180, 360), dtype=float64, chunksize=(180, 360), chunktype=numpy.ndarray>,
 dask.array<array, shape=(180, 360), dtype=float64, chunksize=(180, 360), chunktype=numpy.ndarray>,
 dask.array<array, shape=(180, 360), dtype=float64, chunksize=(180, 360), chunktype=numpy.ndarray>,
 dask.array<array, shape=(180, 360), dtype=float64, chunksize=(180, 360), chunktype=numpy.ndarray>,
 dask.array<array, shape=(180, 360), dtype=float64, chunksize=(180, 360), chunktype=numpy.ndarray>,
 dask.array<array, shape=(180, 360), dtype=float64, chunksize=(180, 360), chunktype=numpy.ndarray>,
 dask.array<array, shape=(180, 360), dtype=float64, chunksize=(180, 360), chunktype=numpy.ndarray>,
 dask.array<array, shape=(180, 360), dtype=float64, chunksize=(180, 360), chunktype=numpy.ndarray>,
 dask.array<array, shape=(180, 360), dtype=float64, chunksize=(180, 360), chunktype=numpy.ndarray>,
 dask.array<array, shape=(180, 360), dtype=float64, chunksize=(180, 360), chunktype=numpy.ndarray>,
 dask.array<array, shape=(180, 360), dtype=float64, chunksize=(180, 360), chunktype=numpy.ndarray>,
 dask.array<array, shape=(180, 360), dtype=float64, chunksize=(180, 360), chunktype=numpy.ndarray>,
 dask.array<array, shape=(180, 360), dtype=float64, chunksize=(180, 360), chunktype=numpy.ndarray>,
 dask.array<array, shape=(180, 360), dtype=float64, chunksize=(180, 360), chunktype=numpy.ndarray>,
 dask.array<array, shape=(180, 360), dtype=float64, chunksize=(180, 360), chunktype=numpy.ndarray>]

Stack this list of dask.array objects into a single dask.array object with da.stack

Stack these along the first axis so that the shape of the resulting array is (31, 180, 360).

[ ]:
[17]:
x = da.stack(arrays, axis=0)
x
[17]:
         Array             Chunk
Bytes    16.07 MB          518.40 kB
Shape    (31, 180, 360)    (1, 180, 360)
Count    93 Tasks          31 Chunks
Type     float64           numpy.ndarray


Plot the mean of this array along the time (0th) axis

[18]:
# complete the following:
fig = plt.figure(figsize=(16, 8))
plt.imshow(..., cmap='RdBu_r')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-18-d2371b8f5286> in <module>
      1 # complete the following:
      2 fig = plt.figure(figsize=(16, 8))
----> 3 plt.imshow(..., cmap='RdBu_r')

...

TypeError: Image data of dtype object cannot be converted to float


[19]:
result = x.mean(axis=0)
fig = plt.figure(figsize=(16, 8))
plt.imshow(result, cmap='RdBu_r');

(figure: mean temperature over the month)

Plot the difference of the first day from the mean

[ ]:
[20]:
result = x[0] - x.mean(axis=0)
fig = plt.figure(figsize=(16, 8))
plt.imshow(result, cmap='RdBu_r');

(figure: difference between the first day and the monthly mean)

Exercise: Subsample and store

In the above exercise the result of our computation is small, so we can call compute safely. Sometimes our result is still too large to fit into memory and we want to save it to disk. In these cases you can use one of the following two functions

  • da.store: Store dask.array into any object that supports numpy setitem syntax, e.g.

    f = h5py.File('myfile.hdf5')
    output = f.create_dataset(shape=..., dtype=...)

    da.store(my_dask_array, output)

  • da.to_hdf5: A specialized function that creates and stores a dask.array object into an HDF5 file.

    da.to_hdf5('data/myfile.hdf5', '/output', my_dask_array)

The task in this exercise is to use NumPy step slicing to subsample the full dataset by a factor of two in both the latitude and longitude directions and then store this result to disk using one of the functions listed above.

As a reminder, Python slicing takes three elements

start:stop:step

>>> L = [1, 2, 3, 4, 5, 6, 7]
>>> L[::3]
[1, 4, 7]
[ ]:
[ ]:
[21]:
import h5py
from glob import glob
import os
import dask.array as da

filenames = sorted(glob(os.path.join('data', 'weather-big', '*.hdf5')))
dsets = [h5py.File(filename, mode='r')['/t2m'] for filename in filenames]

arrays = [da.from_array(dset, chunks=(500, 500)) for dset in dsets]

x = da.stack(arrays, axis=0)

result = x[:, ::2, ::2]

da.to_zarr(result, os.path.join('data', 'myfile.zarr'), overwrite=True)
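If you would rather use one of the two functions listed above, the last line could instead use da.to_hdf5 (a sketch, not run here):

    da.to_hdf5(os.path.join('data', 'myfile.hdf5'), '/output', result)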

Example: Lennard-Jones potential

The Lennard-Jones potential is used in particle simulations in physics, chemistry and engineering. It is highly parallelizable.

First, we’ll run and profile the NumPy version on 7,000 particles.

[22]:
import numpy as np

# make a random collection of particles
def make_cluster(natoms, radius=40, seed=1981):
    np.random.seed(seed)
    cluster = np.random.normal(0, radius, (natoms,3))-0.5
    return cluster

def lj(r2):
    sr6 = (1./r2)**3
    pot = 4.*(sr6*sr6 - sr6)
    return pot

# build the matrix of distances
def distances(cluster):
    diff = cluster[:, np.newaxis, :] - cluster[np.newaxis, :, :]
    mat = (diff*diff).sum(-1)
    return mat

# the lj function is evaluated over the upper triangle
# after removing distances near zero
def potential(cluster):
    d2 = distances(cluster)
    dtri = np.triu(d2)
    energy = lj(dtri[dtri > 1e-6]).sum()
    return energy
[23]:
cluster = make_cluster(int(7e3), radius=500)
[24]:
%time potential(cluster)
CPU times: user 4.04 s, sys: 1.31 s, total: 5.35 s
Wall time: 4.58 s
[24]:
-0.21282893668845293

Notice that the most time-consuming function is distances:

[25]:
# this would open in another browser tab
# %load_ext snakeviz
# %snakeviz potential(cluster)

# alternative simple version that gives text results in this tab
%prun -s tottime potential(cluster)

Dask version

Here’s the Dask version. Only the potential function needs to be rewritten to best utilize Dask.

Note that da.nansum has been used over the full (NxN) distance matrix to improve parallel efficiency.

[26]:
import dask.array as da

# compute the potential on the entire
# matrix of distances and ignore division by zero
def potential_dask(cluster):
    d2 = distances(cluster)
    energy = da.nansum(lj(d2))/2.
    return energy

Let’s convert the NumPy array to a Dask array. Since the entire NumPy array fits in memory it is more computationally efficient to chunk the array by number of CPU cores.

[27]:
from os import cpu_count

dcluster = da.from_array(cluster, chunks=cluster.shape[0]//cpu_count())

This step should scale quite well with the number of cores. The warnings are complaining about dividing by zero, which is why we used da.nansum in potential_dask.

[28]:
e = potential_dask(dcluster)
%time e.compute()
/home/travis/miniconda/envs/test/lib/python3.7/site-packages/dask/core.py:119: RuntimeWarning: divide by zero encountered in true_divide
  return func(*args2)
/home/travis/miniconda/envs/test/lib/python3.7/site-packages/dask/core.py:119: RuntimeWarning: invalid value encountered in subtract
  return func(*args2)
CPU times: user 7.59 s, sys: 1.38 s, total: 8.96 s
Wall time: 4.55 s
[28]:
-0.21282893668845299

Limitations

Dask Array does not implement the entire numpy interface. Users expecting this will be disappointed. Notably Dask Array has the following failings:

  • Dask does not implement all of np.linalg. This has been done by a number of excellent BLAS/LAPACK implementations and is the focus of numerous ongoing academic research projects.

  • Dask Array does not support some operations where the resulting shape depends on the values of the array. For those that it does support (for example, masking one Dask Array with another boolean mask), the chunk sizes will be unknown, which may cause issues with other operations that need to know the chunk sizes.

  • Dask Array does not attempt operations like sort which are notoriously difficult to do in parallel and are of somewhat diminished value on very large data (you rarely actually need a full sort). Often we include parallel-friendly alternatives like topk (see the short sketch after this list).

  • Dask development is driven by immediate need, and so many lesser used functions, like np.sometrue have not been implemented purely out of laziness. These would make excellent community contributions.
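As a small sketch of the topk alternative mentioned above (assuming a reasonably recent Dask version):

    import dask.array as da

    x = da.random.random(1_000_000, chunks=100_000)
    largest = da.topk(x, 5)  # the five largest values, found chunk by chunk, no full sort
    largest.compute()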

[29]:
client.shutdown()