DataFrame

A Dask DataFrame is a large parallel DataFrame composed of many smaller Pandas DataFrames, split along the index. These Pandas DataFrames may live on disk for larger-than-memory computing on a single machine, or on many different machines in a cluster. One Dask DataFrame operation triggers many operations on the constituent Pandas DataFrames.

Examples

Visit https://examples.dask.org/dataframe.html to see and run examples using Dask DataFrame.

Design

Dask DataFrames coordinate many Pandas DataFrames/Series arranged along the index. A Dask DataFrame is partitioned row-wise, grouping rows by index value for efficiency. These Pandas objects may live on disk or on other machines.

Dask DataFrame copies the Pandas API

Because the dask.dataframe application programming interface (API) is a subset of the Pandas API, it should be familiar to Pandas users. There are some slight alterations due to the parallel nature of Dask:

  >>> import dask.dataframe as dd
  >>> df = dd.read_csv('2014-*.csv')
  >>> df.head()
     x  y
  0  1  a
  1  2  b
  2  3  c
  3  4  a
  4  5  b
  5  6  c

  >>> df2 = df[df.y == 'a'].x + 1

As with all Dask collections, one triggers computation by calling the .compute() method:

  >>> df2.compute()
  0    2
  3    5
  Name: x, dtype: int64

Common Uses and Anti-Uses

Dask DataFrame is used in situations where Pandas is commonly needed, usually when Pandas fails due to data size or computation speed:

  • Manipulating large datasets, even when those datasets don’t fit in memory
  • Accelerating long computations by using many cores
  • Distributed computing on large datasets with standard Pandas operations like groupby, join, and time series computations

Dask DataFrame may not be the best choice in the following situations:

  • If your dataset fits comfortably into RAM on your laptop, then you may be better off just using Pandas. There may be simpler ways to improve performance than through parallelism
  • If your dataset doesn’t fit neatly into the Pandas tabular model, then you might find more use in dask.bag or dask.array
  • If you need functions that are not implemented in Dask DataFrame, then you might want to look at dask.delayed which offers more flexibility
  • If you need a proper database with all that databases offer, you might prefer something like Postgres

Scope

Dask DataFrame covers a well-used portion of the Pandas API. The following class of computations works well:

    • Trivially parallelizable operations (fast):
      • Element-wise operations: df.x + df.y, df * df
      • Row-wise selections: df[df.x > 0]
      • Loc: df.loc[4.0:10.5]
      • Common aggregations: df.x.max(), df.max()
      • Is in: df[df.x.isin([1, 2, 3])]
      • Date time/string accessors: df.timestamp.month
    • Cleverly parallelizable operations (fast):
      • groupby-aggregate (with common aggregations): df.groupby(df.x).y.max(), df.groupby('x').max()
      • groupby-apply on index: df.groupby(['idx', 'x']).apply(myfunc), where idx is the index level name
      • value_counts: df.x.value_counts()
      • Drop duplicates: df.x.drop_duplicates()
      • Join on index: dd.merge(df1, df2, left_index=True, right_index=True) or dd.merge(df1, df2, on=['idx', 'x']) where idx is the index name for both df1 and df2
      • Join with Pandas DataFrames: dd.merge(df1, df2, on='id')
      • Element-wise operations with different partitions / divisions: df1.x + df2.y
      • Date time resampling: df.resample(…)
      • Rolling averages: df.rolling(…)
      • Pearson’s correlation: df[['col1', 'col2']].corr()
    • Operations requiring a shuffle (slow-ish, unless on index):
      • Set index: df.set_index(df.x)
      • groupby-apply not on index (with anything): df.groupby(df.x).apply(myfunc)
      • Join not on the index: dd.merge(df1, df2, on='name')

However, Dask DataFrame does not implement the entire Pandas interface. Users expecting this will be disappointed. Notably, Dask DataFrame has the following limitations:

  • Setting a new index from an unsorted column is expensive
  • Many operations like groupby-apply and join on unsorted columns require setting the index, which as mentioned above, is expensive
  • The Pandas API is very large. Dask DataFrame does not attempt to implement many Pandas features or any of the more exotic data structures like NDFrames
  • Operations that were slow on Pandas, like iterating through row-by-row, remain slow on Dask DataFrame

See the DataFrame API documentation for a more extensive list.

Execution

By default, Dask DataFrame uses the multi-threaded scheduler. This exposes some parallelism when Pandas or the underlying NumPy operations release the global interpreter lock (GIL). Generally, Pandas is more GIL-bound than NumPy, so multi-core speed-ups are not as pronounced for Dask DataFrame as they are for Dask Array. This is changing, and the Pandas development team is actively working on releasing the GIL.

When dealing with text data, you may see speedups by switching to the newer distributed scheduler, either on a cluster or on a single machine.