4.7. Distributed sort

Distributed sort allows to sort data which exceeds query.max-memory-per-node.Distributed sort is enabled via distributed_sort session property ordistributed-sort configuration property set inetc/config.properties of the coordinator. Distributed sort is enabled bydefault.

When distributed sort is enabled, sort operator executes in parallel on multiplenodes in the cluster. Partially sorted data from each Presto worker node is then streamedto a single worker node for a final merge. This technique allows to utilize memory of multiplePresto worker nodes for sorting. The primary purpose of distributed sort is to allow for sortingof data sets which don’t normally fit into single node memory. Performance improvementcan be expected, but it won’t scale linearly with the number of nodes since thedata needs to be merged by a single node.