4.4. Spill to Disk

Overview

In the case of memory intensive operations, Presto allows offloadingintermediate operation results to disk. The goal of this mechanism is toenable execution of queries that require amounts of memory exceeding per queryor per node limits.

The mechanism is similar to OS level page swapping. However, it isimplemented on the application level to address specific needs of Presto.

Properties related to spilling are described in Spilling Properties.

Memory Management and Spill

By default, Presto kills queries if the memory requested by the query executionexceeds session properties query_max_memory orquery_max_memory_per_node. This mechanism ensures fairness in allocationof memory to queries and prevents deadlock caused by memory allocation.It is efficient when there is a lot of small queries in the cluster, butleads to killing large queries that don’t stay within the limits.

To overcome this inefficiency, the concept of revocable memory was introduced. Aquery can request memory that does not count toward the limits, but this memorycan be revoked by the memory manager at any time. When memory is revoked, thequery runner spills intermediate data from memory to disk and continues toprocess it later.

In practice, when the cluster is idle, and all memory is available, a memoryintensive query may use all of the memory in the cluster. On the other hand,when the cluster does not have much free memory, the same query may be forced touse disk as storage for intermediate data. A query that is forced to spill todisk may have a longer execution time by orders of magnitude than a query thatruns completely in memory.

Please note that enabling spill-to-disk does not guarantee execution of allmemory intensive queries. It is still possible that the query runner will failto divide intermediate data into chunks small enough that every chunk fits intomemory, leading to Out of memory errors while loading the data from disk.

Revocable memory and reserved pool

Both reserved memory pool and revocable memory are designed to cope with low memory conditions.When user memory pool is exhausted then a single query will be promoted to a reserved pool.In such case only that query is allowed to progress thus reducing clusterconcurrency. Revocable memory will try to prevent that by triggering spill.Reserved pool is of query.max-total-memory-per-node size. Ifquery.max-total-memory-per-node is large compared to the total memoryavailable on the node, then the general memory pool may not have enoughmemory to run larger queries. If spill is enabled, then this will causeexcessive spilling for queries that consume large amounts of memory per node.These queries would finish much quicker if spill were disabled because theywould execute in the reserved pool. However, doing so could also significantlyreduce cluster concurrency. In such a situation we recommend disabling thereserved memory pool via the experimental.reserved-pool-enabled configproperty.

Spill Disk Space

Spilling intermediate results to disk and retrieving them back is expensivein terms of IO operations. Thus, queries that use spill likely becomethrottled by disk. To increase query performance it is recommended toprovide multiple paths on separate local devices for spill (propertyspiller-spill-path in Spilling Properties).

The system drive should not be used for spilling, especially not to the drive where the JVMis running and writing logs. Doing so may lead to cluster instability. Additionally,it is recommended to monitor the disk saturation of the configured spill paths.

Presto treats spill paths as independent disks (see JBOD), sothere is no need to use RAID for spill.

Spill Compression

When spill compression is enabled (spill-compression-enabled property inSpilling Properties), spilled pages will be compressed using the sameimplementation as exchange compression when they are sufficiently compressible.Enabling this feature can reduce the amount of disk IO at the costof extra CPU load to compress and decompress spilled pages.

Spill Encryption

When spill encryption is enabled (spill-encryption-enabled property inSpilling Properties), spill contents will be encrypted with a randomly generated(per spill file) secret key. Enabling this will decrease the performance of spillingto disk but can protect spilled data from being recovered from the files written to disk.

Note: Some distributions of Java ship with policy files that limit the strengthof the cryptographic keys that can be used. Spill encryption uses256-bit AES keys and may require Unlimited Strength JCEpolicy files to work correctly.

Supported Operations

Not all operations support spilling to disk, and each handles spillingdifferently. Currently, the mechanism is implemented for the followingoperations.

Joins

During the join operation, one of the tables being joined is stored in memory.This table is called the build table. The rows from the other table streamthrough and are passed onto the next operation if they match rows in the buildtable. The most memory-intensive part of the join is this build table.

When the task concurrency is greater than one, the build table is partitioned.The number of partitions is equal to the value of the task.concurrencyconfiguration parameter (see Task Properties).

When the build table is partitioned, the spill-to-disk mechanism can decreasethe peak memory usage needed by the join operation. When a query approaches thememory limit, a subset of the partitions of the build table gets spilled to disk,along with rows from the other table that fall into those same partitions. Thenumber of partitions that get spilled influences the amount of disk space needed.

Afterward, the spilled partitions are read back one-by-one to finish the joinoperation.

With this mechanism, the peak memory used by the join operator can be decreasedto the size of the largest build table partition. Assuming no data skew, this willbe 1 / task.concurrency times the size of the whole build table.

Aggregations

Aggregation functions perform an operation on a group of values and return onevalue. If the number of groups you’re aggregating over is large, a significantamount of memory may be needed. When spill-to-disk is enabled, if there is notenough memory, intermediate cumulated aggregation results are written to disk.They are loaded back and merged when memory is available.