Slow or Hanging Jobs

Symptom:

The first scan of tuples after bulk data load, modification, or deletion jobs on heap tables are running slow or hanging.

Potential Cause:

When a workload involves a bulk load, modification, or deletion of data in a heap table, the first scan post-operation may generate a large amount of WAL data when checksums are enabled (data_check_sums=true) or hint bits are logged (wal_log_hints=true), leading to slow or hung jobs.

Affected workloads include: restoring from a backup, loading data with gpcopy or COPY, cluster expansion, CTAS/INSERT/UPDATE/DELETE operations, and ALTER TABLE operations that modify tuples.

Explanation:

Greenplum Database uses hint bits to mark tuples as created and/or deleted by transactions. Hint bits, when set, can help in determining visibility of tuples without expensive pg_clog and pg_subtrans commit log lookups.

Hint bits are updated for every tuple on the first scan of the tuple after its creation or deletion. Because hint bits are checked and set on a per-tuple basis, even a read can result in heavy writes. When data checksums are enabled for heap tables (the default), hint bit updates are always WAL-logged.

Solution:

If you have restored or loaded a complete database comprised primarily of heap tables, you may choose to run VACUUM against the entire database.

Alternatively, if you can identify the individual tables affected, you have two options:

  1. Schedule and take a maintenance window and run VACUUM on the specific tables that have been loaded, updated, or deleted in bulk. This operation should scan all of the tuples and set and WAL-log the hint bits, taking the performance hit up-front.

  2. Run SELECT count(*) FROM <table-name> on each table. This operation similarly scans all of the tuples and sets and WAL-logs the hint bits.

All subsequent scans as part of regular workloads on the tables should not be required to generate hints or their accompanying full page image WAL records.