背景

PostgreSQL heap TABLE AM引擎,使用多版本来解决快照问题,版本处于当前数据文件中,有垃圾回收进程进行回收,那么哪些垃圾不能被回收呢?

WAL是PG的REDO文件,哪些WAL不能被回收重复利用?什么情况下可能会一直增长不清理呢?

heap或INDEX的膨胀有些时候并不是因为回收慢,而是有些是无法被回收的垃圾,通常被称为膨胀点。本文对膨胀点进行逐一解释(回收慢不解释,可能: worker太少,io太差,worker睡眠太长或频繁,vacuum mem太少放不下所有垃圾行CTID导致多次扫描索引,launcher唤醒周期太长,表太大未支持并行垃圾回收, …)。

除了snapshot too old以外,12新增AM例如zedstore, zheap将彻底解决heap的垃圾版本带来的膨胀问题。

全局catalog 膨胀点

全局catalog包括tbs,db,role等,如下:

  1. postgres=# select relname from pg_class
  2. where reltablespace in
  3. (select oid from pg_tablespace where spcname='pg_global')
  4. and relkind='r';
  5. relname
  6. -----------------------
  7. pg_authid
  8. pg_subscription
  9. pg_database
  10. pg_db_role_setting
  11. pg_tablespace
  12. pg_pltemplate
  13. pg_auth_members
  14. pg_shdepend
  15. pg_shdescription
  16. pg_replication_origin
  17. pg_shseclabel
  18. (11 rows)

哪些垃圾不能被回收?

1、当前实例中最老事务快照之后产生的垃圾记录

2、SLOT catalog_xmin后产生的垃圾记录

3、年龄小于vacuum_defer_cleanup_age设置的垃圾记录

4、备库开启了feedback后,备库中最老事务快照(包括catalog_xmin, global xmin)之后产生的垃圾记录

什么时候可能膨胀?

1、standby 开启了 feedback (且standby有慢事务, LONG SQL, 慢/dead slot),

2、慢/dead slot(catalog_xmin, 影响catalog垃圾回收),

3、vacuum_defer_cleanup_age 设置太大

4、整个实例中的 : 长事务, 慢SQL, 慢2pc,

库级catalog 膨胀点

库级catalog包括如下:

  1. postgres=#
  2. select relname from pg_class where relkind='r'
  3. and relnamespace ='pg_catalog'::regnamespace
  4. except
  5. select relname from pg_class where reltablespace in
  6. (select oid from pg_tablespace where spcname = 'pg_global')
  7. and relkind='r';
  8. relname
  9. -------------------------
  10. pg_language
  11. pg_sequence
  12. pg_largeobject
  13. pg_policy
  14. pg_ts_template
  15. pg_attrdef
  16. pg_operator
  17. pg_ts_parser
  18. pg_depend
  19. pg_attribute
  20. pg_ts_config
  21. pg_conversion
  22. pg_inherits
  23. pg_subscription_rel
  24. pg_publication
  25. pg_foreign_table
  26. pg_largeobject_metadata
  27. pg_ts_dict
  28. pg_statistic
  29. pg_init_privs
  30. pg_opfamily
  31. pg_type
  32. pg_am
  33. pg_default_acl
  34. pg_proc
  35. pg_index
  36. pg_rewrite
  37. pg_statistic_ext
  38. pg_constraint
  39. pg_opclass
  40. pg_partitioned_table
  41. pg_namespace
  42. pg_trigger
  43. pg_enum
  44. pg_amop
  45. pg_event_trigger
  46. pg_collation
  47. pg_foreign_server
  48. pg_foreign_data_wrapper
  49. pg_user_mapping
  50. pg_description
  51. pg_cast
  52. pg_publication_rel
  53. pg_aggregate
  54. pg_transform
  55. pg_extension
  56. pg_class
  57. pg_seclabel
  58. pg_amproc
  59. pg_range
  60. pg_ts_config_map
  61. (51 rows)

哪些垃圾不能被回收?

1、当前数据库中最老事务快照之后产生的垃圾记录

2、年龄小于vacuum_defer_cleanup_age设置的垃圾记录

3、备库开启了feedback后,备库返回的最老事务快照(包括catalog_xmin, global xmin)之后产生的垃圾记录

4、SLOT catalog_xmin后产生的垃圾记录(create table, drop table, pg_class, pg_att等)。影响全局(所有DB)

什么时候可能膨胀?

1、standby 开启了 feedback (且standby有慢事务, LONG SQL, 慢/dead slot),

2、慢/dead slot(catalog_xmin, 影响catalog垃圾回收),

3、vacuum_defer_cleanup_age 设置太大

4、当前数据库中的 : 长事务, 慢SQL, 慢2pc,

普通对象 膨胀点

用户创建的表、物化视图、索引等。

哪些垃圾不能被回收?

1、当前数据库中最老事务快照之后产生的垃圾记录

2、年龄小于vacuum_defer_cleanup_age设置的垃圾记录

3、备库开启了feedback后,备库返回的最老事务快照(仅指 global xmin)之后产生的垃圾记录。(catalog xmin无影响)

什么时候可能膨胀?

1、standby 开启了 feedback (且standby有慢事务, LONG SQL),

2、vacuum_defer_cleanup_age 设置太大

3、当前数据库中的 : 长事务, 慢SQL, 慢2pc,

WAL文件 膨胀点

wal是指PG的REDO文件。

哪些WAL不能被回收 或 不能被重复利用?

1、从最后一次已正常结束的检查点(检查点开始时刻, 不是结束时刻)开始,所有的REDO文件都不能被回收

2、归档开启后,所有未归档的REDO。(.ready对应的redo文件)

3、启用SLOT后,还没有被SLOT消费的REDO文件

4、设置wal_keep_segments时,当REDO文件数还没有达到wal_keep_segments个时。

什么时候可能膨胀?

1、archive failed ,归档失败

2、user defined archive BUG,用户开启了归档,但是没有正常的将.ready改成.done,使得WAL堆积

3、wal_keep_segments 设置太大,WAL保留过多

4、max_wal_size设置太大,并且checkpoint_completion_target设置太大,导致检查点跨度很大,保留WAL文件很多

5、slot slow(dead) ,包括(physical | logical replication) , restart_lsn 开始的所有WAL文件都要被保留

一些例子

1、创建slot

  1. postgres=# select pg_create_logical_replication_slot('a','test_decoding');
  2. pg_create_logical_replication_slot
  3. ------------------------------------
  4. (a,0/92C9C038)
  5. (1 row)

2、查看slot的位点信息

  1. postgres=# select * from pg_get_replication_slots();
  2. slot_name | plugin | slot_type | datoid | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
  3. -----------+---------------+-----------+--------+-----------+--------+------------+------+--------------+-------------+---------------------
  4. a | test_decoding | logical | 13585 | f | f | | | 1982645 | 0/92C9BFE8 | 0/92C9C038
  5. (1 row)

3、查看catalog_xmin对应XID的事务提交时间,需要开启事务时间跟踪track_commit_timestamp

  1. postgres=# select pg_xact_commit_timestamp(xmin),pg_xact_commit_timestamp(catalog_xmin) from pg_get_replication_slots();
  2. psql: ERROR: could not get commit timestamp data
  3. HINT: Make sure the configuration parameter "track_commit_timestamp" is set.

4、从RESTART_LSN找到对应WAL文件,从文件中也可以查到大概的时间。

  1. postgres=# select pg_walfile_name(restart_lsn) from pg_get_replication_slots();
  2. pg_walfile_name
  3. --------------------------
  4. 000000010000000000000092
  5. (1 row)
  6. postgres=# select * from pg_stat_file('pg_wal/000000010000000000000092');
  7. size | access | modification | change | creation | isdir
  8. ----------+------------------------+------------------------+------------------------+----------+-------
  9. 16777216 | 2019-06-29 22:56:16+08 | 2019-07-01 09:50:16+08 | 2019-07-01 09:50:16+08 | | f
  10. (1 row)
  11. postgres=# select * from pg_ls_waldir() where name='000000010000000000000092';
  12. name | size | modification
  13. --------------------------+----------+------------------------
  14. 000000010000000000000092 | 16777216 | 2019-07-01 09:50:16+08
  15. (1 row)

5、建表

  1. postgres=# create table b(id int);
  2. CREATE TABLE
  3. postgres=# insert into b values (1);
  4. INSERT 0 1

6、消费SLOT WAL

  1. postgres=# select * from pg_logical_slot_get_changes('a',pg_current_wal_lsn(),1);
  2. lsn | xid | data
  3. ------------+---------+----------------
  4. 0/92C9C0C0 | 1982645 | BEGIN 1982645
  5. 0/92CA4A40 | 1982645 | COMMIT 1982645
  6. (2 rows)
  7. postgres=# select * from pg_logical_slot_get_changes('a',pg_current_wal_lsn(),1);
  8. lsn | xid | data
  9. ------------+---------+---------------------------------------
  10. 0/92CA4A78 | 1982646 | BEGIN 1982646
  11. 0/92CA4A78 | 1982646 | table public.b: INSERT: id[integer]:1
  12. 0/92CA4AE8 | 1982646 | COMMIT 1982646
  13. (3 rows)

7、删除记录

  1. postgres=# delete from b;
  2. DELETE 1

8、垃圾回收,正常。本地表垃圾不受slot catalog_xmin影响

  1. postgres=# vacuum verbose b;
  2. psql: INFO: vacuuming "public.b"
  3. psql: INFO: "b": removed 1 row versions in 1 pages
  4. psql: INFO: "b": found 1 removable, 0 nonremovable row versions in 1 out of 1 pages
  5. DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 1982648
  6. There were 0 unused item identifiers.
  7. Skipped 0 pages due to buffer pins, 0 frozen pages.
  8. 0 pages are entirely empty.
  9. CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
  10. psql: INFO: "b": truncated 1 to 0 pages
  11. DETAIL: CPU: user: 0.09 s, system: 0.00 s, elapsed: 0.09 s
  12. VACUUM

9、建表,删表,使得CATALOG发生变化,产生CATALOG垃圾

  1. postgres=# create table c (id int);
  2. CREATE TABLE
  3. postgres=# drop table c;
  4. DROP TABLE
  5. postgres=# create table c (id int);
  6. CREATE TABLE
  7. postgres=# drop table c;
  8. DROP TABLE

10、垃圾回收catalog,无法回收SLOT后产生的CATALOG垃圾,因为还需要这个CATALOG版本去解析对应WAL的LOGICAL 日志

  1. postgres=# vacuum verbose pg_class;
  2. psql: INFO: vacuuming "pg_catalog.pg_class"
  3. psql: INFO: "pg_class": found 0 removable, 465 nonremovable row versions in 13 out of 13 pages
  4. DETAIL: 2 dead row versions cannot be removed yet, oldest xmin: 1982646
  5. There were 111 unused item identifiers.
  6. Skipped 0 pages due to buffer pins, 0 frozen pages.
  7. 0 pages are entirely empty.
  8. CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
  9. VACUUM

catalog 受影响

  1. postgres=# vacuum verbose pg_attribute ;
  2. psql: INFO: vacuuming "pg_catalog.pg_attribute"
  3. psql: INFO: "pg_attribute": found 0 removable, 293 nonremovable row versions in 6 out of 62 pages
  4. DETAIL: 14 dead row versions cannot be removed yet, oldest xmin: 1982646
  5. There were 55 unused item identifiers.
  6. Skipped 0 pages due to buffer pins, 55 frozen pages.
  7. 0 pages are entirely empty.
  8. CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
  9. VACUUM

11、长事务不影响其他库的垃圾回收

postgres

  1. postgres=# begin;
  2. BEGIN
  3. postgres=# delete from a;
  4. DELETE 1

db1

  1. db1=# create table b(id int);
  2. CREATE TABLE
  3. db1=# insert into b values (1);
  4. INSERT 0 1
  5. db1=# delete from b;
  6. DELETE 1
  7. db1=# vacuum verbose b;
  8. psql: INFO: vacuuming "public.b"
  9. psql: INFO: "b": removed 1 row versions in 1 pages
  10. psql: INFO: "b": found 1 removable, 0 nonremovable row versions in 1 out of 1 pages
  11. DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 1982671
  12. There were 0 unused item identifiers.
  13. Skipped 0 pages due to buffer pins, 0 frozen pages.
  14. 0 pages are entirely empty.
  15. CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
  16. psql: INFO: "b": truncated 1 to 0 pages
  17. DETAIL: CPU: user: 0.09 s, system: 0.00 s, elapsed: 0.09 s
  18. VACUUM

小结

1 全局catalog 膨胀点

哪些垃圾不能被回收?

1、年龄小于vacuum_defer_cleanup_age设置的垃圾记录

2、当前实例中最老事务快照之后产生的垃圾记录

3、SLOT catalog_xmin后产生的垃圾记录

4、备库开启了feedback后,备库中最老事务快照(包括catalog_xmin, global xmin)之后产生的垃圾记录

什么时候可能膨胀?

1、vacuum_defer_cleanup_age 设置太大

2、整个实例中的 : 长事务, 慢SQL, 慢2pc,

3、慢/dead slot(catalog_xmin, 影响catalog垃圾回收),

4、standby 开启了 feedback (且standby有慢事务, LONG SQL, 慢/dead slot),

2 库级catalog 膨胀点

哪些垃圾不能被回收?

1、年龄小于vacuum_defer_cleanup_age设置的垃圾记录

2、当前数据库中最老事务快照之后产生的垃圾记录

3、备库开启了feedback后,备库返回的最老事务快照(包括catalog_xmin, global xmin)之后产生的垃圾记录

4、SLOT catalog_xmin后产生的垃圾记录(create table, drop table, pg_class, pg_att等)。影响全局(所有DB)

什么时候可能膨胀?

1、vacuum_defer_cleanup_age 设置太大

2、当前数据库中的 : 长事务, 慢SQL, 慢2pc,

3、standby 开启了 feedback (且standby有慢事务, LONG SQL, 慢/dead slot),

4、慢/dead slot(catalog_xmin, 影响catalog垃圾回收),

普通对象 膨胀点

用户创建的表、物化视图、索引等。

哪些垃圾不能被回收?

1、年龄小于vacuum_defer_cleanup_age设置的垃圾记录

2、当前数据库中最老事务快照之后产生的垃圾记录

3、备库开启了feedback后,备库返回的最老事务快照(仅指 global xmin)之后产生的垃圾记录。(catalog xmin无影响)

什么时候可能膨胀?

1、vacuum_defer_cleanup_age 设置太大

2、当前数据库中的 : 长事务, 慢SQL, 慢2pc,

3、standby 开启了 feedback (且standby有慢事务, LONG SQL),

WAL文件 膨胀点

wal是指PG的REDO文件。

哪些WAL不能被回收 或 不能被重复利用?

1、从最后一次已正常结束的检查点(检查点开始时刻, 不是结束时刻)开始,所有的REDO文件都不能被回收

2、归档开启后,所有未归档的REDO。(.ready对应的redo文件)

3、启用SLOT后,还没有被SLOT消费的REDO文件

4、设置wal_keep_segments时,当REDO文件数还没有达到wal_keep_segments个时。

什么时候可能膨胀?

1、archive failed ,归档失败

2、user defined archive BUG,用户开启了归档,但是没有正常的将.ready改成.done,使得WAL堆积

3、wal_keep_segments 设置太大,WAL保留过多

4、max_wal_size设置太大,并且checkpoint_completion_target设置太大,导致检查点跨度很大,保留WAL文件很多

5、slot slow(dead) ,包括(physical | logical replication) , restart_lsn 开始的所有WAL文件都要被保留

参考

  1. switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf))
  2. {
  3. case HEAPTUPLE_DEAD:
  4. /*
  5. * Ordinarily, DEAD tuples would have been removed by
  6. * heap_page_prune(), but it's possible that the tuple
  7. * state changed since heap_page_prune() looked. In
  8. * particular an INSERT_IN_PROGRESS tuple could have
  9. * changed to DEAD if the inserter aborted. So this
  10. * cannot be considered an error condition.
  11. *
  12. * If the tuple is HOT-updated then it must only be
  13. * removed by a prune operation; so we keep it just as if
  14. * it were RECENTLY_DEAD. Also, if it's a heap-only
  15. * tuple, we choose to keep it, because it'll be a lot
  16. * cheaper to get rid of it in the next pruning pass than
  17. * to treat it like an indexed tuple. Finally, if index
  18. * cleanup is disabled, the second heap pass will not
  19. * execute, and the tuple will not get removed, so we must
  20. * treat it like any other dead tuple that we choose to
  21. * keep.
  22. *
  23. * If this were to happen for a tuple that actually needed
  24. * to be deleted, we'd be in trouble, because it'd
  25. * possibly leave a tuple below the relation's xmin
  26. * horizon alive. heap_prepare_freeze_tuple() is prepared
  27. * to detect that case and abort the transaction,
  28. * preventing corruption.
  29. */
  30. if (HeapTupleIsHotUpdated(&tuple) ||
  31. HeapTupleIsHeapOnly(&tuple) ||
  32. params->index_cleanup == VACOPT_TERNARY_DISABLED)
  33. nkeep += 1;
  34. else
  35. tupgone = true; /* we can delete the tuple */
  36. all_visible = false;
  37. break;
  38. case HEAPTUPLE_RECENTLY_DEAD:
  39. /*
  40. * If tuple is recently deleted then we must not remove it
  41. * from relation.
  42. */
  43. nkeep += 1;
  44. all_visible = false;
  45. break;

src/backend/access/heap/heapam_visibility.c

  1. * HeapTupleSatisfiesVacuum()
  2. * visible to any running transaction, used by VACUUM
  1. /*
  2. * HeapTupleSatisfiesVacuum
  3. *
  4. * Determine the status of tuples for VACUUM purposes. Here, what
  5. * we mainly want to know is if a tuple is potentially visible to *any*
  6. * running transaction. If so, it can't be removed yet by VACUUM.
  7. *
  8. * OldestXmin is a cutoff XID (obtained from GetOldestXmin()). Tuples
  9. * deleted by XIDs >= OldestXmin are deemed "recently dead"; they might
  10. * still be visible to some open transaction, so we can't remove them,
  11. * even if we see that the deleting transaction has committed.
  12. */
  13. HTSV_Result
  14. HeapTupleSatisfiesVacuum(HeapTuple htup, TransactionId OldestXmin,
  15. Buffer buffer)
  16. /*
  17. * Deleter committed, but perhaps it was recent enough that some open
  18. * transactions could still see the tuple.
  19. */
  20. if (!TransactionIdPrecedes(HeapTupleHeaderGetRawXmax(tuple), OldestXmin))
  21. return HEAPTUPLE_RECENTLY_DEAD;
  22. else if (TransactionIdDidCommit(xmax))
  23. {
  24. /*
  25. * The multixact might still be running due to lockers. If the
  26. * updater is below the xid horizon, we have to return DEAD
  27. * regardless -- otherwise we could end up with a tuple where the
  28. * updater has to be removed due to the horizon, but is not pruned
  29. * away. It's not a problem to prune that tuple, because any
  30. * remaining lockers will also be present in newer tuple versions.
  31. */
  32. if (!TransactionIdPrecedes(xmax, OldestXmin))
  33. return HEAPTUPLE_RECENTLY_DEAD;
  34. return HEAPTUPLE_DEAD;
  35. }

https://www.postgresql.org/docs/12/protocol-replication.html

  1. Hot Standby feedback message (F)
  2. Byte1('h')
  3. Identifies the message as a Hot Standby feedback message.
  4. Int64
  5. The client's system clock at the time of transmission, as microseconds since midnight on 2000-01-01.
  6. Int32
  7. The standby's current global xmin, excluding the catalog_xmin from any replication slots. If both this value and the following catalog_xmin are 0 this is treated as a notification that Hot Standby feedback will no longer be sent on this connection. Later non-zero messages may reinitiate the feedback mechanism.
  8. Int32
  9. The epoch of the global xmin xid on the standby.
  10. Int32
  11. The lowest catalog_xmin of any replication slots on the standby. Set to 0 if no catalog_xmin exists on the standby or if hot standby feedback is being disabled.
  12. Int32
  13. The epoch of the catalog_xmin xid on the standby.

《解读用户最常问的PostgreSQL垃圾回收、膨胀、多版本管理、存储引擎等疑惑 - 经典》

《PostgreSQL 自动vacuum(垃圾回收)与手动vacuum的参数详解》

《PostgreSQL 12 preview - heap table末端垃圾page是否在vacuum时被回收可控 - vacuum_truncate》

《PostgreSQL 垃圾回收参数优化之 - maintenance_work_mem , autovacuum_work_mem》

《Greenplum 垃圾回收、收集统计信息调度 - vacuum analyze 所有表 - 注意锁问题》

《PostgreSQL 垃圾版本引入的索引扫描性能下降诊断》

《Greenplum 清理垃圾、修改存储模式(行列变换) 平滑方法 - 交换数据、交换分区》

《Greenplum 列存表(AO表)的膨胀、垃圾检查与空间收缩(含修改分布键)》

《如何检测、清理Greenplum膨胀、垃圾(含修改分布键) - 阿里云HybridDB for PG最佳实践》

《影响或控制PostgreSQL垃圾回收的参数或因素》

《PostgreSQL 10.0 preview 功能增强 - SQL执行剩余时间 - 垃圾回收过程可视pg_stat_progress_vacuum》

《PostgreSQL物理”备库”的哪些操作或配置,可能影响”主库”的性能、垃圾回收、IO波动》

《PostgreSQL 老湿机图解平安科技遇到的垃圾回收”坑”》

《PostgreSQL垃圾回收代码分析 - why postgresql cann’t reclaim tuple is HEAPTUPLE_RECENTLY_DEAD》

《PostgreSQL 垃圾回收原理以及如何预防膨胀 - How to prevent object bloat in PostgreSQL》