背景

After years of groundwork (first background worker support, then dynamic shared memory for forked processes, and finally kernel-level support for parallel execution), PostgreSQL's multi-core parallel query capability finally shipped in version 9.6, released in 2016. It lifts PostgreSQL's scale-up ability another notch and marks a milestone for parallel query in open-source databases.

No doubt many of you have started testing it already.

In tests on a machine with 32 physical cores, compute-heavy workloads showed near-linear performance scaling.

Parallel execution currently covers full table scans, joins, and aggregates.

Part 1. Quick Installation of PostgreSQL 9.6

To get you onto 9.6 quickly, here is a short installation walkthrough.

OS preparation

  # yum -y install coreutils glib2 lrzsz sysstat e4fsprogs xfsprogs ntp readline-devel zlib zlib-devel openssl openssl-devel pam-devel libxml2-devel libxslt-devel python-devel tcl-devel gcc make smartmontools flex bison perl perl-devel perl-ExtUtils* openldap openldap-devel
  # vi /etc/sysctl.conf
  # add by digoal.zhou
  fs.aio-max-nr = 1048576
  fs.file-max = 76724600
  kernel.core_pattern= /data01/corefiles/core_%e_%u_%t_%s.%p
  # create /data01/corefiles beforehand, mode 777
  kernel.sem = 4096 2147483647 2147483646 512000
  # semaphores; inspect with ipcs -l or -u. PostgreSQL allocates one set of 17 semaphores per group of 16 processes.
  kernel.shmall = 107374182
  # limit on the combined size of all shared memory segments (80% of RAM recommended)
  kernel.shmmax = 274877906944
  # maximum size of a single shared memory segment (half of RAM recommended); versions > 9.2 use far less shared memory
  kernel.shmmni = 819200
  # how many shared memory segments may exist; each PostgreSQL cluster needs at least 2
  net.core.netdev_max_backlog = 10000
  net.core.rmem_default = 262144
  # The default setting of the socket receive buffer in bytes.
  net.core.rmem_max = 4194304
  # The maximum receive socket buffer size in bytes
  net.core.wmem_default = 262144
  # The default setting (in bytes) of the socket send buffer.
  net.core.wmem_max = 4194304
  # The maximum send socket buffer size in bytes.
  net.core.somaxconn = 4096
  net.ipv4.tcp_max_syn_backlog = 4096
  net.ipv4.tcp_keepalive_intvl = 20
  net.ipv4.tcp_keepalive_probes = 3
  net.ipv4.tcp_keepalive_time = 60
  net.ipv4.tcp_mem = 8388608 12582912 16777216
  net.ipv4.tcp_fin_timeout = 5
  net.ipv4.tcp_synack_retries = 2
  net.ipv4.tcp_syncookies = 1
  # enable SYN cookies: when the SYN backlog overflows, fall back to cookies; mitigates small-scale SYN floods
  net.ipv4.tcp_timestamps = 1
  # reduces TIME_WAIT
  net.ipv4.tcp_tw_recycle = 0
  # 1 enables fast recycling of TIME-WAIT sockets, but this can break connections behind NAT; keep it off on servers
  net.ipv4.tcp_tw_reuse = 1
  # allow TIME-WAIT sockets to be reused for new TCP connections
  net.ipv4.tcp_max_tw_buckets = 262144
  net.ipv4.tcp_rmem = 8192 87380 16777216
  net.ipv4.tcp_wmem = 8192 65536 16777216
  net.nf_conntrack_max = 1200000
  net.netfilter.nf_conntrack_max = 1200000
  vm.dirty_background_bytes = 409600000
  # once system-wide dirty pages reach this amount, the background flusher (pdflush or its successor) starts writing out pages dirtied more than (dirty_expire_centisecs/100) seconds ago
  vm.dirty_expire_centisecs = 3000
  # dirty pages older than this are flushed to disk; 3000 means 30 seconds
  vm.dirty_ratio = 95
  # if background flushing falls behind and dirty pages exceed 95% of RAM, user processes that write to disk (fsync, fdatasync, etc.) must flush dirty pages themselves
  # this effectively keeps user processes out of the flushing path; very useful with multiple instances per host under CGROUP IOPS limits
  vm.dirty_writeback_centisecs = 100
  # wake-up interval of the background flusher; 100 means 1 second
  vm.extra_free_kbytes = 4096000
  vm.min_free_kbytes = 2097152
  vm.mmap_min_addr = 65536
  vm.overcommit_memory = 0
  # allow modest overcommit on allocation; 1 means "always assume enough memory" and can be used on memory-constrained test machines
  vm.overcommit_ratio = 90
  # with overcommit_memory = 2, this ratio enters the calculation of how much memory may be committed
  vm.swappiness = 0
  # avoid swapping as much as possible
  vm.zone_reclaim_mode = 0
  # disable NUMA zone reclaim (or disable NUMA in the kernel)
  net.ipv4.ip_local_port_range = 40000 65535
  # range of automatically assigned local TCP/UDP ports
  # vm.nr_hugepages = 102352
  # huge pages are recommended when shared_buffers exceeds 64GB; page size is Hugepagesize in /proc/meminfo
  # sysctl -p
  # vi /etc/security/limits.conf
  * soft nofile 1024000
  * hard nofile 1024000
  * soft nproc unlimited
  * hard nproc unlimited
  * soft core unlimited
  * hard core unlimited
  * soft memlock unlimited
  * hard memlock unlimited
  # rm -f /etc/security/limits.d/*

Installation

  $ wget https://ftp.postgresql.org/pub/source/v9.6.0/postgresql-9.6.0.tar.bz2
  $ tar -jxvf postgresql-9.6.0.tar.bz2
  $ cd postgresql-9.6.0
  $ ./configure --prefix=/home/digoal/pgsql9.6.0
  $ make world -j 32
  $ make install-world -j 32
  $ vi ~/.bash_profile
  export PS1="$USER@`/bin/hostname -s`-> "
  export PGPORT=5281
  export PGDATA=/u02/digoal/pg_root$PGPORT
  export LANG=en_US.utf8
  export PGHOME=/home/digoal/pgsql9.6.0
  export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:$LD_LIBRARY_PATH
  export DATE=`date +"%Y%m%d%H%M"`
  export PATH=$PGHOME/bin:$PATH:.
  export MANPATH=$PGHOME/share/man:$MANPATH
  export PGHOST=$PGDATA
  export PGUSER=postgres
  export PGDATABASE=postgres
  alias rm='rm -i'
  alias ll='ls -lh'
  unalias vi
  $ . ~/.bash_profile
  $ df -h
  /dev/mapper/vgdata01-lv03
                4.0T  1.3T  2.8T  32% /u01
  /dev/mapper/vgdata01-lv04
                7.7T  899G  6.8T  12% /u02

Initialize the cluster

  $ initdb -D $PGDATA -E UTF8 --locale=C -U postgres -X /u01/digoal/pg_xlog$PGPORT

Configure database parameters

  $ cd $PGDATA
  $ vi postgresql.conf
  listen_addresses = '0.0.0.0'
  port = 5281
  max_connections = 800
  superuser_reserved_connections = 13
  unix_socket_directories = '.'
  unix_socket_permissions = 0700
  tcp_keepalives_idle = 60
  tcp_keepalives_interval = 10
  tcp_keepalives_count = 10
  shared_buffers = 128GB
  huge_pages = try
  maintenance_work_mem = 2GB
  dynamic_shared_memory_type = sysv
  vacuum_cost_delay = 0
  bgwriter_delay = 10ms
  bgwriter_lru_maxpages = 1000
  bgwriter_lru_multiplier = 10.0
  bgwriter_flush_after = 256
  max_worker_processes = 128
  max_parallel_workers_per_gather = 16
  old_snapshot_threshold = 8h
  backend_flush_after = 256
  synchronous_commit = off
  full_page_writes = off
  wal_buffers = 128MB
  wal_writer_delay = 10ms
  wal_writer_flush_after = 4MB
  checkpoint_timeout = 55min
  max_wal_size = 256GB
  checkpoint_flush_after = 1MB
  random_page_cost = 1.0
  effective_cache_size = 512GB
  constraint_exclusion = on
  log_destination = 'csvlog'
  logging_collector = on
  log_checkpoints = on
  log_connections = on
  log_disconnections = on
  log_error_verbosity = verbose
  log_timezone = 'PRC'
  autovacuum = on
  log_autovacuum_min_duration = 0
  autovacuum_max_workers = 8
  autovacuum_naptime = 10s
  autovacuum_vacuum_scale_factor = 0.02
  autovacuum_analyze_scale_factor = 0.01
  statement_timeout = 0
  lock_timeout = 0
  idle_in_transaction_session_timeout = 0
  gin_fuzzy_search_limit = 0
  gin_pending_list_limit = 4MB
  datestyle = 'iso, mdy'
  timezone = 'PRC'
  lc_messages = 'C'
  lc_monetary = 'C'
  lc_numeric = 'C'
  lc_time = 'C'
  default_text_search_config = 'pg_catalog.english'
  deadlock_timeout = 1s
  $ vi pg_hba.conf
  local all all trust
  host all all 127.0.0.1/32 trust
  host all all ::1/128 trust
  host all all 0.0.0.0/0 md5

Start the database

  $ pg_ctl start

Part 2. Parallel Query Parameters and Usage

1. Caps how many worker processes the whole cluster may run at once; this must be set.

  max_worker_processes = 128 # (change requires restart)

2. Caps how many parallel workers a single parallel exec node (Gather) may start. The planner also consults the table-level parameter parallel_workers, or PostgreSQL's built-in estimate derived from the table size, and uses the smaller of the two.

  max_parallel_workers_per_gather = 16 # taken from max_worker_processes

3. Cost parameters for parallel execution; if the parallel plan costs more than the serial one, parallelism is not used.

  #parallel_tuple_cost = 0.1 # same scale as above
  #parallel_setup_cost = 1000.0 # same scale as above

4. Tables smaller than this are not considered for parallel scans.

  #min_parallel_relation_size = 8MB

5. Tells the optimizer to force parallel mode.

  #force_parallel_mode = off

6. Table-level parameter: instead of estimating the degree of parallelism from the table size, tell the optimizer directly how many parallel workers to use for this table.

  parallel_workers (integer)
  This sets the number of workers that should be used to assist a parallel scan of this table.
  If not set, the system will determine a value based on the relation size.
  The actual number of workers chosen by the planner may be less, for example due to the setting of max_worker_processes.

Part 3. Test Scenario

In a tagging system there are usually many attributes, each represented by a tag; the simplest tag is 0 or 1, standing for false or true.

We can pack all the tags into bits. Suppose, for example, the system has 200 tags and 50 million users.

We can then select a target audience through bitwise operations on the tags.

That means operating on BIT values.

So: how does PostgreSQL's bit-operation performance hold up?
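To make the audience-selection idea concrete, here is a minimal sketch in plain Python (the tag layout, users, and mask values are hypothetical, purely for illustration): each user's tags are packed into an integer, and a user matches when the masked bits equal a target pattern, which mirrors the bitand(id, mask) = target predicate used in the SQL tests below.

```python
# Sketch of tag selection with bit masks (hypothetical layout: one integer
# per user, bit i set means tag i is true).

def match(tags: int, mask: int, target: int) -> bool:
    """A user matches when the masked tag bits equal the target pattern."""
    return (tags & mask) == target

# Three users with 8 hypothetical tags each.
users = {
    "alice": 0b10101010,
    "bob":   0b10101011,
    "carol": 0b00000000,
}

# Select users whose tags under mask 0b11 equal 0b10
# (tag 1 set, tag 0 clear).
selected = [name for name, t in users.items() if match(t, 0b11, 0b10)]
print(selected)  # ['alice']
```

The SQL version does exactly this over bit(200) columns, only at a scale of billions of rows.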

Part 4. Test 1 (data set larger than shared_buffers)

Create a test table with a single bit-string column for the tests that follow.

  postgres=# create unlogged table t_bit2 (id bit(200)) with (autovacuum_enabled=off, parallel_workers=128);
  CREATE TABLE

Insert 3.2 billion rows in parallel

  for ((i=1;i<=64;i++)) ; do psql -c "insert into t_bit2 select B'10101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010' from generate_series(1,50000000);" & done

A single table: 3.2 billion rows, 180 GB

  postgres=# \dt+
                      List of relations
   Schema |  Name  | Type  |  Owner   |  Size  | Description
  --------+--------+-------+----------+--------+-------------
   public | t_bit2 | table | postgres | 180 GB |

Full-table scan test

Non-parallel mode

  postgres=# set force_parallel_mode =off;
  SET
  postgres=# set max_parallel_workers_per_gather =0;
  SET
  postgres=# \timing
  Timing is on.

  Execution plan:

  postgres=# explain (verbose,costs) select * from t_bit2 where bitand(id, '10101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010')=B'10101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101011';
                             QUERY PLAN
   Seq Scan on public.t_bit2 (cost=0.00..71529415.52 rows=16000001 width=32)
     Output: id
     Filter: (bitand(t_bit2.id, B'10101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010'::"bit") = B'10101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101011'::"bit")
  (3 rows)

Result taken after three rounds of testing, to rule out cache effects.

  postgres=# explain (analyze,verbose,costs) select * from t_bit2 where bitand(id, '10101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010')=B'10101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101011';
   Seq Scan on public.t_bit2 (cost=0.00..71529415.52 rows=16000001 width=32) (actual time=0.033..1135403.694 rows=3200000000 loops=1)
     Output: id
     Filter: (bitand(t_bit2.id, B'10101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010'::"bit") = B'10101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101011'::"bit")
   Planning time: 0.576 ms
   Execution time: 1285437.199 ms
  (5 rows)
  Time: 1285438.195 ms

Parallel mode

  postgres=# set force_parallel_mode =on;
  postgres=# set max_parallel_workers_per_gather = 64;

Result taken after three rounds of testing, to rule out cache effects.

  postgres=# explain (analyze,verbose,costs) select * from t_bit2 where bitand(id, '10101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010')=B'10101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101011';
   Gather (cost=1000.00..26630413.18 rows=16000001 width=32) (actual time=30946.103..30946.103 rows=0 loops=1)
     Output: id
     Workers Planned: 32
     Workers Launched: 32
     -> Parallel Seq Scan on public.t_bit2 (cost=0.00..25029413.08 rows=500000 width=32) (actual time=30941.191..30941.191 rows=0 loops=33)
          Output: id
          Filter: (bitand(t_bit2.id, B'10101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010'::"bit") = B'10101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101011'::"bit")
          Rows Removed by Filter: 96969697
          Worker 0: actual time=30938.594..30938.594 rows=0 loops=1
          Worker 1: actual time=30939.353..30939.353 rows=0 loops=1
          Worker 2: actual time=30939.419..30939.419 rows=0 loops=1
          Worker 3: actual time=30939.574..30939.574 rows=0 loops=1
          Worker 4: actual time=30939.692..30939.692 rows=0 loops=1
          Worker 5: actual time=30939.825..30939.825 rows=0 loops=1
          Worker 6: actual time=30939.850..30939.850 rows=0 loops=1
          Worker 7: actual time=30940.028..30940.028 rows=0 loops=1
          Worker 8: actual time=30940.287..30940.287 rows=0 loops=1
          Worker 9: actual time=30940.466..30940.466 rows=0 loops=1
          Worker 10: actual time=30940.436..30940.436 rows=0 loops=1
          Worker 11: actual time=30940.649..30940.649 rows=0 loops=1
          Worker 12: actual time=30940.733..30940.733 rows=0 loops=1
          Worker 13: actual time=30940.818..30940.818 rows=0 loops=1
          Worker 14: actual time=30941.083..30941.083 rows=0 loops=1
          Worker 15: actual time=30941.086..30941.086 rows=0 loops=1
          Worker 16: actual time=30940.612..30940.612 rows=0 loops=1
          Worker 17: actual time=30941.342..30941.342 rows=0 loops=1
          Worker 18: actual time=30941.617..30941.617 rows=0 loops=1
          Worker 19: actual time=30941.667..30941.667 rows=0 loops=1
          Worker 20: actual time=30941.730..30941.730 rows=0 loops=1
          Worker 21: actual time=30941.207..30941.207 rows=0 loops=1
          Worker 22: actual time=30942.115..30942.115 rows=0 loops=1
          Worker 23: actual time=30942.049..30942.049 rows=0 loops=1
          Worker 24: actual time=30941.440..30941.440 rows=0 loops=1
          Worker 25: actual time=30942.361..30942.361 rows=0 loops=1
          Worker 26: actual time=30942.562..30942.562 rows=0 loops=1
          Worker 27: actual time=30942.430..30942.430 rows=0 loops=1
          Worker 28: actual time=30942.697..30942.697 rows=0 loops=1
          Worker 29: actual time=30942.577..30942.577 rows=0 loops=1
          Worker 30: actual time=30942.985..30942.985 rows=0 loops=1
          Worker 31: actual time=30942.356..30942.356 rows=0 loops=1
   Planning time: 0.061 ms
   Execution time: 32566.303 ms
  (42 rows)


Aggregate test

Non-parallel mode

  postgres=# set force_parallel_mode =off;
  postgres=# set max_parallel_workers_per_gather = 0;
  postgres=# select count(*) from t_bit2 where bitand(id, '10101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010')=B'10101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101011';
   count
  -------
       0
  (1 row)
  Time: 810115.643 ms

Parallel mode

  postgres=# set force_parallel_mode =on;
  postgres=# set max_parallel_workers_per_gather = 32;
  postgres=# select count(*) from t_bit2 where bitand(id, '10101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010')=B'10101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101011';
   count
  -------
       0
  (1 row)
  Time: 31805.820 ms


Part 5. Test 2 (data set smaller than shared_buffers)

Create a test table with a single bit-string column for the tests that follow.

  postgres=# create unlogged table t_bit1 (id bit(200)) with (autovacuum_enabled=off, parallel_workers=128);
  CREATE TABLE

Insert 1 billion rows in parallel

  for ((i=1;i<=50;i++)) ; do psql -c "insert into t_bit1 select B'10101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010' from generate_series(1,20000000);" & done

A single table: 1 billion rows, 56 GB

  postgres=# \dt+
                      List of relations
   Schema |  Name  | Type  |  Owner   | Size  | Description
  --------+--------+-------+----------+-------+-------------
   public | t_bit1 | table | postgres | 56 GB |

Aggregate test

Non-parallel mode

  postgres=# set force_parallel_mode =off;
  postgres=# set max_parallel_workers_per_gather = 0;
  postgres=# select count(*) from t_bit1 where bitand(id, '10101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010')=B'10101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101011';
  Time: 261679.060 ms

Parallel mode

  postgres=# set force_parallel_mode = on;
  postgres=# set max_parallel_workers_per_gather = 32;
  postgres=# select count(*) from t_bit1 where bitand(id, '10101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010')=B'10101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101011';
  Time: 9704.983 ms


Hash join test

100 million rows JOIN 50 million rows

  create unlogged table t1(id int, info text) with (autovacuum_enabled=off, parallel_workers=128);
  create unlogged table t2(id int, info text) with (autovacuum_enabled=off, parallel_workers=128);
  insert into t1 select generate_series(1,100000000);
  insert into t2 select generate_series(1,50000000);

Non-parallel mode

  postgres=# set force_parallel_mode =off;
  postgres=# set max_parallel_workers_per_gather = 0;
  postgres=# set enable_mergejoin=off;
  postgres=# explain verbose select count(*) from t1 join t2 using(id);
                                       QUERY PLAN
  ---------------------------------------------------------------------------------------
   Aggregate (cost=296050602904.73..296050602904.74 rows=1 width=8)
     Output: count(*)
     -> Hash Join (cost=963185.44..276314071764.57 rows=7894612456066 width=0)
          Hash Cond: (t1.id = t2.id)
          -> Seq Scan on public.t1 (cost=0.00..1004425.06 rows=56194706 width=4)
               Output: t1.id
          -> Hash (cost=502212.53..502212.53 rows=28097353 width=4)
               Output: t2.id
               -> Seq Scan on public.t2 (cost=0.00..502212.53 rows=28097353 width=4)
                    Output: t2.id
  (10 rows)
  postgres=# select count(*) from t1 join t2 using(id);
    count
  ----------
   50000000
  (1 row)
  Time: 60630.148 ms

Parallel mode

  postgres=# set force_parallel_mode = on;
  postgres=# set max_parallel_workers_per_gather = 32;
  postgres=# explain verbose select count(*) from t1 join t2 using(id);
                                            QUERY PLAN
  -----------------------------------------------------------------------------------------------------
   Finalize Aggregate (cost=28372817100.45..28372817100.46 rows=1 width=8)
     Output: count(*)
     -> Gather (cost=28372817097.16..28372817100.37 rows=32 width=8)
          Output: (PARTIAL count(*))
          Workers Planned: 32
          -> Partial Aggregate (cost=28372816097.16..28372816097.17 rows=1 width=8)
               Output: PARTIAL count(*)
               -> Hash Join (cost=963185.44..8636284956.99 rows=7894612456066 width=0)
                    Hash Cond: (t1.id = t2.id)
                    -> Parallel Seq Scan on public.t1 (cost=0.00..460038.85 rows=1756085 width=4)
                         Output: t1.id
                    -> Hash (cost=502212.53..502212.53 rows=28097353 width=4)
                         Output: t2.id
                         -> Seq Scan on public.t2 (cost=0.00..502212.53 rows=28097353 width=4)
                              Output: t2.id
  (15 rows)
  select count(*) from t1 join t2 using(id);
  Execution time: 50958.985 ms
  postgres=# set max_parallel_workers_per_gather = 4;
  select count(*) from t1 join t2 using(id);
  Time: 39386.647 ms


Do not set the degree of parallelism too high for joins: in the test above, 4 workers (39.4 s) beat 32 workers (51.0 s).

Part 6. Setting the Degree of Parallelism, with Source-Code Analysis

GUC variables
1. Caps how many worker processes the whole cluster may run at once; this must be set.

  max_worker_processes = 128 # (change requires restart)

2. Caps how many parallel workers a single parallel exec node (Gather) may start. The planner also consults the table-level parameter parallel_workers, or PostgreSQL's built-in estimate derived from the table size, and uses the smaller of the two.

  max_parallel_workers_per_gather = 16 # taken from max_worker_processes

If the table-level parallel_workers is also set, the final degree of parallelism is min(max_parallel_degree, parallel_degree):

  /*
   * Use the table parallel_degree, but don't go further than
   * max_parallel_degree.
   */
  parallel_degree = Min(rel->rel_parallel_degree, max_parallel_degree);

If the table's parallel_workers is not set, the degree is computed from the table size and the hard-coded parallel_threshold (see the function create_plain_partial_paths).

It is still capped by max_parallel_workers_per_gather; the smaller value wins, as noted earlier.

The code is as follows (it may differ slightly in the final release):

src/backend/optimizer/util/plancat.c

  void
  get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
                    RelOptInfo *rel)
  {
  ...
      /* Retrieve the parallel_degree reloption, if set. */
      rel->rel_parallel_degree = RelationGetParallelDegree(relation, -1);
  ...

src/include/utils/rel.h

  /*
   * RelationGetParallelDegree
   *      Returns the relation's parallel_degree.  Note multiple eval of argument!
   */
  #define RelationGetParallelDegree(relation, defaultpd) \
      ((relation)->rd_options ? \
       ((StdRdOptions *) (relation)->rd_options)->parallel_degree : (defaultpd))

src/backend/optimizer/path/allpaths.c

  /*
   * create_plain_partial_paths
   *      Build partial access paths for parallel scan of a plain relation
   */
  static void
  create_plain_partial_paths(PlannerInfo *root, RelOptInfo *rel)
  {
      int parallel_degree = 1;

      /*
       * If the user has set the parallel_degree reloption, we decide what to do
       * based on the value of that option.  Otherwise, we estimate a value.
       */
      if (rel->rel_parallel_degree != -1)
      {
          /*
           * If parallel_degree = 0 is set for this relation, bail out.  The
           * user does not want a parallel path for this relation.
           */
          if (rel->rel_parallel_degree == 0)
              return;

          /*
           * Use the table parallel_degree, but don't go further than
           * max_parallel_degree.
           */
          parallel_degree = Min(rel->rel_parallel_degree, max_parallel_degree);
      }
      else
      {
          int parallel_threshold = 1000;

          /*
           * If this relation is too small to be worth a parallel scan, just
           * return without doing anything ... unless it's an inheritance child.
           * In that case, we want to generate a parallel path here anyway.  It
           * might not be worthwhile just for this relation, but when combined
           * with all of its inheritance siblings it may well pay off.
           */
          if (rel->pages < parallel_threshold &&
              rel->reloptkind == RELOPT_BASEREL)
              return;

          /* no table-level degree set: derive one from the table size and
           * parallel_threshold */
          /*
           * Limit the degree of parallelism logarithmically based on the size
           * of the relation.  This probably needs to be a good deal more
           * sophisticated, but we need something here for now.
           */
          while (rel->pages > parallel_threshold * 3 &&
                 parallel_degree < max_parallel_degree)
          {
              parallel_degree++;
              parallel_threshold *= 3;
              if (parallel_threshold >= PG_INT32_MAX / 3)
                  break;
          }
      }

      /* Add an unordered partial path based on a parallel sequential scan. */
      add_partial_path(rel, create_seqscan_path(root, rel, NULL, parallel_degree));
  }
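The size-based estimate in create_plain_partial_paths can be paraphrased in Python (a sketch under stated assumptions, not PostgreSQL code: constants copied from the C snippet, 8 KB pages assumed, and the 0 return here stands for "no parallel path"):

```python
# Sketch of the planner's logarithmic worker estimate: the degree grows by
# one each time the table size triples past the threshold, capped by
# max_parallel_degree.

PG_INT32_MAX = 2**31 - 1

def estimated_parallel_degree(pages: int, max_parallel_degree: int) -> int:
    """Mimic the while-loop in create_plain_partial_paths."""
    parallel_threshold = 1000
    if pages < parallel_threshold:
        return 0  # too small to be worth a parallel scan
    degree = 1
    while pages > parallel_threshold * 3 and degree < max_parallel_degree:
        degree += 1
        parallel_threshold *= 3
        if parallel_threshold >= PG_INT32_MAX // 3:
            break
    return degree

# The 180 GB table above is about 23.6 million 8 KB pages:
print(estimated_parallel_degree(180 * 2**30 // 8192, 16))  # 10
```

So even with max_parallel_workers_per_gather = 16, a 180 GB table without a table-level parallel_workers setting would be planned with only 10 workers under this formula.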

3. Cost parameters for parallel execution; if the parallel plan costs more than the serial one, parallelism is not used.

  #parallel_tuple_cost = 0.1 # same scale as above
  #parallel_setup_cost = 1000.0 # same scale as above

4. Tables smaller than this are not considered for parallel scans.

  #min_parallel_relation_size = 8MB

5. Tells the optimizer to force parallel mode.

  #force_parallel_mode = off

Table-level parameter
6. Instead of estimating the degree of parallelism from the table size, tell the optimizer directly how many parallel workers to use for this table.

  parallel_workers (integer)
  This sets the number of workers that should be used to assist a parallel scan of this table.
  If not set, the system will determine a value based on the relation size.
  The actual number of workers chosen by the planner may be less, for example due to the setting of max_worker_processes.

Part 7. References

1. http://www.postgresql.org/docs/9.6/static/sql-createtable.html

  parallel_workers (integer)
  This sets the number of workers that should be used to assist a parallel scan of this table.
  If not set, the system will determine a value based on the relation size.
  The actual number of workers chosen by the planner may be less, for example due to the setting of max_worker_processes.

2. http://www.postgresql.org/docs/9.6/static/runtime-config-query.html#RUNTIME-CONFIG-QUERY-OTHER

  force_parallel_mode (enum)
  Allows the use of parallel queries for testing purposes even in cases where no performance benefit is expected.
  The allowed values of force_parallel_mode are off (use parallel mode only when it is expected to improve performance),
  on (force parallel query for all queries for which it is thought to be safe),
  and regress (like on, but with additional behavior changes as explained below).
  More specifically, setting this value to on will add a Gather node to the top of any query plan for which this appears to be safe,
  so that the query runs inside of a parallel worker. Even when a parallel worker is not available or cannot be used,
  operations such as starting a subtransaction that would be prohibited in a parallel query context will be prohibited unless the planner believes that this will cause the query to fail.
  If failures or unexpected results occur when this option is set, some functions used by the query may need to be marked PARALLEL UNSAFE (or, possibly, PARALLEL RESTRICTED).
  Setting this value to regress has all of the same effects as setting it to on plus some additional effects that are intended to facilitate automated regression testing. Normally,
  messages from a parallel worker include a context line indicating that, but a setting of regress suppresses this line so that the output is the same as in non-parallel execution.
  Also, the Gather nodes added to plans by this setting are hidden in EXPLAIN output so that the output matches what would be obtained if this setting were turned off.

3. http://www.postgresql.org/docs/9.6/static/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-ASYNC-BEHAVIOR

  max_worker_processes (integer)
  Sets the maximum number of background processes that the system can support. This parameter can only be set at server start. The default is 8.
  When running a standby server, you must set this parameter to the same or higher value than on the master server. Otherwise, queries will not be allowed in the standby server.
  max_parallel_workers_per_gather (integer)
  Sets the maximum number of workers that can be started by a single Gather node. Parallel workers are taken from the pool of processes established by max_worker_processes.
  Note that the requested number of workers may not actually be available at run time. If this occurs, the plan will run with fewer workers than expected,
  which may be inefficient. Setting this value to 0, which is the default, disables parallel query execution.
  Note that parallel queries may consume very substantially more resources than non-parallel queries,
  because each worker process is a completely separate process which has roughly the same impact on the system as an additional user session.
  This should be taken into account when choosing a value for this setting, as well as when configuring other settings that control resource utilization, such as work_mem.
  Resource limits such as work_mem are applied individually to each worker, which means the total utilization may be much higher across all processes than it would normally be for any single process.
  For example, a parallel query using 4 workers may use up to 5 times as much CPU time, memory, I/O bandwidth, and so forth as a query which uses no workers at all.
  For more information on parallel query, see Chapter 15.

4. http://www.postgresql.org/docs/9.6/static/runtime-config-query.html#RUNTIME-CONFIG-QUERY-CONSTANTS

  parallel_setup_cost (floating point)
  Sets the planner's estimate of the cost of launching parallel worker processes. The default is 1000.
  parallel_tuple_cost (floating point)
  Sets the planner's estimate of the cost of transferring one tuple from a parallel worker process to another process.
  The default is 0.1.
  min_parallel_relation_size (integer)
  Sets the minimum size of relations to be considered for parallel scan. The default is 8 megabytes (8MB).

Optimizer parameters that govern parallel execution

PostgreSQL uses these parameters to decide whether to use parallelism and how many worker processes to start.

1. max_worker_processes (integer)

Obviously, this parameter caps how many worker processes the whole cluster can start. Note that on a standby server it must be set greater than or equal to the primary's value.

Setting it to 0 disables parallelism.

  Sets the maximum number of background processes that the system can support.
  This parameter can only be set at server start.
  The default is 8.
  When running a standby server, you must set this parameter to the same or higher value than on the master server.
  Otherwise, queries will not be allowed in the standby server.

2. max_parallel_workers_per_gather (integer)
This parameter caps how many worker processes each Gather node may start.

Also note: in an OLTP system, do not set it too high, because every worker consumes its own work_mem and similar resources, and contention can get fierce.

Parallelism is better suited to OLAP; schedule jobs carefully so parallel queries do not collide.

  Sets the maximum number of workers that can be started by a single Gather node.
  Parallel workers are taken from the pool of processes established by max_worker_processes.
  Note that the requested number of workers may not actually be available at run time. -- some worker processes may already be in use; what remains available is max_worker_processes minus those in use.
  If this occurs, the plan will run with fewer workers than expected, which may be inefficient.
  The default value is 2.
  Setting this value to 0 disables parallel query execution.
  Note that parallel queries may consume very substantially more resources than non-parallel queries, because each worker process is a completely separate process which has roughly the same impact on the system as an additional user session.
  This should be taken into account when choosing a value for this setting, as well as when configuring other settings that control resource utilization, such as work_mem.
  Resource limits such as work_mem are applied individually to each worker, which means the total utilization may be much higher across all processes than it would normally be for any single process.
  For example, a parallel query using 4 workers may use up to 5 times as much CPU time, memory, I/O bandwidth, and so forth as a query which uses no workers at all.

Example: in the WITH query below, two subqueries are parallelized. Although max_parallel_workers_per_gather=6, max_worker_processes is only 8, so the first Gather node got 6 worker processes while the other Gather could actually launch only 2.
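The worker-pool arithmetic in this example can be sketched as follows (a simplified model, not PostgreSQL code: it assumes workers are granted first-come-first-served from the shared pool):

```python
# Sketch of worker-pool allocation: each Gather asks for up to
# max_parallel_workers_per_gather workers, but can only launch whatever
# is left of the max_worker_processes pool.

def launch_workers(requests, max_worker_processes):
    """Return how many workers each Gather actually launches."""
    free = max_worker_processes
    launched = []
    for want in requests:
        got = min(want, free)
        free -= got
        launched.append(got)
    return launched

# Two Gathers each planned for 6 workers, with only 8 in the pool:
print(launch_workers([6, 6], 8))  # [6, 2]
```

With a pool of 16 or more, both Gathers would get their full 6 workers.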

  1. postgres=# show max_worker_processes ;
  2. max_worker_processes
  3. ----------------------
  4. 8
  5. (1 row)
  6. postgres=# set max_parallel_workers_per_gather=6;
  7. SET
  8. postgres=# explain (analyze,verbose,costs,timing,buffers) with t as (select count(*) from test), t1 as (select count(id) from test) select * from t,t1;
  9. QUERY PLAN
  10. ------------------------------------------------------------------------------------------------------------------------------------------------------------------
  11. Nested Loop (cost=159471.81..159471.86 rows=1 width=16) (actual time=7763.033..7763.036 rows=1 loops=1)
  12. Output: t.count, t1.count
  13. Buffers: shared hit=32940 read=74784
  14. CTE t
  15. -> Finalize Aggregate (cost=79735.90..79735.91 rows=1 width=8) (actual time=4714.114..4714.115 rows=1 loops=1)
  16. Output: count(*)
  17. Buffers: shared hit=16564 read=37456
  18. -> Gather (cost=79735.27..79735.88 rows=6 width=8) (actual time=4714.016..4714.102 rows=7 loops=1)
  19. Output: (PARTIAL count(*))
  20. Workers Planned: 6
  21. Workers Launched: 6
  22. Buffers: shared hit=16564 read=37456
  23. -> Partial Aggregate (cost=78735.27..78735.28 rows=1 width=8) (actual time=4709.465..4709.466 rows=1 loops=7)
  24. Output: PARTIAL count(*)
  25. Buffers: shared hit=16084 read=37456
  26. Worker 0: actual time=4709.146..4709.146 rows=1 loops=1
  27. Buffers: shared hit=2167 read=5350
  28. Worker 1: actual time=4708.156..4708.156 rows=1 loops=1
  29. Buffers: shared hit=2140 read=5288
  30. Worker 2: actual time=4708.370..4708.370 rows=1 loops=1
  31. Buffers: shared hit=2165 read=4990
  32. Worker 3: actual time=4708.968..4708.969 rows=1 loops=1
  33. Buffers: shared hit=2501 read=5529
  34. Worker 4: actual time=4709.194..4709.195 rows=1 loops=1
  35. Buffers: shared hit=2469 read=5473
  36. Worker 5: actual time=4708.812..4708.813 rows=1 loops=1
  37. Buffers: shared hit=2155 read=5349
  38. -> Parallel Seq Scan on public.test (cost=0.00..73696.22 rows=2015622 width=0) (actual time=0.051..2384.380 rows=1728571 loops=7)
  39. Buffers: shared hit=16084 read=37456
  40. Worker 0: actual time=0.046..2385.108 rows=1698802 loops=1
  41. Buffers: shared hit=2167 read=5350
  42. Worker 1: actual time=0.057..2384.698 rows=1678728 loops=1
  43. Buffers: shared hit=2140 read=5288
  44. Worker 2: actual time=0.061..2384.109 rows=1617030 loops=1
  45. Buffers: shared hit=2165 read=4990
  46. Worker 3: actual time=0.046..2387.143 rows=1814780 loops=1
  47. Buffers: shared hit=2501 read=5529
  48. Worker 4: actual time=0.046..2382.491 rows=1794892 loops=1
  49. Buffers: shared hit=2469 read=5473
  50. Worker 5: actual time=0.070..2383.598 rows=1695904 loops=1
  51. Buffers: shared hit=2155 read=5349
  52. CTE t1
  53. -> Finalize Aggregate (cost=79735.90..79735.91 rows=1 width=8) (actual time=3048.902..3048.902 rows=1 loops=1)
  54. Output: count(test_1.id)
  55. Buffers: shared hit=16376 read=37328
  56. -> Gather (cost=79735.27..79735.88 rows=6 width=8) (actual time=3048.732..3048.880 rows=3 loops=1)
  57. Output: (PARTIAL count(test_1.id))
  58. Workers Planned: 6
  59. Workers Launched: 2
  60. Buffers: shared hit=16376 read=37328
  61. -> Partial Aggregate (cost=78735.27..78735.28 rows=1 width=8) (actual time=3046.399..3046.400 rows=1 loops=3)
  62. Output: PARTIAL count(test_1.id)
  63. Buffers: shared hit=16212 read=37328
  64. Worker 0: actual time=3045.394..3045.395 rows=1 loops=1
  65. Buffers: shared hit=5352 read=12343
  66. Worker 1: actual time=3045.339..3045.340 rows=1 loops=1
  67. Buffers: shared hit=5354 read=12402
  68. -> Parallel Seq Scan on public.test test_1 (cost=0.00..73696.22 rows=2015622 width=4) (actual time=0.189..1614.261 rows=4033333 loops=3)
  69. Output: test_1.id
  70. Buffers: shared hit=16212 read=37328
  71. Worker 0: actual time=0.039..1617.258 rows=3999030 loops=1
  72. Buffers: shared hit=5352 read=12343
  73. Worker 1: actual time=0.033..1610.934 rows=4012856 loops=1
  74. Buffers: shared hit=5354 read=12402
  75. -> CTE Scan on t (cost=0.00..0.02 rows=1 width=8) (actual time=4714.120..4714.121 rows=1 loops=1)
  76. Output: t.count
  77. Buffers: shared hit=16564 read=37456
  78. -> CTE Scan on t1 (cost=0.00..0.02 rows=1 width=8) (actual time=3048.907..3048.908 rows=1 loops=1)
  79. Output: t1.count
  80. Buffers: shared hit=16376 read=37328
  81. Planning time: 0.144 ms
  82. Execution time: 7766.458 ms
  83. (72 rows)

3. parallel_setup_cost (floating point)

This is the planner's estimated cost of starting worker processes. Launching a worker involves setting up shared memory and other work, so it is extra overhead on top of the scan itself.

  1. Sets the planner's estimate of the cost of launching parallel worker processes.
  2. The default is 1000.

4. parallel_tuple_cost (floating point)

Tuples produced by a worker process must be handed to the upper node, i.e. this is the cost of exchanging one row between processes; it is multiplied by the node's estimated output rows.

  1. Sets the planner's estimate of the cost of transferring one tuple from a parallel worker process to another process.
  2. The default is 0.1.

The relevant source code (src/backend/optimizer/path/costsize.c):

  1. parallel_tuple_cost : Cost of CPU time to pass a tuple from worker to master backend
  2. parallel_setup_cost : Cost of setting up shared memory for parallelism
  3. //
  4. double parallel_tuple_cost = DEFAULT_PARALLEL_TUPLE_COST;
  5. double parallel_setup_cost = DEFAULT_PARALLEL_SETUP_COST;
  6. //
  7. /*
  8. * cost_gather
  9. * Determines and returns the cost of gather path.
  10. *
  11. * 'rel' is the relation to be operated upon
  12. * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
  13. * 'rows' may be used to point to a row estimate; if non-NULL, it overrides
  14. * both 'rel' and 'param_info'. This is useful when the path doesn't exactly
  15. * correspond to any particular RelOptInfo.
  16. */
  17. void
  18. cost_gather(GatherPath *path, PlannerInfo *root,
  19. RelOptInfo *rel, ParamPathInfo *param_info,
  20. double *rows)
  21. {
  22. Cost startup_cost = 0;
  23. Cost run_cost = 0;
  24. //
  25. /* Mark the path with the correct row estimate */
  26. if (rows)
  27. path->path.rows = *rows;
  28. else if (param_info)
  29. path->path.rows = param_info->ppi_rows;
  30. else
  31. path->path.rows = rel->rows;
  32. //
  33. startup_cost = path->subpath->startup_cost;
  34. //
  35. run_cost = path->subpath->total_cost - path->subpath->startup_cost;
  36. //
  37. /* Parallel setup and communication cost. */
  38. startup_cost += parallel_setup_cost; // add the parallel setup cost
  39. run_cost += parallel_tuple_cost * path->path.rows; // add the cost of shipping tuples from workers to the upper process
  40. //
  41. path->path.startup_cost = startup_cost;
  42. path->path.total_cost = (startup_cost + run_cost);
  43. }

5. min_parallel_relation_size (integer)
The size of a table is also a criterion for parallelism: a relation smaller than this value will not get a parallel scan.

Note, however, that other conditions apply as well, so a table smaller than this value is not guaranteed to be excluded from parallelism (for example, inheritance children are still considered).

  1. Sets the minimum size of relations to be considered for parallel scan.
  2. The default is 8 megabytes (8MB).

The relevant source code:
src/backend/optimizer/path/allpaths.c

  1. /*
  2. * create_plain_partial_paths
  3. * Build partial access paths for parallel scan of a plain relation
  4. */
  5. static void
  6. create_plain_partial_paths(PlannerInfo *root, RelOptInfo *rel)
  7. {
  8. int parallel_workers;
  9. //
  10. /*
  11. * If the user has set the parallel_workers reloption, use that; otherwise
  12. * select a default number of workers.
  13. */
  14. if (rel->rel_parallel_workers != -1) // if the table-level parallel_workers storage parameter is set, use it directly as the degree of parallelism
  15. parallel_workers = rel->rel_parallel_workers;
  16. else // otherwise derive a suitable degree of parallelism from the table's size
  17. {
  18. int parallel_threshold;
  19. //
  20. /*
  21. * If this relation is too small to be worth a parallel scan, just
  22. * return without doing anything ... unless it's an inheritance child.
  23. * In that case, we want to generate a parallel path here anyway. It
  24. * might not be worthwhile just for this relation, but when combined
  25. * with all of its inheritance siblings it may well pay off.
  26. */
  27. if (rel->pages < (BlockNumber) min_parallel_relation_size &&
  28. rel->reloptkind == RELOPT_BASEREL) // if the table is smaller than min_parallel_relation_size (measured in blocks), do not parallelize
  29. return;
  30. //
  31. /*
  32. * Select the number of workers based on the log of the size of the
  33. * relation. This probably needs to be a good deal more
  34. * sophisticated, but we need something here for now. Note that the
  35. * upper limit of the min_parallel_relation_size GUC is chosen to
  36. * prevent overflow here.
  37. */
  38. // The heuristic below is still crude: it derives the number of workers from the table size, as follows
  39. parallel_workers = 1;
  40. parallel_threshold = Max(min_parallel_relation_size, 1);
  41. while (rel->pages >= (BlockNumber) (parallel_threshold * 3))
  42. {
  43. parallel_workers++;
  44. parallel_threshold *= 3;
  45. if (parallel_threshold > INT_MAX / 3)
  46. break; /* avoid overflow */
  47. }
  48. }
  49. //
  50. /*
  51. * In no case use more than max_parallel_workers_per_gather workers.
  52. */
  53. parallel_workers = Min(parallel_workers, max_parallel_workers_per_gather); // cap the computed value at max_parallel_workers_per_gather; the smaller of the two becomes the degree of parallelism
  54. //
  55. /* If any limit was set to zero, the user doesn't want a parallel scan. */
  56. if (parallel_workers <= 0)
  57. return;
  58. //
  59. /* Add an unordered partial path based on a parallel sequential scan. */
  60. add_partial_path(rel, create_seqscan_path(root, rel, NULL, parallel_workers)); // add a partial path (parallel seq scan) with the computed number of workers
  61. }

6. force_parallel_mode (enum)

Forces parallel mode; useful for testing, and can also serve as a hint-style override.

  1. Allows the use of parallel queries for testing purposes even in cases where no performance benefit is expected.
  2. The allowed values of force_parallel_mode are
  3. off (use parallel mode only when it is expected to improve performance),
  4. on (force parallel query for all queries for which it is thought to be safe),
  5. regress (like on, but with additional behavior changes as explained below).
  6. More specifically, setting this value to on will add a Gather node to the top of any query plan for which this appears to be safe, so that the query runs inside of a parallel worker.
  7. Even when a parallel worker is not available or cannot be used, operations such as starting a subtransaction that would be prohibited in a parallel query context will be prohibited unless the planner believes that this will cause the query to fail.
  8. If failures or unexpected results occur when this option is set, some functions used by the query may need to be marked PARALLEL UNSAFE (or, possibly, PARALLEL RESTRICTED).
  9. Setting this value to regress has all of the same effects as setting it to on plus some additional effects that are intended to facilitate automated regression testing.
  10. Normally, messages from a parallel worker include a context line indicating that, but a setting of regress suppresses this line so that the output is the same as in non-parallel execution.
  11. Also, the Gather nodes added to plans by this setting are hidden in EXPLAIN output so that the output matches what would be obtained if this setting were turned off.

7. parallel_workers (integer)
The parameters above are instance-level; parallel_workers is a table-level storage parameter that can be set at CREATE TABLE time or changed later.

See create_plain_partial_paths(), shown above, for how it is used.

  1. create table ... WITH( storage parameter ... )
  2. This sets the number of workers that should be used to assist a parallel scan of this table.
  3. If not set, the system will determine a value based on the relation size.
  4. The actual number of workers chosen by the planner may be less, for example due to the setting of max_worker_processes.

Example

  1. -- set the table-level degree of parallelism (4 is just an illustrative value)
  2. alter table test set (parallel_workers=4);
  3. -- disable parallelism for this table
  4. alter table test set (parallel_workers=0);
  5. -- reset the parameter, so create_plain_partial_paths derives a degree from the table's pages
  6. alter table test reset (parallel_workers);
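The current table-level setting can be inspected in the catalog (a sketch, assuming the test table from the earlier examples; reloptions is NULL when no storage parameter is set):

```sql
select relname, reloptions
from pg_class
where relname = 'test';
```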

How the PG optimizer decides whether to parallelize, and how it picks the degree of parallelism

All of this came up while discussing the parameters above; here is a summary.

1. How many worker processes the whole instance may start
max_worker_processes

2. The cost of parallel execution; the optimizer decides whether to parallelize based on CBO principles
parallel_setup_cost
parallel_tuple_cost

So for a simple query whose cost is already low (for example, lower than the startup cost of parallelism), the database obviously will not run it in parallel.

To force parallelism anyway, consider setting both parameters to 0.
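For example, to coax the planner into parallelizing a small query during testing (session-level only; a sketch assuming the test table from earlier):

```sql
set parallel_setup_cost = 0;
set parallel_tuple_cost = 0;
set max_parallel_workers_per_gather = 4;
explain select count(*) from test;  -- the plan should now contain a Gather node
```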

3. The switch that forces parallel mode
force_parallel_mode
When the parallel cost computed in step 2 exceeds the non-parallel cost, this setting makes the optimizer choose a parallel plan anyway.

4. The table-level parallel_workers parameter determines each Gather node's degree of parallelism:
min(parallel_workers, max_parallel_workers_per_gather)

5. When the table has no parallel_workers setting and is larger than min_parallel_relation_size, an algorithm picks each Gather node's degree of parallelism.
Relevant parameter: min_parallel_relation_size
Algorithm: see create_plain_partial_paths
Result: Min(computed workers, max_parallel_workers_per_gather)

Note that in practice, how many workers each Gather can launch also depends on how many worker processes the cluster still has available, so the actual number may be lower than what the optimizer computed, as the earlier example showed.

6. Users can also steer the optimizer toward forced parallelism with hints; see the pg_hint_plan extension.
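As a sketch of the hint-based approach (the Parallel hint requires a pg_hint_plan build that supports it; verify the exact syntax against your installed version's documentation):

```sql
load 'pg_hint_plan';
/*+ Parallel(test 4 hard) */
explain select count(*) from test;
```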