Preface

Amazon's Aurora database engine supports an architecture with a single copy of storage, one writer instance, and multiple readers. Like Oracle RAC, it is built on shared storage, but only one instance may perform writes while the other instances are read-only. Compared with the traditional replication-based one-master-many-replicas design, this saves both storage and network bandwidth.

We can use PostgreSQL's hot standby mode to simulate this shared-storage, one-writer-multiple-readers architecture, but a few caveats apply. A hot standby also writes to the database: during recovery, for example, it modifies the control file, data files, and so on, and in a shared-storage setup these writes are redundant. In addition, a lot of state lives in memory, so the in-memory state must be kept up to date as well.

Other things to watch out for:

  1. pg_xlog
  2. pg_log
  3. pg_clog
  4. pg_multixact
  5. postgresql.conf
  6. recovery.conf
  7. postmaster.pid

To arrive at a workable one-writer-multiple-readers architecture, the PostgreSQL kernel has to be modified:

  1. These files should exist once per instance: postgresql.conf, recovery.conf, postmaster.pid, pg_control.
  2. The hot standby must not perform actual recovery writes, but it still has to update its own in-memory state (the current OID, XID, etc.) as well as its own pg_control.
  3. Across instances, dirty OS pages and dirty pages in the database shared buffers must be synchronized from the primary to the standby nodes.
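As a rough illustration of point 1, the per-instance private files could be kept in one subdirectory per instance on the shared volume. This is only a sketch; the paths and instance names below are assumptions, not part of any real layout:

```shell
# Sketch: each instance owns private copies of the mutable files,
# while the data files stay shared. All paths here are hypothetical.
instance_files() {
  local inst="$1"
  for f in postgresql.conf recovery.conf postmaster.pid pg_control; do
    echo "/data/pgdata/instances/$inst/$f"
  done
}
instance_files ro1
```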

Simulating the architecture

Without changing any code, starting multiple instances on the same host for testing runs into several problems. (The problems, and the code changes that fix them, are described later.)

Primary instance configuration files:

  # vi postgresql.conf
  listen_addresses='0.0.0.0'
  port=1921
  max_connections=100
  unix_socket_directories='.'
  ssl=on
  ssl_ciphers='EXPORT40'
  shared_buffers=512MB
  huge_pages=try
  max_prepared_transactions=0
  max_stack_depth=100kB
  dynamic_shared_memory_type=posix
  max_files_per_process=500
  wal_level=logical
  fsync=off
  synchronous_commit=off
  wal_sync_method=open_datasync
  full_page_writes=off
  wal_log_hints=off
  wal_buffers=16MB
  wal_writer_delay=10ms
  checkpoint_segments=8
  archive_mode=off
  archive_command='/bin/date'
  max_wal_senders=10
  max_replication_slots=10
  hot_standby=on
  wal_receiver_status_interval=1s
  hot_standby_feedback=on
  enable_bitmapscan=on
  enable_hashagg=on
  enable_hashjoin=on
  enable_indexscan=on
  enable_material=on
  enable_mergejoin=on
  enable_nestloop=on
  enable_seqscan=on
  enable_sort=on
  enable_tidscan=on
  log_destination='csvlog'
  logging_collector=on
  log_directory='pg_log'
  log_truncate_on_rotation=on
  log_rotation_size=10MB
  log_checkpoints=on
  log_connections=on
  log_disconnections=on
  log_duration=off
  log_error_verbosity=verbose
  log_line_prefix='%i'
  log_statement='none'
  log_timezone='PRC'
  autovacuum=on
  log_autovacuum_min_duration=0
  autovacuum_vacuum_scale_factor=0.0002
  autovacuum_analyze_scale_factor=0.0001
  datestyle='iso, mdy'
  timezone='PRC'
  lc_messages='C'
  lc_monetary='C'
  lc_numeric='C'
  lc_time='C'
  default_text_search_config='pg_catalog.english'
  # vi recovery.done
  recovery_target_timeline='latest'
  standby_mode=on
  primary_conninfo = 'host=127.0.0.1 port=1921 user=postgres keepalives_idle=60'
  # vi pg_hba.conf
  local replication postgres trust
  host replication postgres 127.0.0.1/32 trust

Start the primary instance.

  postgres@digoal-> pg_ctl start

To start the read-only instance, postmaster.pid must be deleted first. Note that newer PostgreSQL versions carry a patch that shuts the database down automatically when this file is removed, so take care to use an older release, or revert that patch first.

  postgres@digoal-> cd $PGDATA
  postgres@digoal-> mv recovery.done recovery.conf
  postgres@digoal-> rm -f postmaster.pid
  postgres@digoal-> pg_ctl start -o "-c log_directory=pg_log1922 -c port=1922"
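The start-up steps above can be wrapped in a small helper. A sketch, assuming the per-instance log directory follows the pg_log&lt;port&gt; naming used here (the helper name is hypothetical):

```shell
# Sketch: build the pg_ctl command line for an extra read-only instance
# on a given port, following the pg_log<port> convention used above.
ro_start_cmd() {
  echo "pg_ctl start -o \"-c log_directory=pg_log$1 -c port=$1\""
}
ro_start_cmd 1922
```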

Check the current control file state: the read-only instance has modified the control file, consistent with the earlier description.

  postgres@digoal-> pg_controldata |grep state
  Database cluster state: in archive recovery

Connect to the primary instance, create a table, and insert some test data.

  psql -p 1921
  postgres=# create table test1(id int);
  CREATE TABLE
  postgres=# insert into test1 select generate_series(1,10);
  INSERT 0 10

Query the inserted data from the read-only instance.

  postgres@digoal-> psql -h 127.0.0.1 -p 1922
  postgres=# select * from test1;
   id
  ----
    1
    2
    3
    4
    5
    6
    7
    8
    9
   10
  (10 rows)

After the primary instance runs a checkpoint, the control file state flips back to "in production".

  psql -p 1921
  postgres=# checkpoint;
  CHECKPOINT
  postgres@digoal-> pg_controldata |grep state
  Database cluster state: in production

But once a checkpoint is run on the read-only instance, the state changes back to recovery again.

  postgres@digoal-> psql -h 127.0.0.1 -p 1922
  psql (9.4.4)
  postgres=# checkpoint;
  CHECKPOINT
  postgres@digoal-> pg_controldata |grep state
  Database cluster state: in archive recovery

Note that the example above has a problem: with streaming replication, XLOG records are copied from the primary over the network and written again at the same offsets of XLOG segments that have already been written. This matters because it can make the primary see inconsistent data: if a block was modified several times, the read-only instance's recovery may overwrite it back to an older version, and the primary would then read the stale version of the block. We fix this later by forbidding the read-only instances from applying recovery at all.

On the other hand, we know a PostgreSQL standby can fetch XLOG for recovery from three sources (streaming, pg_xlog, and restore_command), so in a shared-storage environment there is no need for streaming replication at all; reading directly from the pg_xlog directory is enough. Edit recovery.conf and comment out the following line:

  # primary_conninfo = 'host=127.0.0.1 port=1921 user=postgres keepalives_idle=60'
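Commenting the line out can be done with a one-liner. A sketch, assuming GNU sed and that the file is passed in explicitly (the helper name is hypothetical):

```shell
# Sketch: comment out primary_conninfo so the standby falls back to reading
# WAL straight from the shared pg_xlog directory instead of streaming it.
disable_streaming() {
  sed -i 's/^[[:space:]]*primary_conninfo/#primary_conninfo/' "$1"
}
```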

Restart the read-only instance.

  pg_ctl stop -m fast
  postgres@digoal-> pg_ctl start -o "-c log_directory=pg_log1922 -c port=1922"

Re-test data consistency. On the primary instance:

  postgres=# insert into test1 select generate_series(1,10);
  INSERT 0 10
  postgres=# insert into test1 select generate_series(1,10);
  INSERT 0 10
  postgres=# insert into test1 select generate_series(1,10);
  INSERT 0 10
  postgres=# insert into test1 select generate_series(1,10);
  INSERT 0 10

On the read-only instance:

  postgres=# select count(*) from test1;
   count
  -------
      60
  (1 row)

Problem analysis and fixes

So far, several problems remain unsolved:

  1. The standby still has to perform recovery, and the writes generated by recovery grow with the number of read-only instances. Recovery does bring one benefit: it solves the dirty-page problem, so dirty pages in the primary's shared buffers need no extra synchronization to the read-only instances. But it also brings a serious bug: replay may touch the same data page the primary is currently modifying, or replay may roll a block back to an older state after the primary has already updated it, leaving the block inconsistent. If the read-only instance is then shut down, followed immediately by the primary, the block will be inconsistent when the database comes back up.
  2. The standby still modifies the control file.
  3. To start another instance under the same $PGDATA, postmaster.pid must be deleted first.
  4. To stop an instance whose postmaster.pid has already been removed, the only way is to find the pid of its main postgres process and send it signal 15, 2, or 3 to shut the database down:

    static void
    set_mode(char *modeopt)
    {
        if (strcmp(modeopt, "s") == 0 || strcmp(modeopt, "smart") == 0)
        {
            shutdown_mode = SMART_MODE;
            sig = SIGTERM;
        }
        else if (strcmp(modeopt, "f") == 0 || strcmp(modeopt, "fast") == 0)
        {
            shutdown_mode = FAST_MODE;
            sig = SIGINT;
        }
        else if (strcmp(modeopt, "i") == 0 || strcmp(modeopt, "immediate") == 0)
        {
            shutdown_mode = IMMEDIATE_MODE;
            sig = SIGQUIT;
        }
        else
        {
            write_stderr(_("%s: unrecognized shutdown mode \"%s\"\n"), progname, modeopt);
            do_advice();
            exit(1);
        }
    }
  5. When the primary drops a rel page, the read-only instance's replay reports an error about invalid xlog referencing a rel page that no longer exists; this too is a consequence of the read-only instance having to replay WAL. The problem is trivially reproduced: just drop a table.

    2015-10-09 13:30:50.776 CST,,,2082,,561750ab.822,20,,2015-10-09 13:29:15 CST,1/0,0,WARNING,01000,"page 8 of relation base/151898/185251 does not exist",,,,,"xlog redo clean: rel 1663/151898/185251; blk 8 remxid 640632117",,,"report_invalid_page, xlogutils.c:67",""
    2015-10-09 13:30:50.776 CST,,,2082,,561750ab.822,21,,2015-10-09 13:29:15 CST,1/0,0,PANIC,XX000,"WAL contains references to invalid pages",,,,,"xlog redo clean: rel 1663/151898/185251; blk 8 remxid 640632117",,,"log_invalid_page, xlogutils.c:91",""

    For demo purposes, this error can be bypassed by commenting out the following block, so the walkthrough can continue.

    src/backend/access/transam/xlogutils.c
    /* Log a reference to an invalid page */
    static void
    log_invalid_page(RelFileNode node, ForkNumber forkno, BlockNumber blkno,
                     bool present)
    {
    //////
        /*
         * Once recovery has reached a consistent state, the invalid-page table
         * should be empty and remain so. If a reference to an invalid page is
         * found after consistency is reached, PANIC immediately. This might seem
         * aggressive, but it's better than letting the invalid reference linger
         * in the hash table until the end of recovery and PANIC there, which
         * might come only much later if this is a standby server.
         */
        //if (reachedConsistency)
        //{
        //    report_invalid_page(WARNING, node, forkno, blkno, present);
        //    elog(PANIC, "WAL contains references to invalid pages");
        //}
  6. Because this demo runs inside a single operating system, the OS dirty-page-cache problem does not appear. Across multiple hosts, we would have to synchronize the OS dirty page cache between nodes, or eliminate it altogether, e.g. by using direct I/O, or by using a cluster file system such as GFS2.
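The set_mode() mapping shown in problem 4 translates directly to plain signals, which is what we need to stop an instance whose postmaster.pid has been removed. A sketch (the helper name is hypothetical):

```shell
# Sketch: map pg_ctl shutdown modes to the signals set_mode() sends, so an
# instance without a postmaster.pid can be stopped with a plain kill.
shutdown_signal() {
  case "$1" in
    smart)     echo TERM ;;  # signal 15
    fast)      echo INT  ;;  # signal 2
    immediate) echo QUIT ;;  # signal 3
    *)         return 1  ;;
  esac
}
# usage: kill -s "$(shutdown_signal fast)" <pid of the postgres main process>
```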

To productize this, at least the problems above must be solved.

Let's first address the Aurora instance writing data files, writing the control file, and running checkpoints.

  1. Add a startup parameter indicating whether this instance is an Aurora instance (i.e. a read-only instance).

    # vi src/backend/utils/misc/guc.c
    /******** option records follow ********/
    static struct config_bool ConfigureNamesBool[] =
    {
        {
            {"aurora", PGC_POSTMASTER, CONN_AUTH_SETTINGS,
                gettext_noop("Starts the server as an Aurora (read-only) instance."),
                NULL
            },
            &aurora,
            false,
            NULL, NULL, NULL
        },
  2. Declare the new variable.

    # vi src/include/postmaster/postmaster.h
    extern bool aurora;
  3. Forbid the Aurora instance from updating the control file.

    # vi src/backend/access/transam/xlog.c
    #include "postmaster/postmaster.h"
    bool aurora;
    void
    UpdateControlFile(void)
    {
        if (aurora) return;
  4. Disable the bgwriter process on the Aurora instance.

    # vi src/backend/postmaster/bgwriter.c
    #include "postmaster/postmaster.h"
    bool aurora;
    /*
     * Main entry point for bgwriter process
     *
     * This is invoked from AuxiliaryProcessMain, which has already created the
     * basic execution environment, but not enabled signals yet.
     */
    void
    BackgroundWriterMain(void)
    {
    //////
        pg_usleep(1000000L);
        /*
         * If an exception is encountered, processing resumes here.
         *
         * See notes in postgres.c about the design of this coding.
         */
        if (!aurora && sigsetjmp(local_sigjmp_buf, 1) != 0)
        {
    //////
            /*
             * Do one cycle of dirty-buffer writing.
             */
            if (!aurora) {
                can_hibernate = BgBufferSync();
    //////
            }
            pg_usleep(1000000L);
        }
    }
  5. Disable the checkpointer process on the Aurora instance.

    # vi src/backend/postmaster/checkpointer.c
    #include "postmaster/postmaster.h"
    bool aurora;
    //////
    /*
     * Main entry point for checkpointer process
     *
     * This is invoked from AuxiliaryProcessMain, which has already created the
     * basic execution environment, but not enabled signals yet.
     */
    void
    CheckpointerMain(void)
    {
    //////
        /*
         * Loop forever
         */
        for (;;)
        {
            bool        do_checkpoint = false;
            int         flags = 0;
            pg_time_t   now;
            int         elapsed_secs;
            int         cur_timeout;
            int         rc;

            pg_usleep(100000L);

            /* Clear any already-pending wakeups */
            if (!aurora) ResetLatch(&MyProc->procLatch);

            /*
             * Process any requests or signals received recently.
             */
            if (!aurora) AbsorbFsyncRequests();

            if (!aurora && got_SIGHUP)
            {
                got_SIGHUP = false;
                ProcessConfigFile(PGC_SIGHUP);

                /*
                 * Checkpointer is the last process to shut down, so we ask it to
                 * hold the keys for a range of other tasks required most of which
                 * have nothing to do with checkpointing at all.
                 *
                 * For various reasons, some config values can change dynamically
                 * so the primary copy of them is held in shared memory to make
                 * sure all backends see the same value. We make Checkpointer
                 * responsible for updating the shared memory copy if the
                 * parameter setting changes because of SIGHUP.
                 */
                UpdateSharedMemoryConfig();
            }
            if (!aurora && checkpoint_requested)
            {
                checkpoint_requested = false;
                do_checkpoint = true;
                BgWriterStats.m_requested_checkpoints++;
            }
            if (!aurora && shutdown_requested)
            {
                /*
                 * From here on, elog(ERROR) should end with exit(1), not send
                 * control back to the sigsetjmp block above
                 */
                ExitOnAnyError = true;
                /* Close down the database */
                ShutdownXLOG(0, 0);
                /* Normal exit from the checkpointer is here */
                proc_exit(0);       /* done */
            }

            /*
             * Force a checkpoint if too much time has elapsed since the last one.
             * Note that we count a timed checkpoint in stats only when this
             * occurs without an external request, but we set the CAUSE_TIME flag
             * bit even if there is also an external request.
             */
            now = (pg_time_t) time(NULL);
            elapsed_secs = now - last_checkpoint_time;
            if (!aurora && elapsed_secs >= CheckPointTimeout)
            {
                if (!do_checkpoint)
                    BgWriterStats.m_timed_checkpoints++;
                do_checkpoint = true;
                flags |= CHECKPOINT_CAUSE_TIME;
            }

            /*
             * Do a checkpoint if requested.
             */
            if (!aurora && do_checkpoint)
            {
                bool        ckpt_performed = false;
                bool        do_restartpoint;

                /* use volatile pointer to prevent code rearrangement */
                volatile CheckpointerShmemStruct *cps = CheckpointerShmem;

                /*
                 * Check if we should perform a checkpoint or a restartpoint. As a
                 * side-effect, RecoveryInProgress() initializes TimeLineID if
                 * it's not set yet.
                 */
                do_restartpoint = RecoveryInProgress();

                /*
                 * Atomically fetch the request flags to figure out what kind of a
                 * checkpoint we should perform, and increase the started-counter
                 * to acknowledge that we've started a new checkpoint.
                 */
                SpinLockAcquire(&cps->ckpt_lck);
                flags |= cps->ckpt_flags;
                cps->ckpt_flags = 0;
                cps->ckpt_started++;
                SpinLockRelease(&cps->ckpt_lck);

                /*
                 * The end-of-recovery checkpoint is a real checkpoint that's
                 * performed while we're still in recovery.
                 */
                if (flags & CHECKPOINT_END_OF_RECOVERY)
                    do_restartpoint = false;
    //////
                ckpt_active = false;
            }

            /* Check for archive_timeout and switch xlog files if necessary. */
            if (!aurora) CheckArchiveTimeout();

            /*
             * Send off activity statistics to the stats collector. (The reason
             * why we re-use bgwriter-related code for this is that the bgwriter
             * and checkpointer used to be just one process. It's probably not
             * worth the trouble to split the stats support into two independent
             * stats message types.)
             */
            if (!aurora) pgstat_send_bgwriter();

            /*
             * Sleep until we are signaled or it's time for another checkpoint or
             * xlog file switch.
             */
            now = (pg_time_t) time(NULL);
            elapsed_secs = now - last_checkpoint_time;
            if (elapsed_secs >= CheckPointTimeout)
                continue;           /* no sleep for us ... */
            cur_timeout = CheckPointTimeout - elapsed_secs;
            if (!aurora && XLogArchiveTimeout > 0 && !RecoveryInProgress())
            {
                elapsed_secs = now - last_xlog_switch_time;
                if (elapsed_secs >= XLogArchiveTimeout)
                    continue;       /* no sleep for us ... */
                cur_timeout = Min(cur_timeout, XLogArchiveTimeout - elapsed_secs);
            }
            if (!aurora) rc = WaitLatch(&MyProc->procLatch,
                                        WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                                        cur_timeout * 1000L /* convert to ms */ );

            /*
             * Emergency bailout if postmaster has died. This is to avoid the
             * necessity for manual cleanup of all postmaster children.
             */
            if (rc & WL_POSTMASTER_DEATH)
                exit(1);
        }
    }
    //////
    /* SIGINT: set flag to run a normal checkpoint right away */
    static void
    ReqCheckpointHandler(SIGNAL_ARGS)
    {
        if (aurora)
            return;
        int save_errno = errno;
        checkpoint_requested = true;
        if (MyProc)
            SetLatch(&MyProc->procLatch);
        errno = save_errno;
    }
    //////
    /*
     * AbsorbFsyncRequests
     *      Retrieve queued fsync requests and pass them to local smgr.
     *
     * This is exported because it must be called during CreateCheckPoint;
     * we have to be sure we have accepted all pending requests just before
     * we start fsync'ing. Since CreateCheckPoint sometimes runs in
     * non-checkpointer processes, do nothing if not checkpointer.
     */
    void
    AbsorbFsyncRequests(void)
    {
        CheckpointerRequest *requests = NULL;
        CheckpointerRequest *request;
        int         n;

        if (!AmCheckpointerProcess() || aurora)
            return;
    //////
  6. Forbid manual CHECKPOINT commands on the Aurora instance.

    # vi src/backend/tcop/utility.c
    #include "postmaster/postmaster.h"
    bool aurora;
    //////
    void
    standard_ProcessUtility(Node *parsetree,
                            const char *queryString,
                            ProcessUtilityContext context,
                            ParamListInfo params,
                            DestReceiver *dest,
                            char *completionTag)
    {
    //////
        case T_CheckPointStmt:
            if (!superuser() || aurora)
                ereport(ERROR,
                        (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
                         errmsg("must be superuser to do CHECKPOINT")));

With the code changes above, recompile and we are close to a demo. The Aurora instance now no longer updates the control file, writes data files, or runs checkpoints, which is exactly what we wanted. When starting the read-only instance, add the parameter aurora=true to start it as an Aurora instance.

  pg_ctl start -o "-c log_directory=pg_log1922 -c port=1922 -c aurora=true"

To productize this, many more details would need attention; this is only a demo. Keep it up, Alibaba Cloud RDS folks!

There is also a more conservative design: a shared-storage multi-reader architecture that keeps two copies of the data. One copy belongs to the primary instance, which works on it alone and which no other instance touches; the other copy belongs to a standby and is placed on shared storage for the multiple read-only instances to use.

References

  1. https://aws.amazon.com/cn/rds/aurora/
  2. src/backend/access/transam/xlog.c

    /*
     * Open the WAL segment containing WAL position 'RecPtr'.
     *
     * The segment can be fetched via restore_command, or via walreceiver having
     * streamed the record, or it can already be present in pg_xlog. Checking
     * pg_xlog is mainly for crash recovery, but it will be polled in standby mode
     * too, in case someone copies a new segment directly to pg_xlog. That is not
     * documented or recommended, though.
     *
     * If 'fetching_ckpt' is true, we're fetching a checkpoint record, and should
     * prepare to read WAL starting from RedoStartLSN after this.
     *
     * 'RecPtr' might not point to the beginning of the record we're interested
     * in, it might also point to the page or segment header. In that case,
     * 'tliRecPtr' is the position of the WAL record we're interested in. It is
     * used to decide which timeline to stream the requested WAL from.
     *
     * If the record is not immediately available, the function returns false
     * if we're not in standby mode. In standby mode, waits for it to become
     * available.
     *
     * When the requested record becomes available, the function opens the file
     * containing it (if not open already), and returns true. When end of standby
     * mode is triggered by the user, and there is no more WAL available, returns
     * false.
     */
    static bool
    WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
                                bool fetching_ckpt, XLogRecPtr tliRecPtr)
    {
    //////
        static pg_time_t last_fail_time = 0;
        pg_time_t   now;

        /*-------
         * Standby mode is implemented by a state machine:
         *
         * 1. Read from either archive or pg_xlog (XLOG_FROM_ARCHIVE), or just
         *    pg_xlog (XLOG_FROM_XLOG)
         * 2. Check trigger file
         * 3. Read from primary server via walreceiver (XLOG_FROM_STREAM)
         * 4. Rescan timelines
         * 5. Sleep 5 seconds, and loop back to 1.
         *
         * Failure to read from the current source advances the state machine to
         * the next state.
         *
         * 'currentSource' indicates the current state. There are no currentSource
         * values for "check trigger", "rescan timelines", and "sleep" states,
         * those actions are taken when reading from the previous source fails, as
         * part of advancing to the next state.
         *-------
         */