Master/Slave Architecture

Introduction

In a Master/Slave setup one or more ArangoDB slaves asynchronously replicatefrom a master.

The master is the ArangoDB instance where all data-modification operations shouldbe directed to. The slave is the ArangoDB instance that replicates the data fromthe master.

Components

Replication Logger

Purpose

The replication logger will write all data-modification operations into thewrite-ahead log. This log may then be read by clients to replay any datamodification on a different server.

Checking the state

To query the current state of the logger, use the state command:

  1. require("@arangodb/replication").logger.state();

The result might look like this:

  1. {
  2. "state" : {
  3. "running" : true,
  4. "lastLogTick" : "2339941",
  5. "lastUncommittedLogTick" : "2339941",
  6. "totalEvents" : 2339941,
  7. "time" : "2019-07-02T10:30:30Z"
  8. },
  9. "server" : {
  10. "version" : "3.5.0",
  11. "serverId" : "194754235820456",
  12. "engine" : "rocksdb"
  13. },
  14. "clients" : [
  15. {
  16. "syncerId" : "158",
  17. "serverId" : "161976545824597",
  18. "time" : "1970-01-23T07:59:10Z",
  19. "expires" : "1970-01-23T09:59:10Z",
  20. "lastServedTick" : 2339908
  21. }
  22. ]
  23. }

The running attribute will always be true. In earlier versions of ArangoDB thereplication was optional and this could have been false.

The totalEvents attribute indicates how many log events have been logged sincethe start of the ArangoDB server. The lastLogTick value indicates the id of the last committed operation that was written to the server’s write-ahead log.It can be used to determine whether new operations were logged, and is also usedby the replication applier for incremental fetching of data. The lastUncommittedLogTick_value contains the _id of the last uncommitted operation that was written to theserver’s WAL. For the RocksDB storage engine, lastLogTick and lastUncommittedLogTick are identical, as the WAL only contains committed operations.

The clients attribute reveals which clients (slaves) have connected to themaster recently, and up to which tick value they caught up with the replication.

Note: The replication logger state can also be queried via theHTTP API.

To query which data ranges are still available for replication clients to fetch,the logger provides the firstTick and tickRanges functions:

  1. require("@arangodb/replication").logger.firstTick();

This will return the minimum tick value that the server can provide to replicationclients via its replication APIs. The tickRanges function returns the minimumand maximum tick values per logfile:

  1. require("@arangodb/replication").logger.tickRanges();

Replication Applier

Purpose

The purpose of the replication applier is to read data from a master database’sevent log, and apply them locally. The applier will check the master databasefor new operations periodically. It will perform an incremental synchronization,i.e. only asking the master for operations that occurred after the last synchronization.

The replication applier does not get notified by the master database when thereare “new” operations available, but instead uses the pull principle. It might thustake some time (the so-called replication lag) before an operation from the masterdatabase gets shipped to, and applied in, a slave database.

The replication applier of a database is run in a separate thread. It may encounterproblems when an operation from the master cannot be applied safely, or when theconnection to the master database goes down (network outage, master database isdown or unavailable etc.). In this case, the database’s replication applier threadmight terminate itself. It is then up to the administrator to fix the problem andrestart the database’s replication applier.

If the replication applier cannot connect to the master database, or thecommunication fails at some point during the synchronization, the _replication applier_will try to reconnect to the master database. It will give up reconnecting onlyafter a configurable amount of connection attempts.

The replication applier state is queryable at any time by using the state commandof the applier. This will return the state of the applier of the current database:

  1. require("@arangodb/replication").applier.state();

The result might look like this:

  1. {
  2. "state" : {
  3. "started" : "2019-03-01T11:36:33Z",
  4. "running" : true,
  5. "phase" : "running",
  6. "lastAppliedContinuousTick" : "2050724544",
  7. "lastProcessedContinuousTick" : "2050724544",
  8. "lastAvailableContinuousTick" : "2050724546",
  9. "safeResumeTick" : "2050694546",
  10. "ticksBehind" : 2,
  11. "progress" : {
  12. "time" : "2019-03-01T11:36:33Z",
  13. "message" : "fetching master log from tick 2050694546, last scanned tick 2050664547, first regular tick 2050544543, barrier: 0, open transactions: 1, chunk size 6291456",
  14. "failedConnects" : 0
  15. },
  16. "totalRequests" : 2,
  17. "totalFailedConnects" : 0,
  18. "totalEvents" : 50010,
  19. "totalDocuments" : 50000,
  20. "totalRemovals" : 0,
  21. "totalResyncs" : 0,
  22. "totalOperationsExcluded" : 0,
  23. "totalApplyTime" : 1.1071290969848633,
  24. "averageApplyTime" : 1.1071290969848633,
  25. "totalFetchTime" : 0.2129514217376709,
  26. "averageFetchTime" : 0.10647571086883545,
  27. "lastError" : {
  28. "errorNum" : 0
  29. },
  30. "time" : "2019-03-01T11:36:34Z"
  31. },
  32. "server" : {
  33. "version" : "3.4.4",
  34. "serverId" : "46402312160836"
  35. },
  36. "endpoint" : "tcp://master.example.org",
  37. "database" : "test"
  38. }

The running attribute indicates whether the replication applier of the currentdatabase is currently running and polling the master at endpoint for new events.

The started attribute shows at what date and time the applier was started (if at all).

The progress.failedConnects attribute shows how many failed connection attemptsthe replication applier currently has encountered in a row. In contrast, thetotalFailedConnects attribute indicates how many failed connection attempts theapplier has made in total. The totalRequests attribute shows how many requeststhe applier has sent to the master database in total.

The totalEvents attribute shows how many log events the applier has read from the master. The totalDocuments and totalRemovals attributes indicate how may documentoperations the slave has applied locally.

The attributes totalApplyTime and totalFetchTime show the total time the applierspent for applying data batches locally, and the total time the applier waited ondata-fetching requests to the master, respectively.The averageApplyTime and averageFetchTime attributes show the average times clockedfor these operations. Note that the average times will greatly be influenced by thechunk size used in the applier configuration (bigger chunk sizes mean less requests tothe slave, but the batches will include more data and take more time to createand apply).

The progress.message sub-attribute provides a brief hint of what the applier_currently does (if it is running). The _lastError attribute also has an optionalerrorMessage sub-attribute, showing the latest error message. The errorNum_sub-attribute of the _lastError attribute can be used by clients to programmaticallycheck for errors. It should be 0 if there is no error, and it should be non-zeroif the applier terminated itself due to a problem.

Below is an example of the state after the replication applier terminated itselfdue to (repeated) connection problems:

  1. {
  2. "state" : {
  3. "started" : "2019-03-01T11:51:18Z",
  4. "running" : false,
  5. "phase" : "inactive",
  6. "lastAppliedContinuousTick" : "2101606350",
  7. "lastProcessedContinuousTick" : "2101606370",
  8. "lastAvailableContinuousTick" : "2101606370",
  9. "safeResumeTick" : "2101606350",
  10. "progress" : {
  11. "time" : "2019-03-01T11:52:45Z",
  12. "message" : "applier shut down",
  13. "failedConnects" : 6
  14. },
  15. "totalRequests" : 19,
  16. "totalFailedConnects" : 6,
  17. "totalEvents" : 0,
  18. "totalDocuments" : 0,
  19. "totalRemovals" : 0,
  20. "totalResyncs" : 0,
  21. "totalOperationsExcluded" : 0,
  22. "totalApplyTime" : 0,
  23. "averageApplyTime" : 0,
  24. "totalFetchTime" : 0.03386974334716797,
  25. "averageFetchTime" : 0.0028224786122639975,
  26. "lastError" : {
  27. "errorNum" : 1400,
  28. "time" : "2019-03-01T11:52:45Z",
  29. "errorMessage" : "could not connect to master at tcp://127.0.0.1:8529 for URL /_api/wal/tail?chunkSize=6291456&barrier=0&from=2101606369&lastScanned=2101606370&serverId=46402312160836&includeSystem=true&includeFoxxQueues=false: Could not connect to 'http+tcp://127.0.0.1:852..."
  30. },
  31. "time" : "2019-03-01T11:52:56Z"
  32. },
  33. "server" : {
  34. "version" : "3.4.4",
  35. "serverId" : "46402312160836"
  36. },
  37. "endpoint" : "tcp://master.example.org",
  38. "database" : "test"
  39. }

Note: the state of a database’s replication applier is queryable via the HTTPAPI, too. Please refer to HTTP Interface for Replicationfor more details.