Replication

Replication allows you to replicate data onto another machine. Itforms the base of all disaster recovery and failover features ArangoDBoffers.

ArangoDB offers synchronous and asynchronous replication.

Synchronous replication is used between the DBServers of an ArangoDBCluster.

Asynchronous replication is used:

  • between the master and the slave of an ArangoDB Master/Slave setup
  • between the Leader and the Follower of an ArangoDB Active Failover setup
  • between multiple ArangoDB Data Centers (inside the same Data Center replication is synchronous)

Synchronous replication

Synchronous replication only works within an ArangoDB Cluster and is typicallyused for mission critical data which must be accessible at alltimes. Synchronous replication generally stores a copy of a shard’sdata on another DBServer and keeps it in sync. Essentially, when storingdata after enabling synchronous replication the Cluster will wait forall replicas to write all the data before greenlighting the writeoperation to the client. This will naturally increase the latency abit, since one more network hop is needed for each write. However, itwill enable the cluster to immediately fail over to a replica wheneveran outage has been detected, without losing any committed data, andmostly without even signaling an error condition to the client.

Synchronous replication is organized such that every shard has aleader and r-1 followers, where r denoted the replicationfactor. The number of followers can be controlled using thereplicationFactor parameter whenever you create a collection, thereplicationFactor parameter is the total number of copies beingkept, that is, it is one plus the number of followers.

Asynchronous replication

In ArangoDB any write operation is logged in the write-aheadlog.

When using asynchronous replication slaves (or followers)connect to a master (or leader) and apply locally all the events fromthe master log in the same order. As a result the slaves (followers) will have the same state of data as the master (leader).

Slaves (followers) are only eventually consistent with the master (leader).

Transactions are honored in replication, i.e. transactional write operations will become visible on slaves atomically.

As all write operations will be logged to a master database’s write-ahead log, the replication in ArangoDB currently cannot be used for write-scaling. The main purposes of the replication in current ArangoDB are to provide read-scalability and “hot backups” of specific databases.

It is possible to connect multiple slave to the same master. Slaves should be usedas read-only instances, and no user-initiated write operations should be carried out on them. Otherwise data conflicts may occur that cannot be solved automatically, and that will make the replication stop.

In an asynchronous replication scenario slaves will pull changes from the master. Slaves need to know to which master they should connect to, but a master is not aware of the slaves that replicate from it. When the network connection between the master and a slave goes down, write operations on the master can continue normally. When the network is up again, slaves can reconnect to the master and transfer the remaining changes. This will happen automatically provided slaves are configured appropriately.

Before 3.3.0 asynchronous replication was per database. Starting with 3.3.0 it is possibleto setup global replication.

Replication lag

As decribed above, write operations are applied first in the master, and then applied in the slaves.

For example, let’s assume a write operation is executed in the master at point in time t0. To make a slave apply the same operation, it must first fetch the write operation’s data from master’s write-ahead log, then parse it and apply it locally. This will happen at some point in time after t0, let’s say t1.

The difference between t1 and t0 is called the replication lag, and it is unavoidable in asynchronous replication. The amount of replication lag depends on many factors, a few of which are:

  • the network capacity between the slaves and the master
  • the load of the master and the slaves
  • the frequency in which slaves poll the master for updatesBetween t0 and t1, the state of data on the master is newer than the state of dataon the slaves. At point in time t1, the state of data on the master and slaves_is consistent again (provided no new data modifications happened on the _master inbetween). Thus, the replication will lead to an eventually consistent state of data.

Replication overhead

As the master servers are logging any write operation in the write-ahead-log anyway replication doesn’t cause any extra overhead on the master. However it will of course cause some overhead for the master to serve incoming read requests of the slaves. Returning the requested data is however a trivial task for the master and should not result in a notable performance degration in production.