Manually Upgrading an Active Failover Deployment

Manually Upgrading an Active Failover Deployment

This page will guide you through the process of a manual upgrade of an Active Failoversetup. The different nodes can be upgraded one at a time withoutincurring a prolonged downtime of the entire system. The downtimes of the individual nodesshould also stay fairly low.

The manual upgrade procedure described in this section can be used to upgradeto a new hotfix version, or to perform an upgrade to a new minor version of ArangoDB.Please refer to the Upgrade Paths sectionfor detailed information.

Preparations

The ArangoDB installation packages (e.g. for Debian or Ubuntu) set up aconvenient standalone instance of arangod. During installation, this instance’sdatabase will be upgraded (see —database.auto-upgrade)and the service will be (re)started.

You have to make sure that your Active Failover deployment is independent of thisstandalone instance. Specifically, make sure that the database directory aswell as the socket used by the standalone instance provided by the package areseparate from the ones in your Active Failover configuration. Also, that you haven’tmodified the init script or systemd unit file for the standalone instance in a waythat it would start or stop your Active Failover instance instead.

Install the new ArangoDB version binary

The first step is to install the new ArangoDB package.

Note: you do not have to stop the Active Failover (arangod) processes before upgrading it.

For example, if you want to upgrade to 3.3.16 on Debian or Ubuntu, either call

$ apt install arangodb=3.3.16-1

(apt-get on older versions) if you have added the ArangoDB repository. Orinstall a specific package using

$ dpkg -i arangodb3-3.3.16-1_amd64.deb

after you have downloaded the corresponding file fromdownload.arangodb.com.

Stop the Standalone Instance

As the package will automatically start the standalone instance, you might want tostop that instance now, as otherwise it can create some confusion later. As you are starting the Active Failover processes manuallyyou will not need the automatically installed and started standalone instance, and you should hence stop it via:

$ service arangodb3 stop

Also, you might want to remove the standalone instance from the defaultrunlevels to prevent it from starting on the next reboots of your machine. How thisis done depends on your distribution and init system. For example, on older Debianand Ubuntu systems using a SystemV-compatible init, you can use:

$ update-rc.d -f arangodb3 remove

Set supervision into maintenance mode

Important: Supervision maintenance mode is supported from ArangoDB versions 3.3.8/3.2.14 or higher.

You have two main choices when performing an upgrade of the Active Failover setup:

Upgrade while incurring a leader-to-follower switch (with reduced downtime)
An upgrade with no leader-to-follower switch.Turning the maintenance mode on will enable the latter case. You might have a short downtime during the leader upgrade, but there will be no potential loss of acknowledged operations.

To enable the maintenance mode means to essentially disable the Agency supervision for a limited amount of time during the upgrade procedure. The following API calls will activate and deactivate the maintenance mode of the supervision job. You might use curl to send the API calls.The following examples assume there is an Active Failover node running on localhost on port 7002.

Activate Maintenance mode

curl -u username:password <single-server>/_admin/cluster/maintenance -XPUT -d'"on"'

For example:

curl -u "root:" http://localhost:7002/_admin/cluster/maintenance -XPUT -d'"on"'
{"error":false,"warning":"Cluster supervision deactivated. 
It will be reactivated automatically in 60 minutes unless this call is repeated until then."}

Note: In case the manual upgrade takes longer than 60 minutes, the API call has to be resent.

Deactivate Maintenance mode

The cluster supervision resumes automatically 60 minutes after disabling it.It can be manually reactivated earlier at any point using the following API call:

curl -u username:password <single-server>/_admin/cluster/maintenance -XPUT -d'"off"'

For example:

curl -u "root:" http://localhost:7002/_admin/cluster/maintenance -XPUT -d'"off"'
{"error":false,"warning":"Cluster supervision reactivated."}

Upgrade the Active Failover processes

Now all the Active Failover (Agents, Single-Server) processes (arangod) have to beupgraded on each node.

Note: Please read the section regarding the maintenance mode above.

In order to stop an arangod process we will need to use a command like kill -15:

kill -15 <pid-of-arangod-process>

The pid associated to your Active Failover setup can be checked using a command like ps:

ps -C arangod -fww

The output of the command above does not only show the process ids of all arangod processes but also the used commands, which is useful for the followingrestarts of all arangod processes.

The output below is from a test machine where three Agents and two _Single-Servers_were running locally. In a more production-like scenario, you will find only one instance of each type running per machine:

ps -C arangod -fww
UID        PID  PPID  C STIME TTY          TIME CMD
max      29075  8072  0 13:50 pts/2    00:00:42 arangod --server.endpoint tcp://0.0.0.0:5001 --agency.my-address=tcp://127.0.0.1:5001 --server.authentication false --agency.activate true --agency.size 3 --agency.endpoint tcp://127.0.0.1:5001 --agency.supervision true --log.file a1 --javascript.app-path /tmp --database.directory agent1
max      29208  8072  2 13:51 pts/2    00:02:08 arangod --server.endpoint tcp://0.0.0.0:5002 --agency.my-address=tcp://127.0.0.1:5002 --server.authentication false --agency.activate true --agency.size 3 --agency.endpoint tcp://127.0.0.1:5001 --agency.supervision true --log.file a2 --javascript.app-path /tmp --database.directory agent2
max      29329 16224  0 13:51 pts/3    00:00:42 arangod --server.endpoint tcp://0.0.0.0:5003 --agency.my-address=tcp://127.0.0.1:5003 --server.authentication false --agency.activate true --agency.size 3 --agency.endpoint tcp://127.0.0.1:5001 --agency.supervision true --log.file a3 --javascript.app-path /tmp --database.directory agent3
max      29824 16224  1 13:55 pts/3    00:01:53 arangod --server.authentication=false --server.endpoint tcp://0.0.0.0:7001 --cluster.my-address tcp://127.0.0.1:7001 --cluster.my-role SINGLE --cluster.agency-endpoint tcp://127.0.0.1:5001 --cluster.agency-endpoint tcp://127.0.0.1:5002 --cluster.agency-endpoint tcp://127.0.0.1:5003 --log.file c1 --javascript.app-path /tmp --database.directory single1
max      29938 16224  2 13:56 pts/3    00:02:13 arangod --server.authentication=false --server.endpoint tcp://0.0.0.0:7002 --cluster.my-address tcp://127.0.0.1:7002 --cluster.my-role SINGLE --cluster.agency-endpoint tcp://127.0.0.1:5001 --cluster.agency-endpoint tcp://127.0.0.1:5002 --cluster.agency-endpoint tcp://127.0.0.1:5003 --log.file c2 --javascript.app-path /tmp --database.directory single2

Note: The start commands of Agent and Single Server are required for restarting the processes later.

The recommended procedure for upgrading an Active Failover setup is to stop, upgrade and restart the arangod instances one by one on all participating servers, starting first with all Agent instances, and then following with the Active Failover instances themselves. When upgrading the Active Failover instances, the followers shouldbe upgraded first.

To figure out the node containing the followers you can consult the cluster endpoints API:

curl http://<single-server>:7002/_api/cluster/endpoints

This will yield a list of endpoints, the first of which is always the leader node.

Stopping, upgrading and restarting an instance

To stop an instance, the currently running process has to be identified using the pscommand above.

Let’s assume we are about to upgrade an Agent instance, so we have to look in the psoutput for an agent instance first, and note its process id (pid) and start command.

The process can then be stopped using the following command:

kill -15 <pid-of-agent>

The instance then has to be upgraded using the same command that was used before (in the ps output), but with the additional option:

--database.auto-upgrade=true

After the upgrade procecure has finishing successfully, the instance will remain stopped.So it has to be restarted using the command from the ps output before(this time without the —database.auto-upgrade option).

Once an Agent was upgraded and restarted successfully, repeat the procedure for theother Agent instances in the setup and then repeat the procedure for the _Active Failover_instances, there starting with the followers.

Final words

The Agency supervision then needs to be reactivated by issuing the following API call to the leader: