Draining Kubernetes nodes

If Kubernetes nodes with ArangoDB pods on them are drained without care, data loss can occur! The recommended procedure is described below.

For maintenance work in k8s it is sometimes necessary to drain a k8s node, which means removing all pods from it. Kubernetes offers a standard API for this and our operator supports this - to the best of its ability.

Draining nodes is easy enough for stateless services, which can simply be re-launched on any other node. However, for a stateful service this operation is more difficult, and as a consequence more costly, and there are certain risks involved if the operation is not done carefully enough. To put it simply, the operator must first move all the data stored on the node (which could be in a locally attached disk) to another machine, before it can shut down the pod gracefully. Moving data takes time, and even after the move, the distributed system ArangoDB has to recover from this change, for example by ensuring data synchronicity between the replicas in their new location.

Therefore, a systematic drain of all k8s nodes in sequence has to follow a careful procedure, in particular to ensure that ArangoDB is ready to move to the next step. This is necessary to avoid catastrophic data loss, and is simply the price one pays for running a stateful service.

Anatomy of a drain procedure in k8s: the grace period

When a kubectl drain operation is triggered for a node, k8s first checks if there are any pods with local data on disk. Our ArangoDB pods have this property (the Coordinators do use EmptyDir volumes, and Agents and DBServers could have persistent volumes which are actually stored on a locally attached disk), so one has to override this with the --delete-local-data=true option.

Furthermore, quite often, the node will contain pods which are managed by a DaemonSet (which is not the case for ArangoDB), which makes it necessary to override this check with the --ignore-daemonsets=true option.

Finally, it is checked if the node has any pods which are not managed by anything, either by k8s itself (ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet) or by an operator. If this is the case, the drain operation will be refused, unless one uses the option --force=true. Since the ArangoDB operator manages our pods, we do not have to use this option for ArangoDB, but you might have to use it for other pods.

If all these checks have passed, k8s proceeds as follows: All pods are notified about this event and are put into a Terminating state. During this time, they have a chance to take action, or indeed the operator managing them has. In particular, although the pods get termination notices, they can keep running until the operator has removed all finalizers. This gives the operator a chance to sort out things, for example in our case to move data away from the pod.
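For example, one can inspect the finalizers the operator has set on a pod. The exact finalizer names are operator-internal and may vary between versions; the pod name below is the one from the example later in this document:

  # Show the finalizers currently set on an ArangoDB pod; as long as
  # these are present, k8s will not remove the pod.
  kubectl get pod my-arangodb-cluster-prmr-wbsq47rz-5676ed \
    -o jsonpath='{.metadata.finalizers}'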

However, there is a limit to this tolerance by k8s, and that is the grace period. If the grace period has passed but the pod has not actually terminated, then it is killed the hard way. If this happens, the operator has no choice but to remove the pod and drop its persistent volume claim and persistent volume. This will obviously lead to a failure incident in ArangoDB and must be handled by fail-over management. Therefore, this event should be avoided.

Things to check in ArangoDB before a node drain

There are basically two things one should check in an ArangoDB cluster before a node drain operation can be started:

  • All cluster nodes are up and running and healthy.
  • For all collections and shards all configured replicas are in sync.

If any cluster node is unhealthy, there is an increased risk that the system does not have enough resources to cope with a failure situation.

If any shard replicas are not currently in sync, then there is a serious risk that the cluster is currently not as resilient as expected.

One possibility to verify these two things is via the ArangoDB web interface. Node health can be monitored in the Overview tab under NODES:

Cluster Health Screen

Check that all nodes are green and that there is no node error in the top right corner.

As to the shards being in sync, see the Shards tab under NODES:

Shard Screen

Check that all collections have a green check mark on the right side. If any collection does not have such a check mark, you can click on the collection and see the details about shards. Please keep in mind that this has to be done for each database separately!

Obviously, this might be tedious and calls for automation. Therefore, there are APIs for this. The first one is Cluster Health:

  GET /_admin/cluster/health

… which returns a JSON document looking like this:

  {
    "Health": {
      "CRDN-rxtu5pku": {
        "Endpoint": "ssl://my-arangodb-cluster-coordinator-rxtu5pku.my-arangodb-cluster-int.default.svc:8529",
        "LastAckedTime": "2019-02-20T08:09:22Z",
        "SyncTime": "2019-02-20T08:09:21Z",
        "Version": "3.4.2-1",
        "Engine": "rocksdb",
        "ShortName": "Coordinator0002",
        "Timestamp": "2019-02-20T08:09:22Z",
        "Status": "GOOD",
        "SyncStatus": "SERVING",
        "Host": "my-arangodb-cluster-coordinator-rxtu5pku.my-arangodb-cluster-int.default.svc",
        "Role": "Coordinator",
        "CanBeDeleted": false
      },
      "PRMR-wbsq47rz": {
        "LastAckedTime": "2019-02-21T09:14:24Z",
        "Endpoint": "ssl://my-arangodb-cluster-dbserver-wbsq47rz.my-arangodb-cluster-int.default.svc:8529",
        "SyncTime": "2019-02-21T09:14:24Z",
        "Version": "3.4.2-1",
        "Host": "my-arangodb-cluster-dbserver-wbsq47rz.my-arangodb-cluster-int.default.svc",
        "Timestamp": "2019-02-21T09:14:24Z",
        "Status": "GOOD",
        "SyncStatus": "SERVING",
        "Engine": "rocksdb",
        "ShortName": "DBServer0006",
        "Role": "DBServer",
        "CanBeDeleted": false
      },
      "AGNT-wrqmwpuw": {
        "Endpoint": "ssl://my-arangodb-cluster-agent-wrqmwpuw.my-arangodb-cluster-int.default.svc:8529",
        "Role": "Agent",
        "CanBeDeleted": false,
        "Version": "3.4.2-1",
        "Engine": "rocksdb",
        "Leader": "AGNT-oqohp3od",
        "Status": "GOOD",
        "LastAckedTime": 0.312
      },
      ... [some more entries, one for each instance]
    },
    "ClusterId": "210a0536-fd28-46de-b77f-e8882d6d7078",
    "error": false,
    "code": 200
  }

Check that each instance has a Status field with the value "GOOD". Here is a shell command which makes this check easy, using the jq JSON pretty printer:

  curl -k https://arangodb.9hoeffer.de:8529/_admin/cluster/health --user root: | jq . | grep '"Status"' | grep -v '"GOOD"'
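Alternatively, one can let jq do the filtering itself. This is just a sketch, assuming a reasonably recent jq; no output means that all instances are healthy:

  # Print the IDs of all instances whose Status is not "GOOD"
  curl -sk https://arangodb.9hoeffer.de:8529/_admin/cluster/health --user root: \
    | jq -r '.Health | to_entries[] | select(.value.Status != "GOOD") | .key'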

For the shards being in sync there is the Cluster Inventory API call:

  GET /_db/_system/_api/replication/clusterInventory

… which returns a JSON body like this:

  {
    "collections": [
      {
        "parameters": {
          "cacheEnabled": false,
          "deleted": false,
          "globallyUniqueId": "c2010061/",
          "id": "2010061",
          "isSmart": false,
          "isSystem": false,
          "keyOptions": {
            "allowUserKeys": true,
            "type": "traditional"
          },
          "name": "c",
          "numberOfShards": 6,
          "planId": "2010061",
          "replicationFactor": 2,
          "shardKeys": [
            "_key"
          ],
          "shardingStrategy": "hash",
          "shards": {
            "s2010066": [
              "PRMR-vzeebvwf",
              "PRMR-e6hbjob1"
            ],
            "s2010062": [
              "PRMR-e6hbjob1",
              "PRMR-vzeebvwf"
            ],
            "s2010065": [
              "PRMR-e6hbjob1",
              "PRMR-vzeebvwf"
            ],
            "s2010067": [
              "PRMR-vzeebvwf",
              "PRMR-e6hbjob1"
            ],
            "s2010064": [
              "PRMR-vzeebvwf",
              "PRMR-e6hbjob1"
            ],
            "s2010063": [
              "PRMR-e6hbjob1",
              "PRMR-vzeebvwf"
            ]
          },
          "status": 3,
          "type": 2,
          "waitForSync": false
        },
        "indexes": [],
        "planVersion": 132,
        "isReady": true,
        "allInSync": true
      },
      ... [more collections following]
    ],
    "views": [],
    "tick": "38139421",
    "state": "unused"
  }

Check that for all collections the attributes "isReady" and "allInSync" both have the value true. Note that it is necessary to do this for all databases!

Here is a shell command which makes this check easy:

  curl -k https://arangodb.9hoeffer.de:8529/_db/_system/_api/replication/clusterInventory --user root: | jq . | grep '"isReady"\|"allInSync"' | sort | uniq -c
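Since this check has to be repeated for every database, a small loop helps. The following sketch (assuming jq is available) enumerates all databases via the standard GET /_api/database call and prints the name of every collection that is not ready or not fully in sync; no output under a database name means everything there is in order:

  # Check the cluster inventory of every database
  for DB in $(curl -sk https://arangodb.9hoeffer.de:8529/_api/database --user root: \
      | jq -r '.result[]'); do
    echo "Database: $DB"
    curl -sk https://arangodb.9hoeffer.de:8529/_db/$DB/_api/replication/clusterInventory --user root: \
      | jq -r '.collections[] | select((.isReady and .allInSync) | not) | .parameters.name'
  done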

If all these checks are performed and are okay, then it is safe to continue with the clean out and drain procedure as described below.

If there are some collections with replicationFactor set to 1, the system is not resilient and cannot tolerate the failure of even a single server! One can still perform a drain operation in this case, but if anything goes wrong, in particular if the grace period is chosen too short and a pod is killed the hard way, data loss can happen.

If all replicationFactors of all collections are at least 2, then the system can tolerate the failure of a single DBserver. If you have set the Environment to Production in the specs of the ArangoDB deployment, you will only ever have one DBserver on each k8s node and therefore the drain operation is relatively safe, even if the grace period is chosen too small.
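To verify this setting on a running deployment, one can query the ArangoDeployment custom resource; a sketch, assuming the deployment name my-arangodb-cluster used in the examples of this document:

  # Print the configured environment of the deployment
  kubectl get arangodeployments.database.arangodb.com my-arangodb-cluster \
    -o jsonpath='{.spec.environment}'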

Furthermore, we recommend having one k8s node more than there are DBservers in your cluster, such that the deployment of a replacement DBserver can happen quickly, and not only after the maintenance work on the drained node has been completed. However, with the necessary care described below, the procedure should also work without this.

Finally, one should not run a rolling upgrade or restart operation at the time of a node drain.

Clean out a DBserver manually

In this step we clean out a DBserver manually, before issuing the kubectl drain command. Previously, we denoted this step as optional, but for safety reasons we now consider it mandatory, since it is near impossible to reliably choose the grace period long enough.

Furthermore, if this step is not performed, we must choose the grace period long enough to avoid any risk, as explained in the previous section. However, this has a disadvantage which has nothing to do with ArangoDB: We have observed that some k8s internal services like fluentd and some DNS services will always wait for the full grace period to finish a node drain. Therefore, the node drain operation will always take as long as the grace period. Since we have to choose this grace period long enough for ArangoDB to move all data on the DBserver pod away to some other node, this can take a considerable amount of time, depending on the size of the data you keep in ArangoDB.

Therefore it is more time-efficient to perform the clean-out operation beforehand. One can observe its completion, and as soon as it has completed successfully, one can issue the drain command with a relatively small grace period and still have a nearly risk-free procedure.

To clean out a DBServer manually, we have to use this API:

  POST /_admin/cluster/cleanOutServer

… and send as body a JSON document like this:

  1. {"server":"DBServer0006"}

The value of the "server" attribute should be the name of the DBserverwhich is the one in the pod which resides on the node that shall bedrained next. This uses the UI short name (ShortName in the/_admin/cluster/health API), alternatively one can use theinternal name, which corresponds to the pod name. In our example, thepod name is:

  my-arangodb-cluster-prmr-wbsq47rz-5676ed

… where my-arangodb-cluster is the ArangoDB deployment name, therefore the internal name of the DBserver is PRMR-wbsq47rz. Note that the PRMR prefix must be written in capital letters here, even though pod names are always all lower case. So, we could use the body:

  1. {"server":"PRMR-wbsq47rz"}

You can use this command line to send the clean-out request:

  curl -k https://arangodb.9hoeffer.de:8529/_admin/cluster/cleanOutServer --user root: -d '{"server":"PRMR-wbsq47rz"}'

The API call will return immediately with a body like this:

  1. {"error":false,"id":"38029195","code":202}

The given id in this response can be used to query the outcome or completion status of the clean out server job with this API:

  GET /_admin/cluster/queryAgencyJob?id=38029195

… which will return a body like this:

  {
    "error": false,
    "id": "38029195",
    "status": "Pending",
    "job": {
      "timeCreated": "2019-02-21T10:42:14.727Z",
      "server": "PRMR-wbsq47rz",
      "timeStarted": "2019-02-21T10:42:15Z",
      "type": "cleanOutServer",
      "creator": "CRDN-rxtu5pku",
      "jobId": "38029195"
    },
    "code": 200
  }

Use this command line to check progress:

  curl -k https://arangodb.9hoeffer.de:8529/_admin/cluster/queryAgencyJob?id=38029195 --user root:

It indicates that the job is still ongoing ("Pending"). As soon as the job has completed, the answer will be:

  {
    "error": false,
    "id": "38029195",
    "status": "Finished",
    "job": {
      "timeCreated": "2019-02-21T10:42:14.727Z",
      "server": "PRMR-e6hbjob1",
      "jobId": "38029195",
      "timeStarted": "2019-02-21T10:42:15Z",
      "timeFinished": "2019-02-21T10:45:39Z",
      "type": "cleanOutServer",
      "creator": "CRDN-rxtu5pku"
    },
    "code": 200
  }
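Instead of re-running the command manually, one can poll this endpoint in a loop. A minimal sketch, assuming jq and the job id from the example above:

  # Poll the agency job until it reports "Finished"; a robust script
  # should also abort if the status ever becomes "Failed".
  JOBID=38029195
  while true; do
    STATUS=$(curl -sk "https://arangodb.9hoeffer.de:8529/_admin/cluster/queryAgencyJob?id=$JOBID" \
      --user root: | jq -r '.status')
    echo "cleanOutServer job $JOBID: $STATUS"
    [ "$STATUS" = "Finished" ] && break
    sleep 10
  done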

From this moment on, the DBserver can no longer be used to move shards to. At the same time, it will no longer hold any data of the cluster.

Now the drain operation involving a node with this pod on it is completely risk-free, even with a small grace period.

Performing the drain

After all the checks described in Things to check in ArangoDB before a node drain above and the manual clean out of the DBserver have been done successfully, it is safe to perform the drain operation with a command similar to this:

  kubectl drain gke-draintest-default-pool-394fe601-glts --delete-local-data --ignore-daemonsets --grace-period=300

As described above, the options --delete-local-data for ArangoDB and --ignore-daemonsets for other services have been added. A --grace-period of 300 seconds has been chosen because for this example we are confident that all the data on our DBserver pod can be moved to a different server within 5 minutes. Note that this is not saying that 300 seconds will always be enough. Depending on how much data is stored in the pod, your mileage may vary: moving a terabyte of data can take considerably longer!

If the highly recommended step of cleaning out a DBserver manually has been performed beforehand, the grace period can easily be reduced to 60 seconds - at least from the perspective of ArangoDB, since the server is already cleaned out, so it can be dropped readily and there is still no risk.

At the same time, this now guarantees that the drain completes within approximately a minute.

Things to check after a node drain

After a node has been drained, there will usually be one of the DBservers gone from the cluster. As a replacement, another DBserver has been deployed on a different node, if there is a different node available. If not, the replacement can only be deployed when the maintenance work on the drained node has been completed and it is uncordoned again. In this latter case, one should wait until the node is back up and the replacement pod has been deployed there.
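Once the maintenance work is finished, the node can be made schedulable again and the pods watched until the replacement DBserver pod is up, for example (using the node name from the drain example above):

  # Allow pods to be scheduled on the node again
  kubectl uncordon gke-draintest-default-pool-394fe601-glts
  # Watch the pods until the replacement DBserver pod is Running
  kubectl get pods --watch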

After that, one should perform the same checks as described in Things to check in ArangoDB before a node drain above.

Finally, it is likely that the shard distribution in the "new" cluster is not balanced out. In particular, the new DBserver is not automatically used to store shards. We recommend re-balancing the shard distribution, either manually by moving shards or by using the Rebalance Shards button in the Shards tab under NODES in the web UI. This redistribution can take some time again and progress can be monitored in the UI.
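For moving a shard manually, there is the POST /_admin/cluster/moveShard API, which takes the database, collection, shard and the IDs of the source and target DBservers. The following is only a sketch: the collection and shard names are taken from the cluster inventory example above, and PRMR-newserver is a hypothetical placeholder for the ID of the new DBserver:

  # Move one shard from an old DBserver to the new one
  curl -k https://arangodb.9hoeffer.de:8529/_admin/cluster/moveShard --user root: \
    -d '{"database":"_system","collection":"c","shard":"s2010062","fromServer":"PRMR-e6hbjob1","toServer":"PRMR-newserver"}'

Like the cleanOutServer call, this should return a job id that can be monitored with the queryAgencyJob API shown earlier.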

After all this has been done, another round of checks should be done before proceeding to drain the next node.