Amazon DocumentDB Failover

In some situations, such as certain types of planned maintenance or the unlikely failure of a primary node or Availability Zone, Amazon DocumentDB (with MongoDB compatibility) detects the failure and replaces the primary node. During a failover, write downtime is minimized because the primary role fails over to one of the read replicas, so no new primary node has to be created and provisioned. Failure detection and replica promotion together ensure that you can resume writing to the new primary as soon as promotion is complete.

For failover to function, your cluster must have at least two instances: a primary and at least one replica instance.

Controlling the Failover Target

Amazon DocumentDB provides you with failover tiers as a means to control which replica instance is promoted to primary when a failover occurs.

Failover Tiers

Each replica instance is associated with a failover tier (0–15). When a failover occurs due to maintenance or an unlikely hardware failure, the primary instance fails over to a replica with the lowest-numbered priority tier. If multiple replicas share that tier, the primary fails over to the replica in that tier that is closest in size to the primary.

By setting the failover tier for a group of select replicas to 0 (the highest priority), you can ensure that a failover will promote one of the replicas in that group. You can effectively prevent specific replicas from being promoted to primary in case of a failover by assigning a low-priority tier (high number) to these replicas. This is useful in cases where specific replicas are receiving heavy use by an application and failing over to one of them would negatively impact a critical application.
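The tier rule described above can be illustrated with a small shell sketch. It is only a model of the selection order, not how Amazon DocumentDB implements it: the function name `pick_failover_candidate` and the replica names are hypothetical, and the tie-break by instance size is deliberately left out (ties fall back to input order here).

```shell
# Print the replica that tier-based failover would prefer.
# Input (stdin): one "instance-id promotion-tier" pair per line.
# Simplification: ties are resolved by input order, not by instance size.
pick_failover_candidate() {
  awk 'NR == 1 || $2 + 0 < min { min = $2 + 0; id = $1 }
       END { if (NR > 0) print id }'
}

# Hypothetical replica names and tiers; tier 0 wins.
printf '%s\n' 'replica-a 1' 'replica-b 0' 'replica-c 15' | pick_failover_candidate
# prints replica-b
```

For a real cluster, the input pairs could be taken from the DBInstanceIdentifier and PromotionTier fields that appear for each cluster member in the CLI output.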

You can set the failover tier of an instance when you create it or later by modifying it. Setting an instance's failover tier by modifying the instance does not trigger a failover.
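As a rough sketch of changing the tier on an existing instance, the wrapper below calls the AWS CLI `modify-db-instance` operation with its `--promotion-tier` option. The function name `set_failover_tier` and the range check are illustrative additions, and `--apply-immediately` is an assumption about when you want the change applied; omit it if the change should wait.

```shell
# Set the failover (promotion) tier of an existing instance.
# Usage: set_failover_tier <instance-id> <tier 0-15>
set_failover_tier() {
  local instance_id="$1"
  local tier="$2"

  # Reject anything that is not an integer in the documented 0-15 range.
  case "$tier" in
    ''|*[!0-9]*)
      echo "tier must be an integer from 0 to 15" >&2
      return 1
      ;;
  esac
  if [ "$tier" -gt 15 ]; then
    echo "tier must be an integer from 0 to 15" >&2
    return 1
  fi

  aws docdb modify-db-instance \
    --db-instance-identifier "$instance_id" \
    --promotion-tier "$tier" \
    --apply-immediately
}

# Example (instance name is a placeholder):
#   set_failover_tier sample-cluster-instance-00 0
```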

When manually initiating a failover, you have two means to control which replica instance is promoted to primary: the failover tiers as previously described, and the --target-db-instance-identifier parameter.

--target-db-instance-identifier

For testing, you can force a failover event using the failover-db-cluster operation. You can use the --target-db-instance-identifier parameter to specify which replica to promote to primary. Using the --target-db-instance-identifier parameter supersedes the failover priority tier. If you do not specify the --target-db-instance-identifier parameter, the primary failover is in accordance with the failover priority tier.

What Happens During a Failover

Failover is automatically handled by Amazon DocumentDB so that your applications can resume database operations as quickly as possible without administrative intervention.

  • If you have an Amazon DocumentDB replica instance in the same or different Availability Zone when failing over: Amazon DocumentDB flips the canonical name record (CNAME) for your instance to point at the healthy replica, which is, in turn, promoted to become the new primary. Failover typically completes within 30 seconds from start to finish.

  • If you don’t have an Amazon DocumentDB replica instance (for example, a single instance cluster): Amazon DocumentDB will attempt to create a new instance in the same Availability Zone as the original instance. This replacement of the original instance is done on a best-effort basis and may not succeed if, for example, there is an issue that is broadly affecting the Availability Zone.

Your application should retry database connections in the event of a connection loss.
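One common way to retry is exponential backoff. The following is a minimal shell sketch of that idea, not a prescribed client configuration: `retry_with_backoff` and the `RETRY_BASE_DELAY` variable are hypothetical names, and the command being retried stands in for whatever connection check your application uses. Many database drivers provide their own retry settings, which may be preferable in application code.

```shell
# Retry a command with exponential backoff.
# Usage: retry_with_backoff <max-attempts> <command> [args...]
# RETRY_BASE_DELAY (seconds, default 1) is an assumed knob for the first wait.
retry_with_backoff() {
  local max_attempts="$1"
  shift
  local delay="${RETRY_BASE_DELAY:-1}"
  local attempt=1

  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))   # double the wait before the next attempt
  done
}

# Example (check_docdb_connection is a placeholder for your own check):
#   retry_with_backoff 5 check_docdb_connection
```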

Testing Failover

A failover for a cluster promotes one of the Amazon DocumentDB replicas (read-only instances) in the cluster to be the primary instance (the cluster writer).

When the primary instance fails, Amazon DocumentDB automatically fails over to an Amazon DocumentDB replica, if one exists. You can force a failover when you want to simulate a failure of a primary instance for testing. Each instance in a cluster has its own endpoint address. Therefore, you need to clean up and re-establish any existing connections that use those endpoint addresses when the failover is complete.

To force a failover, use the failover-db-cluster operation with these parameters.

  • --db-cluster-identifier—Required. The name of the cluster to fail over.

  • --target-db-instance-identifier—Optional. The name of the instance to be promoted to the primary instance.

The following operation forces a failover of the sample-cluster cluster. It does not specify which instance to make the new primary instance, so Amazon DocumentDB chooses the instance according to failover tier priority.

For Linux, macOS, or Unix:

  aws docdb failover-db-cluster \
      --db-cluster-identifier sample-cluster

For Windows:

  aws docdb failover-db-cluster ^
      --db-cluster-identifier sample-cluster

The following operation forces a failover of the sample-cluster cluster, specifying that sample-cluster-instance is to be promoted to the primary role. (Notice "IsClusterWriter": true in the output.)

For Linux, macOS, or Unix:

  aws docdb failover-db-cluster \
      --db-cluster-identifier sample-cluster \
      --target-db-instance-identifier sample-cluster-instance

For Windows:

  aws docdb failover-db-cluster ^
      --db-cluster-identifier sample-cluster ^
      --target-db-instance-identifier sample-cluster-instance

Output from this operation looks something like the following (JSON format).

  {
      "DBCluster": {
          "HostedZoneId": "Z2SUY0A1719RZT",
          "Port": 27017,
          "EngineVersion": "3.6.0",
          "PreferredMaintenanceWindow": "thu:04:05-thu:04:35",
          "BackupRetentionPeriod": 1,
          "ClusterCreateTime": "2018-06-28T18:53:29.455Z",
          "AssociatedRoles": [],
          "DBSubnetGroup": "default",
          "MasterUsername": "master-user",
          "Engine": "docdb",
          "ReadReplicaIdentifiers": [],
          "EarliestRestorableTime": "2018-08-21T00:04:10.546Z",
          "DBClusterIdentifier": "sample-cluster",
          "ReaderEndpoint": "sample-cluster.node.us-east-1.docdb.amazonaws.com",
          "DBClusterMembers": [
              {
                  "DBInstanceIdentifier": "sample-cluster-instance",
                  "DBClusterParameterGroupStatus": "in-sync",
                  "PromotionTier": 1,
                  "IsClusterWriter": true
              },
              {
                  "DBInstanceIdentifier": "sample-cluster-instance-00",
                  "DBClusterParameterGroupStatus": "in-sync",
                  "PromotionTier": 1,
                  "IsClusterWriter": false
              },
              {
                  "DBInstanceIdentifier": "sample-cluster-instance-01",
                  "DBClusterParameterGroupStatus": "in-sync",
                  "PromotionTier": 1,
                  "IsClusterWriter": false
              }
          ],
          "AvailabilityZones": [
              "us-east-1b",
              "us-east-1c",
              "us-east-1a"
          ],
          "DBClusterParameterGroup": "default.docdb3.6",
          "Endpoint": "sample-cluster.node.us-east-1.docdb.amazonaws.com",
          "IAMDatabaseAuthenticationEnabled": false,
          "AllocatedStorage": 1,
          "LatestRestorableTime": "2018-08-22T21:57:33.904Z",
          "PreferredBackupWindow": "00:00-00:30",
          "StorageEncrypted": false,
          "MultiAZ": true,
          "Status": "available",
          "DBClusterArn": "arn:aws:rds:us-east-1:123456789012:cluster:sample-cluster",
          "VpcSecurityGroups": [
              {
                  "Status": "active",
                  "VpcSecurityGroupId": "sg-12345678"
              }
          ],
          "DbClusterResourceId": "cluster-ABCDEFGHIJKLMNOPQRSTUVWXYZ"
      }
  }
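
To confirm which instance holds the writer role after a forced failover, a script can poll the cluster description and read the IsClusterWriter flag. The sketch below is one possible approach under stated assumptions: the function names `current_writer` and `wait_for_writer`, the default of 12 attempts, and the `POLL_INTERVAL` variable are all illustrative, and the `--query` expression is a JMESPath filter over the DBClusterMembers list.

```shell
# Print the identifier of the instance that currently holds the writer role.
# Usage: current_writer <cluster-id>
current_writer() {
  aws docdb describe-db-clusters \
    --db-cluster-identifier "$1" \
    --query 'DBClusters[0].DBClusterMembers[?IsClusterWriter==`true`].DBInstanceIdentifier' \
    --output text
}

# Poll until the expected instance is the writer, or give up.
# Usage: wait_for_writer <cluster-id> <expected-instance-id> [attempts]
# POLL_INTERVAL (seconds, default 5) is an assumed knob.
wait_for_writer() {
  local cluster="$1"
  local expected="$2"
  local attempts="${3:-12}"
  local i=0

  while [ "$i" -lt "$attempts" ]; do
    if [ "$(current_writer "$cluster")" = "$expected" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "${POLL_INTERVAL:-5}"
  done
  return 1
}

# Example:
#   wait_for_writer sample-cluster sample-cluster-instance
```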