Back Up a Sharded Cluster with File System Snapshots

Overview

This document describes a procedure for taking a backup of all components of a sharded cluster. This procedure uses file system snapshots to capture a copy of the mongod instance.

Important

To capture a point-in-time backup from a sharded cluster you must stop all writes to the cluster. On a running production system, you can only capture an approximation of a point-in-time snapshot.

For more information on backups in MongoDB and backups of sharded clusters in particular, see MongoDB Backup Methods and Backup and Restore Sharded Clusters.

Considerations

Encrypted Storage Engine (MongoDB Enterprise Only)

For encrypted storage engines that use AES256-GCM encryption mode, AES256-GCM requires that every process use a unique counter block value with the key.

For encrypted storage engines configured with AES256-GCM cipher:

    • Restoring from Hot Backup
      Starting in 4.2, if you restore from files taken via "hot" backup (i.e. the mongod is running), MongoDB can detect "dirty" keys on startup and automatically roll over the database key to avoid IV (Initialization Vector) reuse.
    • Restoring from Cold Backup
      However, if you restore from files taken via "cold" backup (i.e. the mongod is not running), MongoDB cannot detect "dirty" keys on startup, and reuse of IV voids confidentiality and integrity guarantees.

Starting in 4.2, to avoid the reuse of the keys after restoring from a cold filesystem snapshot, MongoDB adds a new command-line option --eseDatabaseKeyRollover. When started with the --eseDatabaseKeyRollover option, the mongod instance rolls over the database keys configured with AES256-GCM cipher and exits.
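For example, a rollover invocation might look like the following; the configuration file path here is hypothetical, and you should start mongod with the same encryption options your deployment normally uses:

  mongod --config /etc/mongod.conf --eseDatabaseKeyRollover

The instance performs the key rollover and then exits; start it normally afterwards.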

Tip

  • In general, if using filesystem based backups for MongoDB Enterprise 4.2+, use the "hot" backup feature, if possible.
  • For MongoDB Enterprise versions 4.0 and earlier, if you use AES256-GCM encryption mode, do not make copies of your data files or restore from filesystem snapshots ("hot" or "cold").

Balancer

It is essential that you stop the balancer before capturing a backup.

If the balancer is active while you capture backups, the backup artifacts may be incomplete and/or have duplicate data, as chunks may migrate while the backup is in progress.

Precision

In this procedure, you will stop the cluster balancer and take a backup of the config database, and then take backups of each shard in the cluster using a file-system snapshot tool. If you need an exact moment-in-time snapshot of the system, you will need to stop all application writes before taking the file system snapshots; otherwise the snapshot will only approximate a moment in time.

For approximate point-in-time snapshots, you can minimize the impact on the cluster by taking the backup from a secondary member of each replica set shard.

Consistency

If the journal and data files are on the same logical volume, you can use a single point-in-time snapshot to capture a consistent copy of the data files.

If the journal and data files are on different file systems, you must use db.fsyncLock() and db.fsyncUnlock() to ensure that the data files do not change, providing consistency for the purposes of creating backups.
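As an illustration, a minimal lock-and-snapshot sequence in the mongo shell looks like the following sketch; the snapshot itself is taken outside the shell with your platform's snapshot tool:

  // Flush all pending writes and block further writes on this member.
  db.fsyncLock()

  // ... take the file system snapshot of the data and journal volumes
  // here, using your platform's snapshot tool (outside the mongo shell) ...

  // Release the lock so the member resumes accepting writes.
  db.fsyncUnlock()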

Snapshots with Amazon EBS in a RAID 10 Configuration

If your deployment depends on Amazon's Elastic Block Storage (EBS) with RAID configured within your instance, it is impossible to get a consistent state across all disks using the platform's snapshot tool. As an alternative, you can do one of the following:

  • Flush all writes to disk and create a write lock to ensure consistent state during the backup process.

If you choose this option see Back up Instances with Journal Files on Separate Volume or without Journaling.

  • Configure LVM to run and hold your MongoDB data files on top of the RAID within your system.

If you choose this option, perform the LVM backup operation described in Create a Snapshot.
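For reference, an LVM snapshot of a data volume is typically created with lvcreate; the volume group, volume name, and snapshot size below are hypothetical and depend on your layout and on how much data changes during the backup window:

  # Create a snapshot named mdb-snap01 of the mongodb logical volume.
  lvcreate --size 100M --snapshot --name mdb-snap01 /dev/vg0/mongodb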

Procedure

Disable the balancer.

Connect a mongo shell to a cluster mongos instance. Use the sh.stopBalancer() method to stop the balancer. If a balancing round is in progress, the operation waits for balancing to complete before stopping the balancer.

  use config
  sh.stopBalancer()

Starting in MongoDB 4.2, sh.stopBalancer() also disablesauto-splitting for the sharded cluster.

For more information, see the Disable the Balancer procedure.
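Before proceeding, you can verify from the same mongo shell that the balancer is in fact disabled; this check is a suggested sanity step rather than part of the formal procedure:

  sh.getBalancerState()   // returns false once the balancer is disabled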

If necessary, lock one secondary member of each replica set.

If your secondary does not have journaling enabled or its journal and data files are on different volumes, you must lock the secondary's mongod instance before capturing a backup.

If your secondary has journaling enabled and its journal and data files are on the same volume, you may skip this step.

Important

If your deployment requires this step, you must perform it on one secondary of each shard and one secondary of the config server replica set (CSRS).

Ensure that the oplog has sufficient capacity to allow these secondaries to catch up to the state of the primaries after finishing the backup procedure. See Oplog Size for more information.
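One way to gauge the available oplog window is rs.printReplicationInfo(), run in a mongo shell connected to a member of each replica set; it reports the configured oplog size and the time range the oplog currently covers, which should comfortably exceed the expected duration of the backup:

  // Reports oplog size and the log length from first to last oplog entry.
  rs.printReplicationInfo()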

Lock shard replica set secondary.

For each shard replica set in the sharded cluster, confirm that the member has replicated data up to some control point. To verify, first connect a mongo shell to the shard primary and perform a write operation with "majority" write concern on a control collection:

  use config
  db.BackupControl.findAndModify(
     {
       query: { _id: 'BackupControlDocument' },
       update: { $inc: { counter : 1 } },
       new: true,
       upsert: true,
       writeConcern: { w: 'majority', wtimeout: 15000 }
     }
  );

The operation should return the modified (or inserted) control document:

  { "_id" : "BackupControlDocument", "counter" : 1 }

Query the shard secondary member for the returned control document. Connect a mongo shell to the shard secondary to lock and use db.collection.find() to query for the control document:

  rs.slaveOk();
  use config;

  db.BackupControl.find(
     { "_id" : "BackupControlDocument", "counter" : 1 }
  ).readConcern('majority');

If the secondary member contains the latest control document, it is safe to lock the member. Otherwise, wait until the member contains the document or select a different secondary member that contains the latest control document.
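If you prefer not to re-run the query by hand, a small wait loop in the mongo shell can block until the document arrives. This is a sketch, not part of the formal procedure; expectedCounter is a hypothetical variable that should hold the counter value returned by the findAndModify on the primary:

  // Wait until the control document has replicated to this secondary.
  var expectedCounter = 1;   // hypothetical: use the counter returned by the primary
  while (db.BackupControl.find(
            { _id: 'BackupControlDocument', counter: expectedCounter }
         ).readConcern('majority').itcount() === 0) {
     print('Control document not yet visible on this secondary; waiting...');
     sleep(1000);   // mongo shell helper: pause for one second
  }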

To lock the secondary member, run db.fsyncLock() on the member:

  db.fsyncLock()

Lock config server replica set secondary.

If locking a secondary of the CSRS, confirm that the member has replicated data up to some control point. To verify, first connect a mongo shell to the CSRS primary and perform a write operation with "majority" write concern on a control collection:

  use config
  db.BackupControl.findAndModify(
     {
       query: { _id: 'BackupControlDocument' },
       update: { $inc: { counter : 1 } },
       new: true,
       upsert: true,
       writeConcern: { w: 'majority', wtimeout: 15000 }
     }
  );

The operation should return the modified (or inserted) control document:

  { "_id" : "BackupControlDocument", "counter" : 1 }

Query the CSRS secondary member for the returned control document. Connect a mongo shell to the CSRS secondary to lock and use db.collection.find() to query for the control document:

  rs.slaveOk();
  use config;

  db.BackupControl.find(
     { "_id" : "BackupControlDocument", "counter" : 1 }
  ).readConcern('majority');

If the secondary member contains the latest control document, it is safe to lock the member. Otherwise, wait until the member contains the document or select a different secondary member that contains the latest control document.

To lock the secondary member, run db.fsyncLock() on the member:

  db.fsyncLock()

Back up one of the config servers.

Note

Backing up a config server backs up the sharded cluster's metadata. You only need to back up one config server, as they all hold the same data. Perform this step against the locked CSRS secondary member.

To create a file-system snapshot of the config server, follow the procedure in Create a Snapshot.

Back up a replica set member for each shard.

If you locked a member of the replica set shards, perform this step against the locked secondary.

You may back up the shards in parallel. For each shard, create a snapshot, using the procedure in Back Up and Restore with Filesystem Snapshots.

Unlock all locked replica set members.

If you locked any mongod instances to capture the backup, unlock them.

To unlock the replica set members, use the db.fsyncUnlock() method in the mongo shell.

  db.fsyncUnlock()
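Note that db.fsyncLock() calls nest: each call increments a lock count, and the member only resumes writes when that count returns to zero. If a member may have been locked more than once, a sketch like the following drains the count; it assumes the fsyncUnlock response includes a lockCount field, as in recent server versions:

  // Keep unlocking until the reported lock count reaches zero.
  var res = db.fsyncUnlock();
  while (res.lockCount > 0) {
     res = db.fsyncUnlock();
  }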

Enable the balancer.

To re-enable the balancer, connect the mongo shell to a mongos instance and run sh.startBalancer().

  sh.startBalancer()

Starting in MongoDB 4.2, sh.startBalancer() also enables auto-splitting for the sharded cluster.
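As a final check, you can confirm that the balancer is running again:

  sh.getBalancerState()   // returns true once the balancer is re-enabled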