Troubleshoot Sharded Clusters

This page describes common strategies for troubleshooting sharded cluster deployments.

Application Servers or mongos Instances Become Unavailable

If each application server has its own mongos instance, other application servers can continue to access the database. Furthermore, mongos instances do not maintain persistent state, and they can restart and become unavailable without losing any state or data. When a mongos instance starts, it retrieves a copy of the config database and can begin routing queries.
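For example, after a mongos restarts you can confirm that it has loaded the cluster metadata and can route queries by running a couple of checks from a mongo or mongosh shell connected to that instance (a minimal sketch; output details vary by version):

  // Run against the restarted mongos.
  sh.status()                          // prints the sharding configuration this mongos has loaded
  db.adminCommand({ listShards: 1 })   // lists the shards known to the cluster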

A Single Member Becomes Unavailable in a Shard Replica Set

Replica sets provide high availability for shards. If the unavailable mongod is a primary, then the replica set will elect a new primary. If the unavailable mongod is a secondary, the primary and remaining secondaries will continue to hold all data. In a three-member replica set, even if a single member of the set experiences catastrophic failure, two other members have full copies of the data. [1]

Always investigate availability interruptions and failures. If a system is unrecoverable, replace it and create a new member of the replica set as soon as possible to replace the lost redundancy.

[1] If an unavailable secondary becomes available while it still has current oplog entries, it can catch up to the latest state of the set using the normal replication process; otherwise, it must perform an initial sync.
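To see which member is unavailable and whether a returning secondary can still catch up from the oplog, you can inspect the shard's replica set from any surviving member (a minimal sketch using standard shell helpers; the last helper name assumes a recent shell):

  rs.status()                          // reports each member's state and health
  rs.printReplicationInfo()            // shows the oplog window on this member
  rs.printSecondaryReplicationInfo()   // shows how far each secondary lags the primary (recent shells)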

All Members of a Shard Become Unavailable

In a sharded cluster, mongod and mongos instances monitor the replica sets in the sharded cluster (e.g. shard replica sets, config server replica set).

If all members of a replica set shard are unavailable, all data held in that shard is unavailable. However, the data on all other shards will remain available, and it is possible to read and write data to the other shards. Your application must be able to deal with partial results, however, and you should investigate the cause of the interruption and attempt to recover the shard as soon as possible.
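If your application can tolerate partial results while a shard is down, it can opt in to them per query; for example, in mongosh (the database and collection names here are hypothetical):

  // Ask mongos to return results from the reachable shards instead of
  // failing the query when a shard is unavailable.
  db.getSiblingDB("records").people.find({ status: "active" }).allowPartialResults()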

A Config Server Replica Set Member Becomes Unavailable

Replica sets provide high availability for the config servers. [2] If an unavailable config server is a primary, then the replica set will elect a new primary.

If the config server replica set loses its primary and cannot elect a primary, the cluster's metadata becomes read-only. You can still read and write data from the shards, but no chunk migrations or chunk splits will occur until a primary is available.
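One way to confirm that the cluster is in this state is to check, from a mongos, whether the balancer can run; these helpers assume a reasonably recent shell and server:

  sh.getBalancerState()                    // whether the balancer is enabled at all
  sh.isBalancerRunning()                   // whether a balancing round is currently in progress
  db.adminCommand({ balancerStatus: 1 })   // the same information as a server command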

Note

Distributing replica set members across two data centers provides benefit over a single data center. In a two data center distribution,

  • If one of the data centers goes down, the data is still available for reads unlike a single data center distribution.
  • If the data center with a minority of the members goes down, the replica set can still serve write operations as well as read operations.
  • However, if the data center with the majority of the members goes down, the replica set becomes read-only.

If possible, distribute members across at least three data centers. For config server replica sets (CSRS), the best practice is to distribute across three (or more depending on the number of members) data centers. If the cost of the third data center is prohibitive, one distribution possibility is to evenly distribute the data bearing members across the two data centers and store the remaining member (either a data bearing member or an arbiter to ensure an odd number of members) in the cloud if your company policy allows.
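As an illustration, a three-member config server replica set spread across three data centers could be initiated as follows; the replica set name and hostnames are placeholders:

  // Run once from a shell connected to one of the config servers, each of
  // which was started with --configsvr --replSet csReplSet.
  rs.initiate({
    _id: "csReplSet",
    configsvr: true,
    members: [
      { _id: 0, host: "cfg-dc1.example.net:27019" },   // data center 1
      { _id: 1, host: "cfg-dc2.example.net:27019" },   // data center 2
      { _id: 2, host: "cfg-dc3.example.net:27019" }    // data center 3
    ]
  })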

Note

All config servers must be running and available when you first initiate a sharded cluster.

[2] Starting in MongoDB 3.4, the use of three mirrored mongod instances (SCCC) as config servers is no longer supported.

Cursor Fails Because of Stale Config Data

A query returns the following warning when one or more of the mongos instances has not yet updated its cache of the cluster's metadata from the config database:

  could not initialize cursor across all shards because : stale config detected

This warning should not propagate back to your application. The warning will repeat until all the mongos instances refresh their caches. To force an instance to refresh its cache, run the flushRouterConfig command.
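For example, connected to the affected mongos:

  // Refresh this mongos instance's cached metadata for the whole cluster.
  db.adminCommand({ flushRouterConfig: 1 })

  // Newer releases also accept a database or namespace to refresh only part
  // of the cache; "records.people" is an example namespace.
  db.adminCommand({ flushRouterConfig: "records.people" })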

Shard Keys and Cluster Availability

The most important considerations when choosing a shard key are:

  • to ensure that MongoDB will be able to distribute data evenly among shards, and
  • to scale writes across the cluster, and
  • to ensure that mongos can isolate most queries to a specific mongod.

Furthermore:

  • Each shard should be a replica set; if a specific mongod instance fails, the replica set members will elect another to be primary and continue operation. However, if an entire shard is unreachable or fails for some reason, that data will be unavailable.
  • If the shard key allows the mongos to isolate most operations to a single shard, then the failure of a single shard will only render some data unavailable.
  • If your shard key distributes data required for every operation throughout the cluster, then the failure of the entire shard will render the entire cluster unavailable.

In essence, this concern for reliability simply underscores the importance of choosing a shard key that isolates query operations to a single shard.
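As an illustration, sharding a hypothetical records.people collection on a compound key that most queries include keeps those queries targeted at a single shard:

  // Hypothetical database and collection; run from a mongos.
  sh.enableSharding("records")
  sh.shardCollection("records.people", { zipcode: 1, name: 1 })

  // A query that includes the zipcode prefix can be routed to a single shard:
  db.getSiblingDB("records").people.find({ zipcode: "10001", name: "Alice" })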

Config Database String Error

Changed in version 3.2.

Starting in MongoDB 3.2, config servers can be deployed as replica sets. The mongos instances for the sharded cluster must specify the same config server replica set name but can specify the hostname and port of different members of the replica set.
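For example, two mongos instances might be started as follows; the replica set name csReplSet and the hostnames are placeholders:

  mongos --configdb csReplSet/cfg1.example.net:27019,cfg2.example.net:27019
  mongos --configdb csReplSet/cfg2.example.net:27019,cfg3.example.net:27019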

Starting in 3.4, the use of the deprecated mirrored mongod instances as config servers (SCCC) is no longer supported. Before you can upgrade your sharded clusters to 3.4, you must convert your config servers from SCCC to CSRS.

To convert your config servers from SCCC to CSRS, see the MongoDB 3.4 manual Upgrade Config Servers to Replica Set.

With earlier versions of MongoDB sharded clusters that use the topology of three mirrored mongod instances for config servers, mongos instances in a sharded cluster must specify an identical configDB string.

Avoid Downtime when Moving Config Servers

Use CNAMEs to identify your config servers to the cluster so that you can rename and renumber your config servers without downtime.

moveChunk commit failed Error

At the end of a chunk migration, the shard must connect to the config database to update the chunk's record in the cluster metadata. If the shard fails to connect to the config database, MongoDB reports the following error:

  ERROR: moveChunk commit failed: version is at <n>|<nn> instead of <N>|<NN>
  ERROR: TERMINATING

When this happens, the primary member of the shard's replica set then terminates to protect data consistency. If a secondary member can access the config database, data on the shard becomes accessible again after an election.

The user will need to resolve the chunk migration failure independently. If you encounter this issue, contact the MongoDB User Group or MongoDB Support.