FAQ: MongoDB Diagnostics

This document provides answers to common diagnostic questions and issues.

If you don’t find the answer you’re looking for, check the complete list of FAQs or post your question to the MongoDB User Mailing List.

Where can I find information about a mongod process that stopped running unexpectedly?

If mongod shuts down unexpectedly on a UNIX or UNIX-based platform, and if mongod fails to log a shutdown or error message, then check your system logs for messages pertaining to MongoDB. For example, for logs located in /var/log/messages, use the following commands:

  sudo grep mongod /var/log/messages
  sudo grep score /var/log/messages
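
If the operating system terminated the process, the kernel's out-of-memory (OOM) killer typically records a message in the kernel log. As an additional, hedged sketch (the exact message wording varies by kernel and distribution), you can also search the kernel ring buffer or the systemd journal:

  # Search the kernel ring buffer for OOM killer activity
  sudo dmesg | grep -i "out of memory"
  # On systemd-based systems, search the kernel messages in the journal
  sudo journalctl -k | grep -i -E "killed process|out of memory"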

Does TCP keepalive time affect MongoDB Deployments?

If you experience network timeouts or socket errors in communication between clients and servers, or between members of a sharded cluster or replica set, check the TCP keepalive value for the affected systems.

Many operating systems set this value to 7200 seconds (two hours) by default. For MongoDB, you will generally experience better results with a shorter keepalive value, on the order of 120 seconds (two minutes).

If your MongoDB deployment experiences keepalive-related issues, you must alter the keepalive value on all affected systems. This includes all machines running mongod or mongos processes and all machines hosting client processes that connect to MongoDB.

Adjusting the TCP keepalive value:

  • To view the keepalive setting on Linux, use one of the following commands:

  sysctl net.ipv4.tcp_keepalive_time

Or:

  cat /proc/sys/net/ipv4/tcp_keepalive_time

The value is measured in seconds.

Note

Although the setting name includes ipv4, the tcp_keepalive_time value applies to both IPv4 and IPv6.

  • To change the tcp_keepalive_time value, you can use one of the following commands, supplying a <value> in seconds:

  sudo sysctl -w net.ipv4.tcp_keepalive_time=<value>

Or:

  echo <value> | sudo tee /proc/sys/net/ipv4/tcp_keepalive_time

These operations do not persist across system reboots. To persist the setting, add the following line to /etc/sysctl.conf, supplying a <value> in seconds, and reboot the machine:

  net.ipv4.tcp_keepalive_time = <value>

Keepalive values greater than 300 seconds (5 minutes) will be overridden on mongod and mongos sockets and set to 300 seconds.
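
To confirm the value a running system is actually applying to established connections, one option (a sketch assuming the iproute2 ss utility is available and that MongoDB uses the default port 27017) is to inspect the socket timers:

  # Show timer information, including keepalive countdowns, for connections involving port 27017
  ss -o state established '( sport = :27017 or dport = :27017 )'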

  • To view the keepalive setting on Windows, issue the following command:
  reg query HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters /v KeepAliveTime

The registry value is not present by default. The system default, used if the value is absent, is 7200000 milliseconds or 0x6ddd00 in hexadecimal.

  • To change the KeepAliveTime value, use the following command in an Administrator Command Prompt, where <value> is expressed in hexadecimal (e.g. 120000 is 0x1d4c0):

  reg add HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\ /t REG_DWORD /v KeepAliveTime /d <value>

Windows users should consider the Windows Server Technet Article on KeepAliveTime for more information on setting keepalive for MongoDB deployments on Windows systems. Keepalive values greater than or equal to 600000 milliseconds (10 minutes) will be ignored by mongod and mongos.

  • To view the keepalive setting on macOS, issue the following command:
  sysctl net.inet.tcp.keepidle

The value is measured in milliseconds.

  • To change the net.inet.tcp.keepidle value, you can use the following command, supplying a <value> in milliseconds:

  sudo sysctl net.inet.tcp.keepidle=<value>

This operation does not persist across system reboots, and must be set each time your system reboots. See your operating system’s documentation for instructions on setting this value persistently. Keepalive values greater than or equal to 600000 milliseconds (10 minutes) will be ignored by mongod and mongos.

Note

In macOS 10.15 Catalina, Apple no longer allows for configuration of the net.inet.tcp.keepidle option.

You will need to restart mongod and mongos processes for new system-wide keepalive settings to take effect.

Why does MongoDB log so many “Connection Accepted” events?

If you see a very large number of connection and re-connection messages in your MongoDB log, then clients are frequently connecting and disconnecting to the MongoDB server. This is normal behavior for applications that do not use request pooling, such as CGI. Consider using FastCGI, an Apache Module, or some other kind of persistent application server to decrease the connection overhead.

If these connections do not impact your performance, you can use the run-time quiet option or the command-line option --quiet to suppress these messages from the log.
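
For example, assuming a typical configuration file location (the path below is illustrative), you can pass the option on the command line; setting systemLog.quiet to true in the configuration file has the same effect:

  # Start mongod with connection-related log messages suppressed
  mongod --config /etc/mongod.conf --quiet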

What tools are available for monitoring MongoDB?

Starting in version 4.0, MongoDB offers free Cloud monitoring for standalones and replica sets. Free monitoring provides information about your deployment, including:

  • Operation Execution Times
  • Memory Usage
  • CPU Usage
  • Operation Counts

For more information, see Free Monitoring.
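
As a sketch, free monitoring can be enabled and inspected from the shell; the commands below assume mongosh, but the legacy mongo shell provides the same helpers:

  # Enable free cloud monitoring, then print its status (including the URL of the metrics page)
  mongosh --eval 'db.enableFreeMonitoring()'
  mongosh --eval 'db.getFreeMonitoringStatus()'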

The MongoDB Cloud Manager and Ops Manager, an on-premise solution available in MongoDB Enterprise Advanced, include monitoring functionality, which collects data from running MongoDB deployments and provides visualization and alerts based on that data.

For more information, see also the MongoDB Cloud Manager documentation and Ops Manager documentation.

A full list of third-party tools is available as part of the Monitoring for MongoDB documentation.

Memory Diagnostics for the WiredTiger Storage Engine

Must my working set size fit RAM?

No.

If the cache does not have enough space to load additional data, WiredTiger evicts pages from the cache to free up space.

Note

The storage.wiredTiger.engineConfig.cacheSizeGB limits the size of the WiredTiger internal cache. The operating system will use the available free memory for filesystem cache, which allows the compressed MongoDB data files to stay in memory. In addition, the operating system will use any free RAM to buffer file system blocks and file system cache.

To accommodate the additional consumers of RAM, you may have to decrease the WiredTiger internal cache size.

The default WiredTiger internal cache size value assumes that there is a single mongod instance per machine. If a single machine contains multiple MongoDB instances, then you should decrease the setting to accommodate the other mongod instances.

If you run mongod in a container (e.g. lxc, cgroups, Docker, etc.) that does not have access to all of the RAM available in a system, you must set storage.wiredTiger.engineConfig.cacheSizeGB to a value less than the amount of RAM available in the container. The exact amount depends on the other processes running in the container. See memLimitMB.
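
For example, a minimal sketch with the official mongo Docker image (the image tag and values are illustrative): if the container is limited to 2 GB of RAM, keep the WiredTiger cache well below that limit:

  # Limit the container to 2 GB of RAM and cap the WiredTiger internal cache at 0.5 GB
  docker run -d --name mongod-capped --memory 2g mongo:4.4 --wiredTigerCacheSizeGB 0.5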

To see statistics on the cache and eviction, use the serverStatus command. The wiredTiger.cache field holds the information on the cache and eviction.

  ...
  "wiredTiger" : {
    ...
    "cache" : {
      "tracked dirty bytes in the cache" : <num>,
      "bytes currently in the cache" : <num>,
      "maximum bytes configured" : <num>,
      "bytes read into cache" : <num>,
      "bytes written from cache" : <num>,
      "pages evicted by application threads" : <num>,
      "checkpoint blocked page eviction" : <num>,
      "unmodified pages evicted" : <num>,
      "page split during eviction deepened the tree" : <num>,
      "modified pages evicted" : <num>,
      "pages selected for eviction unable to be evicted" : <num>,
      "pages evicted because they exceeded the in-memory maximum" : <num>,
      "pages evicted because they had chains of deleted items" : <num>,
      "failed eviction of pages that exceeded the in-memory maximum" : <num>,
      "hazard pointer blocked page eviction" : <num>,
      "internal pages evicted" : <num>,
      "maximum page size at eviction" : <num>,
      "eviction server candidate queue empty when topping up" : <num>,
      "eviction server candidate queue not empty when topping up" : <num>,
      "eviction server evicting pages" : <num>,
      "eviction server populating queue, but not evicting pages" : <num>,
      "eviction server unable to reach eviction goal" : <num>,
      "pages split during eviction" : <num>,
      "pages walked for eviction" : <num>,
      "eviction worker thread evicting pages" : <num>,
      "in-memory page splits" : <num>,
      "percentage overhead" : <num>,
      "tracked dirty pages in the cache" : <num>,
      "pages currently held in the cache" : <num>,
      "pages read into cache" : <num>,
      "pages written from cache" : <num>
    },
    ...

For an explanation of some key cache and eviction statistics, such as wiredTiger.cache.bytes currently in the cache and wiredTiger.cache.tracked dirty bytes in the cache, see wiredTiger.cache.
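
For example, one way to pull just a few of these fields from a running instance (a sketch assuming mongosh and default connection settings):

  # Print the configured maximum and the bytes currently held in the WiredTiger cache
  mongosh --quiet --eval '
    const cache = db.serverStatus().wiredTiger.cache;
    printjson({
      "maximum bytes configured": cache["maximum bytes configured"],
      "bytes currently in the cache": cache["bytes currently in the cache"],
      "tracked dirty bytes in the cache": cache["tracked dirty bytes in the cache"]
    })'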

To adjust the size of the WiredTiger internal cache, see storage.wiredTiger.engineConfig.cacheSizeGB and --wiredTigerCacheSizeGB. Avoid increasing the WiredTiger internal cache size above its default value.
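
For instance, a sketch of launching mongod with a smaller cache (the value and dbpath below are illustrative; the storage.wiredTiger.engineConfig.cacheSizeGB configuration-file setting accepts the same value):

  # Start mongod with a 0.25 GB WiredTiger internal cache
  mongod --dbpath /data/db --wiredTigerCacheSizeGB 0.25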

How do I calculate how much RAM I need for my application?

With WiredTiger, MongoDB utilizes both the WiredTiger internal cache and the filesystem cache.

Starting in MongoDB 3.4, the default WiredTiger internal cache size is the larger of either:

  • 50% of (RAM - 1 GB), or
  • 256 MB.

For example, on a system with a total of 4 GB of RAM the WiredTiger cache will use 1.5 GB of RAM (0.5 * (4 GB - 1 GB) = 1.5 GB). Conversely, a system with a total of 1.25 GB of RAM will allocate 256 MB to the WiredTiger cache because that is more than half of the total RAM minus one gigabyte (0.5 * (1.25 GB - 1 GB) = 128 MB < 256 MB).
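
The same arithmetic can be reproduced with a short shell sketch (the RAM figure is an illustrative input):

  # Compute the default WiredTiger internal cache size for a given amount of RAM (in GB)
  ram_gb=4
  awk -v ram="$ram_gb" 'BEGIN {
      cache = 0.5 * (ram - 1)          # 50% of (RAM - 1 GB)
      if (cache < 0.25) cache = 0.25   # floor of 256 MB
      printf "default WiredTiger cache: %.2f GB\n", cache
  }'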

Note

In some instances, such as when running in a container, the database can have memory constraints that are lower than the total system memory. In such instances, this memory limit, rather than the total system memory, is used as the maximum RAM available.

To see the memory limit, see hostInfo.system.memLimitMB.
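
As a sketch (assuming mongosh), the limit can also be read directly from the hostInfo command:

  # Report the memory limit visible to mongod, in megabytes
  mongosh --quiet --eval 'print(db.hostInfo().system.memLimitMB)'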

By default, WiredTiger uses Snappy block compression for all collections and prefix compression for all indexes. Compression defaults are configurable at a global level and can also be set on a per-collection and per-index basis during collection and index creation.

Different representations are used for data in the WiredTiger internal cache versus the on-disk format:

  • Data in the filesystem cache is the same as the on-disk format, including benefits of any compression for data files. The filesystem cache is used by the operating system to reduce disk I/O.
  • Indexes loaded in the WiredTiger internal cache have a different data representation to the on-disk format, but can still take advantage of index prefix compression to reduce RAM usage. Index prefix compression deduplicates common prefixes from indexed fields.
  • Collection data in the WiredTiger internal cache is uncompressed and uses a different representation from the on-disk format. Block compression can provide significant on-disk storage savings, but data must be uncompressed to be manipulated by the server.

Via the filesystem cache, MongoDB automatically uses all free memory that is not used by the WiredTiger cache or by other processes.

To adjust the size of the WiredTiger internal cache, see storage.wiredTiger.engineConfig.cacheSizeGB and --wiredTigerCacheSizeGB. Avoid increasing the WiredTiger internal cache size above its default value.

Note

The storage.wiredTiger.engineConfig.cacheSizeGB limits the size of the WiredTiger internal cache. The operating system will use the available free memory for filesystem cache, which allows the compressed MongoDB data files to stay in memory. In addition, the operating system will use any free RAM to buffer file system blocks and file system cache.

To accommodate the additional consumers of RAM, you may have to decrease the WiredTiger internal cache size.

The default WiredTiger internal cache size value assumes that there is a single mongod instance per machine. If a single machine contains multiple MongoDB instances, then you should decrease the setting to accommodate the other mongod instances.

If you run mongod in a container (e.g. lxc, cgroups, Docker, etc.) that does not have access to all of the RAM available in a system, you must set storage.wiredTiger.engineConfig.cacheSizeGB to a value less than the amount of RAM available in the container. The exact amount depends on the other processes running in the container. See memLimitMB.

To view statistics on the cache and eviction rate, see the wiredTiger.cache field returned from the serverStatus command.

Sharded Cluster Diagnostics

The two most important factors in maintaining a successful sharded cluster are:

  • choosing an appropriate shard key, and
  • ensuring that the cluster has sufficient capacity for current and future operations.

You can prevent most issues encountered with sharding by ensuring that you choose the best possible shard key for your deployment and that you always add additional capacity to your cluster well before the current resources become saturated. Continue reading for specific issues you may encounter in a production environment.

In a new sharded cluster, why does all data remain on one shard?

Your cluster must have sufficient data for sharding to make sense. Sharding works by migrating chunks between the shards until each shard has roughly the same number of chunks.

The default chunk size is 64 megabytes. MongoDB will not begin migrations until the imbalance of chunks in the cluster exceeds the migration threshold. This behavior helps prevent unnecessary chunk migrations, which can degrade the performance of your cluster as a whole.

If you have just deployed a sharded cluster, make sure that you have enough data to make sharding effective. If you do not have sufficient data to create more than eight 64 megabyte chunks, then all data will remain on one shard. Either lower the chunk size setting, or add more data to the cluster.
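
For example, a sketch of lowering the cluster-wide chunk size to 32 megabytes by updating the settings collection in the config database (run against a mongos; the value is illustrative):

  # Lower the default chunk size for the cluster to 32 MB
  mongosh --quiet --eval '
    db.getSiblingDB("config").settings.updateOne(
      { _id: "chunksize" },
      { $set: { value: 32 } },
      { upsert: true }
    )'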

As a related problem, the system will split chunks only on inserts or updates, which means that if you configure sharding and do not continue to issue insert and update operations, the database will not create any chunks. You can either wait until your application inserts data or split chunks manually.
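
As a sketch, you can split a chunk manually with the sh.splitFind() or sh.splitAt() helpers (the namespace and query below are illustrative):

  # Split the chunk containing the matching document at its median point
  mongosh --quiet --eval 'sh.splitFind("records.people", { zipcode: "63109" })'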

Finally, if your shard key has a low cardinality, MongoDB may not be able to create sufficient splits among the data.

Why would one shard receive a disproportionate amount of traffic in a sharded cluster?

In some situations, a single shard or a subset of the cluster will receive a disproportionate portion of the traffic and workload. In almost all cases this is the result of a shard key that does not effectively allow write scaling.

It’s also possible that you have “hot chunks.” In this case, you may be able to solve the problem by splitting and then migrating parts of these chunks.

In the worst case, you may have to consider re-sharding your data and choosing a different shard key to correct this pattern.

What can prevent a sharded cluster from balancing?

If you have just deployed your sharded cluster, you may want to consider the troubleshooting suggestions for a new cluster where data remains on a single shard.

If the cluster was initially balanced, but later developed an uneven distribution of data, consider the following possible causes:

  • You have deleted or removed a significant amount of data from the cluster. If you have added additional data, it may have a different distribution with regards to its shard key.
  • Your shard key has low cardinality and MongoDB cannot split the chunks any further.
  • Your data set is growing faster than the balancer can distribute data around the cluster. This is uncommon and typically is the result of:
    • a balancing window that is too short, given the rate of data growth.
    • an uneven distribution of write operations that requires more data migration. You may have to choose a different shard key to resolve this issue.
    • poor network connectivity between shards, which may lead to chunk migrations that take too long to complete. Investigate your network configuration and interconnections between shards.

Why do chunk migrations affect sharded cluster performance?

If migrations impact your cluster or application’s performance, consider the following options, depending on the nature of the impact:

  • If migrations only interrupt your clusters sporadically, you can limit the balancing window to prevent balancing activity during peak hours (one way to configure the window is sketched at the end of this section). Ensure that there is enough time remaining to keep the data from becoming out of balance again.
  • If the balancer is always migrating chunks to the detriment of overall cluster performance:
    • You may want to attempt decreasing the chunk size to limit the size of the migration.
    • Your cluster may be over capacity, and you may want to attempt to add one or two shards to the cluster to distribute load.

It’s also possible that your shard key causes your application to direct all writes to a single shard. This kind of activity pattern can require the balancer to migrate most data soon after writing it. Consider redeploying your cluster with a shard key that provides better write scaling.
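
As referenced above, one way to restrict the balancing window is to update the balancer document in the config database (a sketch run against a mongos; the window times are illustrative):

  # Allow chunk migrations only between 23:00 and 06:00
  mongosh --quiet --eval '
    db.getSiblingDB("config").settings.updateOne(
      { _id: "balancer" },
      { $set: { activeWindow: { start: "23:00", stop: "06:00" } } },
      { upsert: true }
    )'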