Operating Consul at Scale

This page describes how Consul’s architecture impacts its performance with large scale deployments and shares recommendations for operating Consul in production at scale.

Overview

Consul is a distributed service networking system deployed as a centralized set of servers that coordinate network activity using sidecars that are located alongside user workloads. When Consul is used for its service mesh capabilities, servers also generate configurations for Envoy proxies that run alongside service instances. These proxies support service mesh capabilities like end-to-end mTLS and progressive deployments.

Consul can be deployed in either a single datacenter or across multiple datacenters by establishing WAN federation or peering connections. In this context, a datacenter refers to a named environment whose hosts can communicate with low networking latency. Typically, users map a Consul datacenter to a cloud provider region such as AWS us-east-1 or Azure East US.

To ensure consistency and high availability, Consul servers share data using the Raft consensus protocol. When persisting data, Consul uses BoltDB to store Raft logs and a custom file format for state snapshots. For more information, refer to Consul architecture.

General deployment recommendations

This section provides general configuration and monitoring recommendations for operating Consul at scale.

Data plane resiliency

To make service-to-service communication resilient against outages and failures, we recommend spreading multiple service instances for a service across fault domains. Resilient deployments spread services across multiples of the following:

  • Infrastructure-level availability zones
  • Runtime platform instances, such as Kubernetes clusters
  • Consul datacenters

In the event that any individual domain experiences a failure, service failover ensures that healthy instances in other domains remain discoverable. Consul automatically provides service failover between instances within a single admin partition or datacenter.

Service failover across Consul datacenters must be configured in the datacenters before you can use it. Use one of the following methods to configure failover across datacenters:

Control plane resiliency

When a large number services are deployed to a single datacenter, the Consul servers may experience slower network performance. To make the control plane more resilient against slowdowns and outages, limit the size of individual datacenters by spreading deployments across availability zones, runtimes, and datacenters.

Datacenter size

To ensure resiliency, we recommend limiting deployments to a maximum of 5,000 Consul client agents per Consul datacenter. There are two reasons for this recommendation:

  1. Blast radius reduction: When Consul suffers a server outage in a datacenter or region, blast radius refers to the number of Consul clients or dataplanes attached to that datacenter that can no longer communicate as a result. We recommend limiting the total number of clients attached to a single Consul datacenter in order to reduce the size of its blast radius. Even though Consul is able to run clusters with 10,000 or more nodes, it takes longer to bring larger deployments back online after an outage, which impacts time to recovery.
  2. Agent gossip management: Consul agents use the gossip protocol to share membership information in a gossip pool. By default, all client agents in a single Consul datacenter are in a single gossip pool. Whenever an agent joins or leaves the gossip pool, the other agents propagate that event throughout the pool. If a Consul datacenter experiences agent churn, or a consistently high rate of agents joining and leaving a single pool, cluster performance may be affected by gossip messages being generated faster than they can be transmitted. The result is an ever-growing message queue.

To mitigate these risks, we recommend a maximum of 5,000 Consul client agents in a single gossip pool. There are several strategies for making gossip pools smaller:

  1. Run exactly one Consul agent per host in the infrastructure.
  2. Break up the single Consul datacenter into multiple smaller datacenters.
  3. Enterprise users can define network segments to divide the single gossip pool in the Consul datacenter into multiple smaller pools.

If appropriate for your use case, we recommend breaking up a single Consul datacenter into multiple smaller datacenters. Running multiple datacenters reduces your network’s blast radius more than applying network segments.

Be aware that the number 5,000 is a heuristic for deployments. The number of agents you deploy per datacenter is limited by performance, not Consul itself. Because gossip stability risk is determined by the rate of agent churn rather than the number of nodes, a gossip pool with mostly static nodes may be able to operate effectively with more than 5,000 agents. Meanwhile, a gossip pool with highly dynamic agents, such as spot fleet instances and serverless functions where 10% of agents are replaced each day, may need to be smaller than 5,000 agents.

For additional information about the specifc tests we conducted on Consul deployments at scale in order to generate these recommendations, refer to Consul Scale Test Report to Observe Gossip Stability on the HashiCorp blog.

For most use cases, a limit of 5,000 agents is appropriate. When the consul.serf.queue.Intent metric is consistently high, it is an indication that the gossip pool cannot keep up with the sustained level of churn. In this situation, reduce the churn by lowering the number agents per datacenter.

Kubernetes-specific guidance

In Kubernetes, even though it is possible to deploy Consul agents inside pods alongside services running in the same pod, this unsupported deployment pattern has known performance issues at scale. At large volumes, pod registration and deregistration in Kubernetes causes gossip instability that can lead to cascading failures as services are marked unhealthy, resulting in further cluster churn.

In Consul v1.14 and higher, Consul on Kubernetes does not need to run client agents on every node in a cluster for service discovery and service mesh. This deployment configuration lowers Consul’s resource usage in the data plane, but requires additional resources in the control plane to process xDS resources. To learn more, refer to simplified service mesh with Consul Dataplane.

If you use Kubernetes and Consul as a backend for Vault: Use Vault’s integrated storage backend instead of Consul. A runtime dependency conflict prevents Consul dataplanes from being compatible with Vault. If you need to use Consul v1.14 and higher as a backend for Vault in your Kubernetes deployment, create a separate Consul datacenter that is not federated or peered to your other Consul servers. You can size this datacenter according to your needs and use it exclusively for backend storage for Vault.

Consul server deployment recommendations

Consul server agents are an important part of Consul’s architecture. This section summarizes the differences between running managed and self-managed servers, as well as recommendations on the number of servers to run, how to deploy servers across redundancy zones, hardware requirements, and cloud provider integrations.

Consul server runtimes

Consul servers can be deployed on a few different runtimes:

  • HashiCorp Cloud Platform (HCP) Consul (Managed). These Consul servers are deployed in a hosted environment managed by HCP. To get started with HCP Consul servers in Kubernetes or VM deployments, refer to the Deploy HCP Consul tutorial.
  • VMs or bare metal servers (Self-managed). To get started with Consul on VMs or bare metal servers, refer to the Deploy Consul server tutorial. For a full list of configuration options, refer to Agents Overview.
  • Kubernetes (Self-managed). To get started with Consul on Kubernetes, refer to the Deploy Consul on Kubernetes tutorial.
  • Other container environments, including Docker, Rancher, and Mesos (Self-managed).

When operating Consul at scale, self-managed VM or bare metal server deployments offer the most flexibility. Some Consul Enterprise features that can enhance fault tolerance and read scalability, such as redundancy zones and read replicas, are not available to server agents on Kubernetes runtimes. To learn more, refer to Consul Enterprise feature availability by runtime.

Number of Consul servers

Determining the number of Consul servers to deploy on your network has two key considerations:

  1. Fault tolerance: The number of server outages your deployment can tolerate while maintaining quorum. Additional servers increase a network’s fault tolerance.
  2. Performance scalability: To handle more requests, additional servers produce latency and slow the quorum process. Having too many servers impedes your network instead of helping it.

Fault tolerance should determine your initial decision for how many Consul server agents to deploy. Our recommendation for the number of servers to deploy depends on whether you have access to Consul Enterprise redundancy zones:

  • With redundancy zones: Deploy 6 Consul servers across 3 availability zones. This deployment provides the performance of a 3 server deployment with the fault tolerance of a 7 server deployment.
  • Without redundancy zones: Deploy 5 Consul servers across 3 availability zones. All 5 servers should be voting servers, not read replicas.

For more details, refer to Improving Consul Resilience.

Server requirements

To ensure your server nodes are a sufficient size, we recommend reviewing hardware sizing for Consul servers. If your network needs to handle heavy workloads, refer to our recommendations in read-heavy workload sources and solutions and write-heavy workload sources and solutions.

File descriptors

Consul’s agents use network sockets for gossip communication with the other nodes and agents. As a result, servers create file descriptors for connections from clients, connections from other servers, watch handlers, health checks, and log files. For write-heavy clusters, you must increase the size limit for the number of file descriptions from the default value, 1024. We recommend using a number that is two times higher than your expected number of clients in the cluster.

Auto scaling groups

Auto scaling groups (ASGs) are infrastructure associations in cloud providers used to ensure a specific number of replicas are available for a deployment. When using ASGs for Consul servers, there are specific requirements and processes for bootstrapping Raft and maintaining quorum.

We recommend using the bootstrap-expect command-line flag during cluster creation. However, if you spawn new servers to add to a cluster or upgrade servers, do not configure them to automatically bootstrap. If bootstrap-expect is set on these replicas, it is possible for them to create a separate Raft system, which causes a split brain and leads to errors and general cluster instability.

NUMA architecture awareness

Some cloud providers offer extremely large instance sizes with Non-Uniform Memory Access (NUMA) architectures. Because the Go runtime is not NUMA aware, Consul is not NUMA aware. Even though you can run Consul on NUMA architecture, it will not take advantage of the multiprocessing capabilities.

Consistency modes

Consul offers different consistency modes for both its DNS and HTTP APIs.

DNS

We strongly recommend using stale consistency mode for DNS lookups to optimize for performance over consistency when operating at scale. It is enabled by default and configured with dns_config.allow_stale.

We also recommend that you do not configure dns_config.max_stale to limit the staleness of DNS responses, as it may result in a prolonged outage if your Consul servers become overloaded. If bounded result consistency is required by a service, consider modifying the service to use consistent service discovery HTTP API queries instead of DNS lookups.

Avoid using dns_config.use_cache when operating Consul at scale. Because the Consul agent cache allocates memory for each requested route and each allocation can live up to 3 days, severe memory issues may occur. To implement DNS caching, we instead recommend that you configure TTLs for services and nodes to enable the DNS client to cache responses from Consul.

HTTP API

By default, all HTTP API read requests use the default consistency mode unless overridden on a per-request basis. We do not recommend changing the default consistency mode for HTTP API requests.

We also recommend that you do not configure http_config.discovery_max_stale to limit the staleness of HTTP responses.

Resource usage and metrics recommendations

While operating Consul, monitor the CPU load on the Consul server agents and use metrics from agent telemetry to figure out the cause. Procedures for mitigating heavy resource usage depend on whether the load is caused by read operations, write operations, or Consul’s consensus protocol.

Read-heavy workload sources and solutions

The highest CPU load usually belongs to the current leader. If the CPU load is high, request load is likely a major contributor. Check the following server health metrics:

  • consul.rpc.* - Traditional RPC metrics. The most relevant metrics for understanding server CPU load in read-heavy workloads are consul.rpc.query and consul.rpc.queries_blocking.
  • consul.grpc.server.* - Metrics for the number of streams being processed by the server.
  • consul.xds.server.* - Metrics for the Envoy xDS resources being processed by the server. In Consul v1.14 and higher, these metrics have the potential to become a significant source of read load. Refer to Consul dataplanes for more information.

Depending on your needs, choose one of the following strategies to mitigate server CPU load:

  • The fastest mitigation strategy is to vertically scale servers. However, this strategy increases compute costs and does not scale indefinitely.
  • The most effective long term mitigation strategy is to use stale consistency mode for as many read requests as possible. In Consul v1.12 and higher, operators can use the consul.rpc.server.call metric to identify the most frequent type of read requests made to the Consul servers. Cross reference the results with each endpoint’s HTTP API documentation and use stale consistency for endpoints that support it.
  • If most read requests already use stale consistency mode and you still need to reduce your request load, add more non-voting servers to your deployment. You can use either redundancy zones or read replicas to scale reads without impacting write latency. We recommend adding more servers to redundancy zones because they improve both fault tolerance and stale read scalability.
  • In Consul v1.14 and higher, servers handle Envoy XDS streams for Consul Dataplane deployments in stale consistency mode. As a result, server consistency mode is not configurable. Use the consul.xds.server.* metrics to identify issues related to XDS streams.

Write-heavy workload sources and solutions

Consul is write-limited by disk I/O. For write-heavy workloads, we recommend using NVMe disks.

As a starting point, you should make sure your hardware meets the requirements for large size server clusters, which has 7500+ IOps and 250+ MB/s disk throughput. IOps should be around 5 to 10 times the expected write rate. Conduct further analysis around disk sizing and your expected write rates to understand your network’s specific needs.

If you use network storage, such as AWS EBS, we recommend provisioned I/O volumes. While general purpose volumes function properly, their burstable IOps make it harder to capacity plan. A small peak in writes may not trigger alerts, but as usage grows you may reach a point where the burst limit runs out and workload performance worsens.

For more information, refer to the server performance read/write tuning.

Raft database performance sources and solutions

Consul servers use the Raft consensus protocol to maintain a consistent and fault-tolerant state. Raft stores most Consul data in a MemDB database, which is an in-memory database with indexing. In order to tolerate restarts and power outages, Consul writes Raft logs to disk using BoltDB. Refer to Agent telemetry for more information on metrics for detecting write health.

To monitor overall transaction performance, check for spikes in the Transaction timing metrics. You can also use the Raft replication capacity issues metrics to monitor Raft log snapshots and restores, as spikes and longer durations can be symptoms of overall write and disk contention issues.

In Consul v1.11 and higher, you can also monitor Raft performance with the consul.raft.boltdb.* metrics. We recommend monitoring consul.raft.boltdb.storeLogs for increased activity above normal operating patterns.

Refer to Consul agent telemetry for more information on agent metrics and how to use them.

Raft database size

Raft writes logs to BoltDB, which is designed as a single grow-only file. As a result, if you add 1GB of log entries and then you take a snapshot, only a small number of recent log entries may appear in the file. However, the actual file on disk never shrinks smaller than the 1GB size it grew.

If you need to reclaim disk space, use the bbolt CLI to copy the data to a new database and repoint to the new database in the process. However, be aware that the bbolt compact command requires the database to be offline while being pointed to the new database.

In many cases, including in large clusters, disk space is not a primary concern because Raft logs rarely grow larger than a small number of GiB. However, an inflated file with lots of free space significantly degrades write performance overall due to freelist management.

After they are written to disk, Raft logs are eventually captured in a snapshot and log nodes are removed from BoltDB. BoltDB keeps track of the pages for the removed nodes in its freelist. BoltDB also writes this freelist to disk every time there is a Raft write. When the Raft log grows large quickly and then gets truncated, the size of the freelist can become very large. In the worst case reported to us, the freelist was over 10MB. When this large freelist is written to disk on every Raft commit, the result is a large write amplification for what should be a small Raft commit.

To figure out if a Consul server’s disk performance issues are the result of BoldDB’s freelist, try the following strategies:

  • Compare network bandwidth inbound to the server against disk write bandwidth. If disk write bandwidth is greater than or equal to 5 times the inbound network bandwidth, the disks are likely experiencing freelist management performance issues. While BoltDB freelist may cause problems at ratios lower than 5 to 1, high write bandwidth to inbound bandwidth ratios are a reliable indicator that BoltDB freelist is causing a problem.
  • Use the consul.raft.leader.dispatchLog metric to get information about how long it takes to write a batch of logs to disk.
  • In Consul v1.13 and higher, you can use Raft thread saturation metrics to figure out if Raft is experiencing back pressure and is unable to accept new work due disk limitations.

In Consul v1.11 and higher, you can prevent BoltDB from writing the freelist to disk by setting raftboltdb.NoFreelistSync to true. This setting causes BoltDB to retain the freelist in memory instead. However, be aware that when BoltDB restarts, it needs to scan the database file to manually create the freelist. Small delays in startup may occur. On a fast disk, we measured these delays at the order of tens of seconds for a raft.db file that was 5GiB in size with only 250MiB of used pages.

In general, set raftboltdb.NoFreelistSync to true to produce the following effects:

  • Reduce the amount of data written to disk
  • Increase the amount of time it takes to load the raft.db file on startup

We recommend operators optimize networks according to their individual concerns. For example, if your server runs into disk performance issues but Consul servers do not restart often, setting raftboltdb.NoFreelistSync to true may solve your problems. However, the same action causes issues for deployments with large database files and frequent server restarts.

Raft snapshots

Each state change produces a Raft log entry, and each Consul server receives the same sequence of log entries, which results in servers sharing the same state. The sequence of Raft logs is periodically compacted by the leader into a snapshot of state history. These snapshots are internal to Raft and are not the same as the snapshots generated through Consul’s API, although they contain the same data. Raft snapshots are stored in the server’s data directory in the raft/ folder, alongside the logs in raft.db.

When you add a new Consul server, it must catch up to the current state. It receives the latest snapshot from the leader followed by the sequence of logs between that snapshot and the leader’s current state. Each Raft log has a sequence number and each snapshot contains the last sequence number included in the snapshot. A combination of write-heavy workloads, a large state, congested networks, or busy servers makes it possible for new servers to struggle to catch up to the current state before the next log they need from the leader has already been truncated. The result is a snapshot install loop.

For example, if snapshot A on the leader has an index of 99 and the current index is 150, then when a new server comes online the leader streams snapshot A to the new server for it to restore. However, this snapshot only enables the new server to catch up to index 99. Not only does the new server still need to catch up to index 150, but the leader continued to commit Raft logs in the meantime.

When the leader takes snapshot B at index 199, it truncates the logs that accumulated between snapshot A and snapshot B, which means it truncates Raft logs with indexes between 100 and 199.

Because the new server restored snapshot A, the new server has a current index of 99. It requests logs 100 to 150 because index 150 was the current index when it started the replication restore process. At this point, the leader recognizes that it only has logs 200 and higher, and does not have logs for indexes 100 to 150. The leader determines that the new server’s state is stale and starts the process over by sending the new server the latest snapshot, snapshot B.

Consul keeps a configurable number of Raft trailing logs to prevent the snapshot install loop from repeating. The trailing logs are the last logs that went into the snapshot, and the new server can more easily catch up to the current state using these logs. The default Raft trailing logs configuration value is suitable for most deployments.

In Consul v1.10 and higher, operators can try to prevent a snapshot install loop by monitoring and comparing Consul servers’ consul.raft.rpc.installSnapshot and consul.raft.leader.oldestLogAge timing metrics. Monitor these metrics for the following situations:

  • After truncation, the lowest number on consul.raft.leader.oldestLogAge should always be at least two times higher than the lowest number for consul.raft.rpc.installSnapshot.
  • If these metrics are too close, increase the number of Raft trailing logs, which increases consul.raft.leader.oldestLogAge. Do not set the Raft trailing logs higher than necessary, as it can negatively affect write throughput and latency.

For more information, refer to Raft Replication Capacity Issues.

Performance considerations for specific use cases

This section provides configuration and monitoring recommendations for Consul deployments according to the features you prioritize and their use cases.

Service discovery

To optimize performance for service discovery, we recommend deploying multiple small clusters with consistent numbers of service instances and watches.

Several factors influence Consul performance at scale when used primarily for its service discovery and health check features. The factors you have control over include:

  • The overall number of registered service instances
  • The use of stale reads for DNS queries
  • The number of entities, such as Consul client agents or dataplane components, that are monitoring Consul for changes in a service’s instances, including registration and health status. When any service change occurs, all of those entities incur a computational cost because they must process the state change and reconcile it with previously known data for the service. In addition, the Consul server agents also incur a computational cost when sending these updates.
  • Number of watches monitoring for changes to a service.
  • Rate of catalog updates, which is affected by the following events:
    • A service instance’s health check status changes
    • A service instance’s node loses connectivity to Consul servers
    • The contents of the service definition file changes
    • Service instances are registered or deregistered
    • Orchestrators such as Kubernetes or Nomad move a service to a new node

These factors can occur in combination with one another. Overall, the amount of work the servers complete for service discovery is the product of these factors:

  • Data size, which changes as the number of services and service instances increases
  • The catalog update rate
  • The number of active watches

Because it is typical for these factors to increase in number as clusters grow, the CPU and network resources the servers require to distribute updates may eventually exceed linear growth.

In situations where you can’t run a Consul client agent alongside the service instance you want to register with Consul, such as instances hosted externally or on legacy infrastructure, we recommend using Consul ESM.

Consul ESM enables health checks and monitoring for external services. When using Consul ESM, we recommend running multiple instances to ensure redundancy.

Service mesh

Because Consul’s service mesh uses service discovery subsystems, service mesh performance is also optimized by deploying multiple small clusters with consistent numbers of service instances and watches. Service mesh performance is influenced by the following additional factors:

  • The transparent proxy feature causes client agents to listen for service instance updates across all services instead of a subset. To prevent performance issues, we recommend that you do not use the permissive intention, default: allow, with the transparent proxy feature. When combined, every service instance update propagates to every proxy, which causes additional server load.
  • When you use the built-in service mesh CA provider, Consul leaders are responsible for signing certificates used for mTLS across the service mesh. The impact on CPU utilization depends on the total number of service instances and configured certificate TTLs. You can use the CA provider configuration options to control the number of requests a server processes. We recommend adjusting csr_max_concurrent and csr_max_per_second to suit your environment.

K/V store

While the K/V store in Consul has some similarities to object stores we recommend that you do not use it as a primary application data store.

When using Consul’s K/V store for application configuration and metadata, we recommend the following to optimize performance:

  • Values must be below 512 KB and transactions should be below 64 operations.
  • The keyspace must be well bound. While 10,000 keys may not affect performance, millions of keys are more likely to cause performance issues.
  • Total data size must fit in memory, with additional room for indexes. We recommend that the in-memory size is 3 times the raw key value size.
  • Total data size should remain below 1 GB. Larger snapshots are possible on suitably fast hardware, but they significantly increase recovery times and the operational complexity needed for replication. We recommend limiting data size to keep the cluster healthy and able to recover during maintenance and outages.
  • The K/V store is optimized for reading. To know when you need to make changes to server resources and capacity, we recommend carefully monitoring update rates after they exceed more than a hundred updates per second across the cluster.
  • We recommend that you do not use the K/V store as a general purpose database or object store.

In addition, we recommend that you do not use the blocking query mechanism to listen for updates when your K/V store’s update rate is high. When a K/V result is updated too fast, blocking query loops degrade into busy loops. These loops consume excessive client CPU and cause high server load until appropriately throttled. Watching large key prefixes is unlikely to solve the issue because returning the entire key prefix every time it updates can quickly consume a lot of bandwidth.

Backend for Vault

At scale, using Consul as a backend for Vault results in increased memory and CPU utilization on Consul servers. It also produces unbounded growth in Consul’s data persistence layer that is proportional to both the amount of data being stored in Vault and the rate the data is updated.

In situations where Consul handles large amounts of data and has high write throughput, we recommend adding monitoring for the capacity and health of raft replication on servers. If the server experiences heavy load when the size of its stored data is large enough, a follower may be unable to catch up on replication and become a voter after restarting. This situation occurs when the time it takes for a server to restore from disk takes longer than it takes for the leader to write a new snapshot and truncate its logs. Refer to Raft snapshots for more information.

Vault v1.4 and higher provides integrated storage as its recommended storage option. If you currently use Consul as a storage backend for Vault, we recommend switching to integrated storage. For a comparison between Vault’s integrated storage and Consul as a backend for Vault, refer to storage backends in the Vault documentation. For detailed guidance on migrating the Vault backend from Consul to Vault’s integrated storage, refer to the storage migration tutorial. Integrated storage improves resiliency by preventing a Consul outage from also affecting Vault functionality.