Fault Tolerance

Fault tolerance is the ability of a system to continue operating without interruption despite the failure of one or more components. The most basic production deployment of Consul has 3 server agents and can lose a single server without interruption.

As you continue to use Consul, your circumstances may change. Perhaps a datacenter becomes more business critical or risk management policies change, necessitating an increase in fault tolerance. The sections below discuss options for how to improve Consul’s fault tolerance.

Fault Tolerance in Consul

Consul’s fault tolerance is determined by the configuration of its voting server agents.

Each Consul datacenter depends on a set of Consul voting server agents. The voting servers ensure Consul has a consistent, fault-tolerant state by requiring a majority of voting servers, known as a quorum, to agree upon any state changes. Examples of state changes include: adding or removing services, adding or removing nodes, and changes in service or node health status.

Without a quorum, Consul experiences an outage: it cannot provide most of its capabilities because they rely on the availability of this state information. If Consul has an outage, normal operation can be restored by following the Disaster recovery for Consul clusters guide.

If Consul is deployed with 3 servers, the quorum size is 2. The deployment can lose 1 server and still maintain quorum, so it has a fault tolerance of 1. If Consul is instead deployed with 5 servers, the quorum size increases to 3, so the fault tolerance increases to 2. To learn more about the relationship between the number of servers, quorum, and fault tolerance, refer to the consensus protocol documentation.
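The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the majority rule described in this section; the function names `quorum_size` and `fault_tolerance` are hypothetical and are not part of any Consul API.

```python
def quorum_size(voting_servers: int) -> int:
    """Return the smallest majority of the given number of voting servers."""
    if voting_servers < 1:
        raise ValueError("at least one voting server is required")
    return voting_servers // 2 + 1


def fault_tolerance(voting_servers: int) -> int:
    """Return how many voting servers can fail while a quorum remains."""
    return voting_servers - quorum_size(voting_servers)


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(f"{n} servers: quorum {quorum_size(n)}, "
              f"fault tolerance {fault_tolerance(n)}")
```

Under this rule, a 4-server deployment has a quorum size of 3 and a fault tolerance of 1, the same as a 3-server deployment, which is one reason odd server counts are the usual choice.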

Effectively mitigating your risk is more nuanced than just increasing the fault tolerance metric described above. You must consider:

Correlated Risks

Are you protected against correlated risks? Infrastructure-level failures can cause multiple servers to fail at the same time. This means that a single infrastructure-level failure could cause a Consul outage, even if your server-level fault tolerance is 2.

Mitigation Costs

What are the costs of the mitigation? Different mitigation options present different trade-offs for operational complexity, computing cost, and Consul request performance.

Strategies to Increase Fault Tolerance

The following sections explore several options for increasing Consul’s fault tolerance.

HashiCorp recommends all production deployments consider:

Spread Servers Across Infrastructure Availability Zones

The cloud or on-premise infrastructure underlying your Consul datacenter may be split into several “availability zones”. An availability zone is meant to share no points of failure with other zones by:

  • Having power, cooling, and networking systems independent from other zones
  • Being physically distant enough from other zones so that large-scale disruptions such as natural disasters (flooding, earthquakes) are very unlikely to affect multiple zones

Availability zones are available in the regions of most cloud providers and in some on-premise installations. If possible, spread your Consul voting servers across 3 availability zones to protect your Consul datacenter from a single zone-level failure. For example, if deploying 5 Consul servers across 3 availability zones, place no more than 2 servers in each zone. If one zone fails, at most 2 servers are lost and quorum will be maintained by the 3 remaining servers.
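As a rough planning aid, the placement rule above can be checked with a short Python sketch. It assumes every server in the layout is a voting server and that a zone failure takes down all servers in that zone; the function name and the zone labels are hypothetical.

```python
from typing import Mapping


def survives_any_single_zone_failure(servers_per_zone: Mapping[str, int]) -> bool:
    """Return True if quorum is kept no matter which single zone fails.

    Assumes every server counted here is a voting server and that a zone
    failure removes all of the servers placed in that zone.
    """
    total = sum(servers_per_zone.values())
    quorum = total // 2 + 1
    # Losing a zone removes all of its servers; the rest must still form a quorum.
    return all(total - lost >= quorum for lost in servers_per_zone.values())


if __name__ == "__main__":
    balanced = {"zone-a": 2, "zone-b": 2, "zone-c": 1}
    skewed = {"zone-a": 3, "zone-b": 1, "zone-c": 1}
    print(survives_any_single_zone_failure(balanced))
    print(survives_any_single_zone_failure(skewed))
```

The balanced layout keeps at least 3 of 5 servers after any single zone failure. The skewed layout does not, because losing the zone that holds 3 servers leaves only 2.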

To distribute your Consul servers across availability zones, modify your infrastructure configuration with your infrastructure provider. No change is needed to your Consul server’s agent configuration.

Additionally, you should leverage resources that can automatically restore your compute instances, such as autoscaling groups, virtual machine scale sets, or a compute engine autoscaler. These autoscaling resources can be customized to redeploy servers into specific availability zones and to ensure the desired number of servers is available at all times.

Add More Voting Servers

For most production use cases, we recommend using either 3 or 5 voting servers, yielding a server-level fault tolerance of 1 or 2 respectively.

Even though it would improve fault tolerance, adding voting servers beyond 5 is not recommended because it decreases Consul’s performance: Consul must involve more servers in every state change or consistent read.

Consul Enterprise provides a way to improve fault tolerance without this performance penalty: using backup voting servers to replace lost voters.

Use Backup Voting Servers to Replace Lost Voters (Enterprise)

Consul Enterprise redundancy zones can be used to improve fault tolerance without the performance penalty of increasing the number of voting servers.

Each redundancy zone should be assigned 2 or more Consul servers. If all servers are healthy, only one server per redundancy zone will be an active voter; all other servers will be backup voters. If a zone’s voter is lost, it will be replaced by:

  • A backup voter within the same zone, if any. Otherwise,
  • A backup voter within another zone, if any.
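The preference order in the list above can be illustrated with a small Python sketch. It is not Consul's implementation: the `Server` type and `pick_replacement` function are hypothetical, and choosing the first available backup from another zone is an assumption made only to keep the example short.

```python
from typing import NamedTuple, Optional, Sequence


class Server(NamedTuple):
    """Hypothetical record describing a healthy backup voter."""
    name: str
    zone: str


def pick_replacement(lost_zone: str,
                     healthy_backups: Sequence[Server]) -> Optional[Server]:
    """Choose a backup voter to promote after a zone's voter is lost.

    Prefers a backup in the same zone as the lost voter and otherwise
    falls back to a backup in another zone. Taking the first backup
    found in another zone is an assumption of this sketch.
    """
    for server in healthy_backups:
        if server.zone == lost_zone:
            return server
    return healthy_backups[0] if healthy_backups else None


if __name__ == "__main__":
    backups = [Server("backup-1", "zone1"), Server("backup-2", "zone2")]
    print(pick_replacement("zone2", backups))  # same-zone backup is preferred
    print(pick_replacement("zone3", backups))  # falls back to another zone
```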

Consul can replace lost voters with backup voters within 30 seconds in most cases. Because this replacement process is not instantaneous, redundancy zones do not improve immediate fault tolerance, which is the number of healthy voting servers that can fail at once without causing an outage. Instead, redundancy zones improve optimistic fault tolerance: the number of healthy active and backup voting servers that can fail gradually without causing an outage.

The relationship between these two types of fault tolerance is:

Optimistic fault tolerance = immediate fault tolerance + the number of healthy backup voters

For example, consider a Consul datacenter with 3 redundancy zones and 2 servers per zone. There will be 3 voting servers (1 per zone), meaning a quorum size of 2 and an immediate fault tolerance of 1. There will also be 3 backup voters (1 per zone), each of which increases the optimistic fault tolerance by 1. Therefore, the optimistic fault tolerance is 4. This provides performance similar to a 3-server setup with fault tolerance similar to a 7-server setup.
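The worked example can be reproduced with a short Python sketch of the formula above. It assumes all servers are healthy, each redundancy zone has exactly one active voter, and every other server in a zone is a backup voter; the function name is hypothetical.

```python
from typing import Tuple


def redundancy_zone_tolerance(zones: int, servers_per_zone: int) -> Tuple[int, int]:
    """Return (immediate, optimistic) fault tolerance for a healthy datacenter.

    Assumes one active voter per redundancy zone and treats every other
    server in a zone as a backup voter.
    """
    if zones < 1 or servers_per_zone < 1:
        raise ValueError("zones and servers_per_zone must be at least 1")
    voters = zones
    backup_voters = zones * (servers_per_zone - 1)
    quorum = voters // 2 + 1
    immediate = voters - quorum
    # Optimistic fault tolerance = immediate fault tolerance + healthy backup voters.
    optimistic = immediate + backup_voters
    return immediate, optimistic


if __name__ == "__main__":
    immediate, optimistic = redundancy_zone_tolerance(zones=3, servers_per_zone=2)
    print(f"immediate: {immediate}, optimistic: {optimistic}")
```

With 3 zones and 2 servers per zone, the sketch yields an immediate fault tolerance of 1 and an optimistic fault tolerance of 4, matching the example.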

We recommend associating each Consul redundancy zone with an infrastructure availability zone to also gain the infrastructure-level fault tolerance benefits provided by availability zones. However, Consul redundancy zones can be used even without the backing of infrastructure availability zones.

For more information on redundancy zones, refer to: