Fault Tolerance


YugabyteDB can automatically handle failures and therefore provides high availability for both Redis and Cassandra tables. In this tutorial, we will look at how fault tolerance is achieved for Cassandra, but the same steps work for Redis tables as well. Except for the Kubernetes example, we will create these tables with a replication factor of 5, which gives a fault tolerance of 2. We will then insert some data through one of the nodes and query it from another node. Finally, we will simulate two node failures (one at a time) and verify that we can still read and write data after each failure.
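
The relationship between replication factor and fault tolerance is simply majority (quorum) arithmetic: data is replicated to RF nodes, consensus needs floor(RF/2) + 1 of those replicas, and the cluster can therefore lose the rest without interrupting reads or writes. A quick illustration in plain Python (a sketch for intuition only, not part of the tutorial steps):

  # Majority-based replication: consensus needs floor(RF/2) + 1 replicas,
  # so up to RF - (floor(RF/2) + 1) nodes can fail without losing availability.
  def fault_tolerance(replication_factor: int) -> int:
      quorum = replication_factor // 2 + 1
      return replication_factor - quorum

  print(fault_tolerance(5))  # 2 -- the replication factor used in this tutorial
  print(fault_tolerance(3))  # 1 -- the replication factor in the Kubernetes example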

If you haven’t installed YugabyteDB yet, do so first by following the Quick Start guide.

macOS/Linux

1. Setup - create universe and table

If you have a previously running local universe, destroy it using the following.

  $ ./bin/yb-ctl destroy

Start a new local universe with replication factor 5.

  $ ./bin/yb-ctl --replication_factor 5 create

Connect to cqlsh on node 1.

  $ ./bin/cqlsh 127.0.0.1

  Connected to local cluster at 127.0.0.1:9042.
  [cqlsh 5.0.1 | Cassandra 3.9-SNAPSHOT | CQL spec 3.4.2 | Native protocol v4]
  Use HELP for help.
  cqlsh>

Create a CQL keyspace and a table.

  cqlsh> CREATE KEYSPACE users;

  cqlsh> CREATE TABLE users.profile (id bigint PRIMARY KEY,
             email text,
             password text,
             profile frozen<map<text, text>>);

2. Insert data through node 1

Now insert some data by typing the following into the cqlsh shell we connected to above.

  cqlsh> INSERT INTO users.profile (id, email, password, profile) VALUES
         (1000, 'james.bond@yugabyte.com', 'licensed2Kill',
          {'firstname': 'James', 'lastname': 'Bond', 'nickname': '007'}
         );
  cqlsh> INSERT INTO users.profile (id, email, password, profile) VALUES
         (2000, 'sherlock.holmes@yugabyte.com', 'itsElementary',
          {'firstname': 'Sherlock', 'lastname': 'Holmes'}
         );

Query all the rows.

  cqlsh> SELECT email, profile FROM users.profile;

   email                        | profile
  ------------------------------+---------------------------------------------------------------
   james.bond@yugabyte.com      | {'firstname': 'James', 'lastname': 'Bond', 'nickname': '007'}
   sherlock.holmes@yugabyte.com | {'firstname': 'Sherlock', 'lastname': 'Holmes'}

  (2 rows)

3. Read data through another node

Let us now query the data from node 5.

  $ ./bin/cqlsh 127.0.0.5

  cqlsh> SELECT email, profile FROM users.profile;

   email                        | profile
  ------------------------------+---------------------------------------------------------------
   james.bond@yugabyte.com      | {'firstname': 'James', 'lastname': 'Bond', 'nickname': '007'}
   sherlock.holmes@yugabyte.com | {'firstname': 'Sherlock', 'lastname': 'Holmes'}

  (2 rows)
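
An application can read and write through any node in the same way, and it does not need to pin itself to a single node: Cassandra-compatible drivers accept a list of contact points and discover the remaining nodes from there, which helps keep clients available through the node failures simulated in the next steps. Below is a minimal sketch, assuming the open-source Python cassandra-driver (pip install cassandra-driver) and the local universe created above; the insert re-writes an existing row, which is an upsert in CQL, so the table contents stay exactly as shown in the tutorial.

  # A sketch against the YCQL API (port 9042) using the Python cassandra-driver.
  # The loopback addresses are the node IPs that yb-ctl uses in this tutorial.
  from cassandra.cluster import Cluster

  cluster = Cluster(['127.0.0.1', '127.0.0.2', '127.0.0.3'])  # several contact points
  session = cluster.connect()

  # Re-insert an existing row (an upsert, so the data does not change).
  session.execute(
      "INSERT INTO users.profile (id, email, password, profile) VALUES (%s, %s, %s, %s)",
      (1000, 'james.bond@yugabyte.com', 'licensed2Kill',
       {'firstname': 'James', 'lastname': 'Bond', 'nickname': '007'}),
  )

  # Read the rows back; any reachable node can serve the query.
  for row in session.execute("SELECT email, profile FROM users.profile"):
      print(row.email, row.profile)

  cluster.shutdown()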

4. Verify that one node failure has no impact

We have 5 nodes in this universe. You can verify this by running the following.

  $ ./bin/yb-ctl status

  ...
  2017-11-19 23:20:35,029 INFO: Server is running: type=tserver, node_id=1, ...
  2017-11-19 23:20:35,061 INFO: Server is running: type=tserver, node_id=2, ...
  2017-11-19 23:20:35,094 INFO: Server is running: type=tserver, node_id=3, ...
  2017-11-19 23:20:35,128 INFO: Server is running: type=tserver, node_id=4, ...
  2017-11-19 23:20:35,155 INFO: Server is running: type=tserver, node_id=5, ...

Let us simulate a node failure by removing node 5.

  $ ./bin/yb-ctl remove_node 5

Running the status command now should show that only 4 nodes are still running:

  $ ./bin/yb-ctl status

  ...
  2017-11-19 23:20:35,029 INFO: Server is running: type=tserver, node_id=1, ...
  2017-11-19 23:20:35,061 INFO: Server is running: type=tserver, node_id=2, ...
  2017-11-19 23:20:35,094 INFO: Server is running: type=tserver, node_id=3, ...
  2017-11-19 23:20:35,128 INFO: Server is running: type=tserver, node_id=4, ...
  2017-11-19 23:22:12,997 INFO: Server type=tserver node_id=5 is not running

Now connect to node 4.

  $ ./bin/cqlsh 127.0.0.4

Let us insert some data.

  cqlsh> INSERT INTO users.profile (id, email, password, profile) VALUES
         (3000, 'austin.powers@yugabyte.com', 'imGroovy',
          {'firstname': 'Austin', 'lastname': 'Powers'});

Now query the data.

  cqlsh> SELECT email, profile FROM users.profile;

   email                        | profile
  ------------------------------+---------------------------------------------------------------
   james.bond@yugabyte.com      | {'firstname': 'James', 'lastname': 'Bond', 'nickname': '007'}
   sherlock.holmes@yugabyte.com | {'firstname': 'Sherlock', 'lastname': 'Holmes'}
   austin.powers@yugabyte.com   | {'firstname': 'Austin', 'lastname': 'Powers'}

  (3 rows)

5. Verify that a second node failure has no impact

This cluster was created with a replication factor of 5 and hence needs only 3 replicas to reach consensus. Therefore, it is resilient to 2 failures without any data loss. Let us simulate another node failure.

  $ ./bin/yb-ctl remove_node 1

We can check the status to verify:

  $ ./bin/yb-ctl status

  ...
  2017-11-19 23:31:02,183 INFO: Server type=tserver node_id=1 is not running
  2017-11-19 23:31:02,217 INFO: Server is running: type=tserver, node_id=2, ...
  2017-11-19 23:31:02,245 INFO: Server is running: type=tserver, node_id=3, ...
  2017-11-19 23:31:02,278 INFO: Server is running: type=tserver, node_id=4, ...
  2017-11-19 23:31:02,308 INFO: Server type=tserver node_id=5 is not running

Now let us connect to node 2.

  $ ./bin/cqlsh 127.0.0.2

Insert some data.

  cqlsh> INSERT INTO users.profile (id, email, password, profile) VALUES
         (4000, 'clark.kent@yugabyte.com', 'iCanFly',
          {'firstname': 'Clark', 'lastname': 'Kent'});

Run the query.

  cqlsh> SELECT email, profile FROM users.profile;

   email                        | profile
  ------------------------------+---------------------------------------------------------------
   clark.kent@yugabyte.com      | {'firstname': 'Clark', 'lastname': 'Kent'}
   james.bond@yugabyte.com      | {'firstname': 'James', 'lastname': 'Bond', 'nickname': '007'}
   sherlock.holmes@yugabyte.com | {'firstname': 'Sherlock', 'lastname': 'Holmes'}
   austin.powers@yugabyte.com   | {'firstname': 'Austin', 'lastname': 'Powers'}

  (4 rows)

6. Clean up (optional)

Optionally, you can shut down the local cluster created in Step 1.

  $ ./bin/yb-ctl destroy

Docker

1. Setup - create universe and table

If you have a previously running local universe, destroy it using the following.

  $ ./yb-docker-ctl destroy

Start a new local universe with replication factor 5.

  $ ./yb-docker-ctl create --rf 5

Connect to cqlsh on node 1.

  $ docker exec -it yb-tserver-n1 /home/yugabyte/bin/cqlsh

  Connected to local cluster at 127.0.0.1:9042.
  [cqlsh 5.0.1 | Cassandra 3.9-SNAPSHOT | CQL spec 3.4.2 | Native protocol v4]
  Use HELP for help.
  cqlsh>

Create a Cassandra keyspace and a table.

  cqlsh> CREATE KEYSPACE users;

  cqlsh> CREATE TABLE users.profile (id bigint PRIMARY KEY,
             email text,
             password text,
             profile frozen<map<text, text>>);

2. Insert data through node 1

Now insert some data by typing the following into the cqlsh shell we connected to above.

  cqlsh> INSERT INTO users.profile (id, email, password, profile) VALUES
         (1000, 'james.bond@yugabyte.com', 'licensed2Kill',
          {'firstname': 'James', 'lastname': 'Bond', 'nickname': '007'}
         );
  cqlsh> INSERT INTO users.profile (id, email, password, profile) VALUES
         (2000, 'sherlock.holmes@yugabyte.com', 'itsElementary',
          {'firstname': 'Sherlock', 'lastname': 'Holmes'}
         );

Query all the rows.

  cqlsh> SELECT email, profile FROM users.profile;

   email                        | profile
  ------------------------------+---------------------------------------------------------------
   james.bond@yugabyte.com      | {'firstname': 'James', 'lastname': 'Bond', 'nickname': '007'}
   sherlock.holmes@yugabyte.com | {'firstname': 'Sherlock', 'lastname': 'Holmes'}

  (2 rows)

3. Read data through another node

Let us now query the data from node 5.

  $ docker exec -it yb-tserver-n5 /home/yugabyte/bin/cqlsh

  cqlsh> SELECT email, profile FROM users.profile;

   email                        | profile
  ------------------------------+---------------------------------------------------------------
   james.bond@yugabyte.com      | {'firstname': 'James', 'lastname': 'Bond', 'nickname': '007'}
   sherlock.holmes@yugabyte.com | {'firstname': 'Sherlock', 'lastname': 'Holmes'}

  (2 rows)

4. Verify that one node failure has no impact

We have 5 nodes in this universe. You can verify this by running the following.

  $ ./yb-docker-ctl status

Let us simulate a node failure by removing node 5.

  $ ./yb-docker-ctl remove_node 5

Running the status command now should show that only 4 nodes are still running:

  $ ./yb-docker-ctl status
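
Since the yb-docker-ctl output is not reproduced here, another way to check which node containers are still up is to ask Docker directly. A small sketch, assuming the Docker SDK for Python (pip install docker) and the yb-tserver-n1 through yb-tserver-n5 container names used in the commands above:

  # List the yb-tserver containers Docker knows about, with their current state.
  import docker

  client = docker.from_env()
  for container in client.containers.list(all=True):  # include stopped containers
      if container.name.startswith("yb-tserver-"):
          print(container.name, container.status)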

Now connect to node 4.

  $ docker exec -it yb-tserver-n4 /home/yugabyte/bin/cqlsh

Let us insert some data.

  cqlsh> INSERT INTO users.profile (id, email, password, profile) VALUES
         (3000, 'austin.powers@yugabyte.com', 'imGroovy',
          {'firstname': 'Austin', 'lastname': 'Powers'});

Now query the data.

  cqlsh> SELECT email, profile FROM users.profile;

   email                        | profile
  ------------------------------+---------------------------------------------------------------
   james.bond@yugabyte.com      | {'firstname': 'James', 'lastname': 'Bond', 'nickname': '007'}
   sherlock.holmes@yugabyte.com | {'firstname': 'Sherlock', 'lastname': 'Holmes'}
   austin.powers@yugabyte.com   | {'firstname': 'Austin', 'lastname': 'Powers'}

  (3 rows)

5. Verify that a second node failure has no impact

This cluster was created with a replication factor of 5 and hence needs only 3 replicas to reach consensus. Therefore, it is resilient to 2 failures without any data loss. Let us simulate another node failure.

  $ ./yb-docker-ctl remove_node 1

We can check the status to verify:

  $ ./yb-docker-ctl status

Now let us connect to node 2.

  $ docker exec -it yb-tserver-n2 /home/yugabyte/bin/cqlsh

Insert some data.

  cqlsh> INSERT INTO users.profile (id, email, password, profile) VALUES
         (4000, 'clark.kent@yugabyte.com', 'iCanFly',
          {'firstname': 'Clark', 'lastname': 'Kent'});

Run the query.

  cqlsh> SELECT email, profile FROM users.profile;

   email                        | profile
  ------------------------------+---------------------------------------------------------------
   clark.kent@yugabyte.com      | {'firstname': 'Clark', 'lastname': 'Kent'}
   james.bond@yugabyte.com      | {'firstname': 'James', 'lastname': 'Bond', 'nickname': '007'}
   sherlock.holmes@yugabyte.com | {'firstname': 'Sherlock', 'lastname': 'Holmes'}
   austin.powers@yugabyte.com   | {'firstname': 'Austin', 'lastname': 'Powers'}

  (4 rows)

6. Clean up (optional)

Optionally, you can shut down the local cluster created in Step 1.

  $ ./yb-docker-ctl destroy

Kubernetes

1. Setup - create universe and table

If you have a previously running local universe, destroy it using the following.

  $ kubectl delete -f yugabyte-statefulset.yaml

Start a new local cluster. By default, this will create a 3-node universe with a replication factor of 3.

  $ kubectl apply -f yugabyte-statefulset.yaml

Check the Kubernetes dashboard to see the 3 yb-tserver and 3 yb-master pods representing the 3 nodes of the cluster.

  $ minikube dashboard

Kubernetes Dashboard

Connect to cqlsh on node 1.

  $ kubectl exec -it yb-tserver-0 /home/yugabyte/bin/cqlsh

  Connected to local cluster at 127.0.0.1:9042.
  [cqlsh 5.0.1 | Cassandra 3.9-SNAPSHOT | CQL spec 3.4.2 | Native protocol v4]
  Use HELP for help.
  cqlsh>

Create a Cassandra keyspace and a table.

  cqlsh> CREATE KEYSPACE users;

  cqlsh> CREATE TABLE users.profile (id bigint PRIMARY KEY,
             email text,
             password text,
             profile frozen<map<text, text>>);

2. Insert data through node 1

Now insert some data by typing the following into the cqlsh shell we connected to above.

  cqlsh> INSERT INTO users.profile (id, email, password, profile) VALUES
         (1000, 'james.bond@yugabyte.com', 'licensed2Kill',
          {'firstname': 'James', 'lastname': 'Bond', 'nickname': '007'}
         );
  cqlsh> INSERT INTO users.profile (id, email, password, profile) VALUES
         (2000, 'sherlock.holmes@yugabyte.com', 'itsElementary',
          {'firstname': 'Sherlock', 'lastname': 'Holmes'}
         );

Query all the rows.

  cqlsh> SELECT email, profile FROM users.profile;

   email                        | profile
  ------------------------------+---------------------------------------------------------------
   james.bond@yugabyte.com      | {'firstname': 'James', 'lastname': 'Bond', 'nickname': '007'}
   sherlock.holmes@yugabyte.com | {'firstname': 'Sherlock', 'lastname': 'Holmes'}

  (2 rows)

3. Read data through another node

Let us now query the data from node 3.

  $ kubectl exec -it yb-tserver-2 /home/yugabyte/bin/cqlsh

  cqlsh> SELECT email, profile FROM users.profile;

   email                        | profile
  ------------------------------+---------------------------------------------------------------
   james.bond@yugabyte.com      | {'firstname': 'James', 'lastname': 'Bond', 'nickname': '007'}
   sherlock.holmes@yugabyte.com | {'firstname': 'Sherlock', 'lastname': 'Holmes'}

  (2 rows)

  cqlsh> exit;

4. Verify that one node failure has no impact

This cluster was created with a replication factor of 3 and hence needs only 2 replicas to reach consensus. Therefore, it is resilient to 1 failure without any data loss. Let us simulate the failure of node 3 by deleting the yb-tserver-2 pod.

  $ kubectl delete pod yb-tserver-2

Running the status command now should show that the yb-tserver-2 pod is Terminating.

  $ kubectl get pods

  NAME           READY     STATUS        RESTARTS   AGE
  yb-master-0    1/1       Running       0          33m
  yb-master-1    1/1       Running       0          33m
  yb-master-2    1/1       Running       0          33m
  yb-tserver-0   1/1       Running       1          33m
  yb-tserver-1   1/1       Running       1          33m
  yb-tserver-2   1/1       Terminating   0          33m

Now connect to node 2.

  $ kubectl exec -it yb-tserver-1 /home/yugabyte/bin/cqlsh

Let us insert some data to ensure that the loss of a node hasn’t impacted the ability of the universe to take writes.

  cqlsh> INSERT INTO users.profile (id, email, password, profile) VALUES
         (3000, 'austin.powers@yugabyte.com', 'imGroovy',
          {'firstname': 'Austin', 'lastname': 'Powers'});

Now query the data. We see that all the data inserted so far is returned and the loss of the node has no impact on data integrity.

  cqlsh> SELECT email, profile FROM users.profile;

   email                        | profile
  ------------------------------+---------------------------------------------------------------
   james.bond@yugabyte.com      | {'firstname': 'James', 'lastname': 'Bond', 'nickname': '007'}
   sherlock.holmes@yugabyte.com | {'firstname': 'Sherlock', 'lastname': 'Holmes'}
   austin.powers@yugabyte.com   | {'firstname': 'Austin', 'lastname': 'Powers'}

  (3 rows)

5. Verify that Kubernetes brought back the failed node

We can now check the cluster status to verify that Kubernetes has indeed brought back the yb-tserver-2 pod that failed earlier. This happens because the desired replica count for the yb-tserver StatefulSet is still 3, while only 2 pods remained running after the failure.

  $ kubectl get pods

  NAME           READY     STATUS    RESTARTS   AGE
  yb-master-0    1/1       Running   0          34m
  yb-master-1    1/1       Running   0          34m
  yb-master-2    1/1       Running   0          34m
  yb-tserver-0   1/1       Running   1          34m
  yb-tserver-1   1/1       Running   1          34m
  yb-tserver-2   1/1       Running   0          7s
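
Another way to see why the failed pod comes back is to compare the StatefulSet's desired replica count with the number of ready pods. A small sketch, assuming the official Kubernetes Python client (pip install kubernetes), the default namespace, and a kubeconfig that points at the minikube cluster; the StatefulSet name yb-tserver is inferred from the pod names above:

  # Read the yb-tserver StatefulSet and print desired vs. ready replicas.
  from kubernetes import client, config

  config.load_kube_config()  # use the local kubeconfig (for example, minikube's)
  apps = client.AppsV1Api()
  sts = apps.read_namespaced_stateful_set(name="yb-tserver", namespace="default")
  print("desired replicas:", sts.spec.replicas)         # 3 in this tutorial
  print("ready replicas:  ", sts.status.ready_replicas)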

YugabyteDB's fault tolerance, combined with Kubernetes's automated operations, makes it possible to run planet-scale applications with ease while maintaining extreme data resilience.

6. Clean up (optional)

Optionally, you can shut down the local cluster created in Step 1.

  $ kubectl delete -f yugabyte-statefulset.yaml