Network Isolation (Split Brain)

If a replicated live or backup server becomes isolated on the network, failover can occur and you can end up with 2 live servers serving messages in the cluster. This is called split brain. There are different configurations you can choose from that help mitigate this problem.

Quorum Voting

Quorum voting is used by both the live and the backup server to decide what to do if a replication connection is disconnected. The server asks each live server in the cluster to vote on whether it thinks the server it is replicating to or from is still alive. You can also configure how long the quorum manager waits for the quorum vote response. The default is 30 seconds. It can be configured for the master, and likewise for the slave, like so:

  <ha-policy>
    <replication>
      <master>
        <quorum-vote-wait>12</quorum-vote-wait>
      </master>
    </replication>
  </ha-policy>
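
The text above says the same wait time can be set for the slave. A minimal sketch of that variant, assuming quorum-vote-wait is accepted under the slave element in the same way as under master (the value 12 is simply reused from the master example):

```xml
<ha-policy>
  <replication>
    <slave>
      <!-- assumed to mirror the master setting: wait 12 seconds for the quorum vote response -->
      <quorum-vote-wait>12</quorum-vote-wait>
    </slave>
  </replication>
</ha-policy>
```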

Because the decision rests on a majority vote, the minimum number of live/backup pairs needed is 3. If fewer than 3 pairs are used, the only options are to use a Network Pinger, which is explained later in this chapter, or to choose how each server should react, as the following sections detail:

Backup Voting

By default, if a replica loses its replication connection to the live broker, it uses a quorum vote to decide whether to start. This requires at least 3 pairs of live/backup nodes in the cluster. In a 3 node cluster it starts if it gets 2 votes back saying that its live server is no longer available; with 4 nodes it needs 3 votes, and so on. When a backup loses its connection to the master, it keeps requesting a quorum vote until it either receives a vote allowing it to start or detects that the master is still live. In the latter case it restarts as a backup. The number of vote attempts and the wait between them are configured like so:

  <ha-policy>
    <replication>
      <slave>
        <vote-retries>12</vote-retries>
        <vote-retry-wait>5000</vote-retry-wait>
      </slave>
    </replication>
  </ha-policy>

It’s also possible to statically set the quorum size for the case where the cluster size is known up front. This is done on the replica policy like so:

  <ha-policy>
    <replication>
      <slave>
        <quorum-size>2</quorum-size>
      </slave>
    </replication>
  </ha-policy>

In this example the quorum size is set to 2, so if you were using a single pair and the backup lost connectivity, it would never start.
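
The backup settings shown so far are given in separate fragments. As a sketch only, assuming the three elements may appear together inside a single slave element, a combined backup policy might look like this (all values are reused from the examples above, not recommendations):

```xml
<ha-policy>
  <replication>
    <slave>
      <!-- number of vote attempts before giving up, from the earlier example -->
      <vote-retries>12</vote-retries>
      <!-- wait between vote attempts, from the earlier example -->
      <vote-retry-wait>5000</vote-retry-wait>
      <!-- static quorum size, for a cluster whose size is known up front -->
      <quorum-size>2</quorum-size>
    </slave>
  </replication>
</ha-policy>
```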

Live Voting

By default, if the live server loses its replication connection, it just carries on and waits for a backup to reconnect and start replicating again. In a possible split brain scenario this may mean that the live stays live even though the backup has been activated. It is possible to configure the live server to request a quorum vote when this happens, so that if the live server does not receive a majority vote it shuts down. This is done by setting vote-on-replication-failure to true.

  <ha-policy>
    <replication>
      <master>
        <vote-on-replication-failure>true</vote-on-replication-failure>
        <quorum-size>2</quorum-size>
      </master>
    </replication>
  </ha-policy>

As with the backup policy, the quorum size can also be configured statically, as the quorum-size element in the example shows.

Pinging the network

You may configure one or more addresses in broker.xml that are part of your network topology. They will be pinged throughout the life cycle of the server.

If none of them can be reached, the server stops itself until the network is back.

If you execute the create command passing a --ping argument, a default broker.xml is created that is ready to be used with network checks:

  ./artemis create /myDir/myServer --ping 10.0.0.1

This XML part will be added to your broker.xml:

  <!--
  You can verify the network health of a particular NIC by specifying the <network-check-NIC> element.
  <network-check-NIC>theNicName</network-check-NIC>
  -->
  <!--
  Use this to use an HTTP server to validate the network
  <network-check-URL-list>http://www.apache.org</network-check-URL-list> -->
  <network-check-period>10000</network-check-period>
  <network-check-timeout>1000</network-check-timeout>
  <!-- this is a comma separated list, no spaces, just DNS or IPs
       it should accept IPV6
       Warning: Make sure you understand your network topology as this is meant to check if your network is up.
                Using IPs that could eventually disappear or be partially visible may defeat the purpose.
                You can use a list of multiple IPs, any successful ping will make the server OK to continue running -->
  <network-check-list>10.0.0.1</network-check-list>
  <!-- use this to customize the ping used for ipv4 addresses -->
  <network-check-ping-command>ping -c 1 -t %d %s</network-check-ping-command>
  <!-- use this to customize the ping used for ipv6 addresses -->
  <network-check-ping6-command>ping6 -c 1 %2$s</network-check-ping6-command>

Once you lose connectivity to 10.0.0.1 in the given example, you will see this output at the server:

  09:49:24,562 WARN [org.apache.activemq.artemis.core.server.NetworkHealthCheck] Ping Address /10.0.0.1 wasn't reacheable
  09:49:36,577 INFO [org.apache.activemq.artemis.core.server.NetworkHealthCheck] Network is unhealthy, stopping service ActiveMQServerImpl::serverUUID=04fd5dd8-b18c-11e6-9efe-6a0001921ad0
  09:49:36,625 INFO [org.apache.activemq.artemis.core.server] AMQ221002: Apache ActiveMQ Artemis Message Broker version 1.6.0 [04fd5dd8-b18c-11e6-9efe-6a0001921ad0] stopped, uptime 14.787 seconds
  09:50:00,653 WARN [org.apache.activemq.artemis.core.server.NetworkHealthCheck] ping: sendto: No route to host
  09:50:10,656 WARN [org.apache.activemq.artemis.core.server.NetworkHealthCheck] Host is down: java.net.ConnectException: Host is down
      at java.net.Inet6AddressImpl.isReachable0(Native Method) [rt.jar:1.8.0_73]
      at java.net.Inet6AddressImpl.isReachable(Inet6AddressImpl.java:77) [rt.jar:1.8.0_73]
      at java.net.InetAddress.isReachable(InetAddress.java:502) [rt.jar:1.8.0_73]
      at org.apache.activemq.artemis.core.server.NetworkHealthCheck.check(NetworkHealthCheck.java:295) [artemis-commons-1.6.0-SNAPSHOT.jar:1.6.0-SNAPSHOT]
      at org.apache.activemq.artemis.core.server.NetworkHealthCheck.check(NetworkHealthCheck.java:276) [artemis-commons-1.6.0-SNAPSHOT.jar:1.6.0-SNAPSHOT]
      at org.apache.activemq.artemis.core.server.NetworkHealthCheck.run(NetworkHealthCheck.java:244) [artemis-commons-1.6.0-SNAPSHOT.jar:1.6.0-SNAPSHOT]
      at org.apache.activemq.artemis.core.server.ActiveMQScheduledComponent$2.run(ActiveMQScheduledComponent.java:189) [artemis-commons-1.6.0-SNAPSHOT.jar:1.6.0-SNAPSHOT]
      at org.apache.activemq.artemis.core.server.ActiveMQScheduledComponent$3.run(ActiveMQScheduledComponent.java:199) [artemis-commons-1.6.0-SNAPSHOT.jar:1.6.0-SNAPSHOT]
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_73]
      at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [rt.jar:1.8.0_73]
      at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [rt.jar:1.8.0_73]
      at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [rt.jar:1.8.0_73]
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_73]
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_73]
      at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_73]

Once you re-establish network connectivity to the addresses in the configured check list, the server starts again:

  09:53:23,461 INFO [org.apache.activemq.artemis.core.server.NetworkHealthCheck] Network is healthy, starting service ActiveMQServerImpl::
  09:53:23,462 INFO [org.apache.activemq.artemis.core.server] AMQ221000: live Message Broker is starting with configuration Broker Configuration (clustered=false,journalDirectory=./data/journal,bindingsDirectory=./data/bindings,largeMessagesDirectory=./data/large-messages,pagingDirectory=./data/paging)
  09:53:23,462 INFO [org.apache.activemq.artemis.core.server] AMQ221013: Using NIO Journal
  09:53:23,462 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-server]. Adding protocol support for: CORE
  09:53:23,463 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-amqp-protocol]. Adding protocol support for: AMQP
  09:53:23,463 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-hornetq-protocol]. Adding protocol support for: HORNETQ
  09:53:23,463 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-mqtt-protocol]. Adding protocol support for: MQTT
  09:53:23,464 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-openwire-protocol]. Adding protocol support for: OPENWIRE
  09:53:23,464 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-stomp-protocol]. Adding protocol support for: STOMP
  09:53:23,541 INFO [org.apache.activemq.artemis.core.server] AMQ221003: Deploying queue jms.queue.DLQ
  09:53:23,541 INFO [org.apache.activemq.artemis.core.server] AMQ221003: Deploying queue jms.queue.ExpiryQueue
  09:53:23,549 INFO [org.apache.activemq.artemis.core.server] AMQ221020: Started Acceptor at 0.0.0.0:61616 for protocols [CORE,MQTT,AMQP,STOMP,HORNETQ,OPENWIRE]
  09:53:23,550 INFO [org.apache.activemq.artemis.core.server] AMQ221020: Started Acceptor at 0.0.0.0:5445 for protocols [HORNETQ,STOMP]
  09:53:23,554 INFO [org.apache.activemq.artemis.core.server] AMQ221020: Started Acceptor at 0.0.0.0:5672 for protocols [AMQP]
  09:53:23,555 INFO [org.apache.activemq.artemis.core.server] AMQ221020: Started Acceptor at 0.0.0.0:1883 for protocols [MQTT]
  09:53:23,556 INFO [org.apache.activemq.artemis.core.server] AMQ221020: Started Acceptor at 0.0.0.0:61613 for protocols [STOMP]
  09:53:23,556 INFO [org.apache.activemq.artemis.core.server] AMQ221007: Server is now live
  09:53:23,556 INFO [org.apache.activemq.artemis.core.server] AMQ221001: Apache ActiveMQ Artemis Message Broker version 1.6.0 [0.0.0.0, nodeID=04fd5dd8-b18c-11e6-9efe-6a0001921ad0]

Warning

Make sure you understand your network topology, as this is meant to validate your network. Using IPs that could eventually disappear or be partially visible may defeat the purpose. You can use a list of multiple IPs; any successful ping will make the server OK to continue running.
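
As a sketch of the multiple-address case, a check list might look like the following. The second IP and the host name are hypothetical placeholders, the period and timeout values are copied from the generated defaults, and the elements are assumed to sit in the same place in broker.xml where the create command puts them.

```xml
<!-- hypothetical addresses: comma separated, no spaces, DNS names or IPs -->
<network-check-list>10.0.0.1,10.0.0.2,gateway.example.com</network-check-list>
<!-- values copied from the generated defaults -->
<network-check-period>10000</network-check-period>
<network-check-timeout>1000</network-check-timeout>
```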