About Split Brain

If a node in a cluster fails to communicate with its peer (for three consecutive sync messages), the node sets the cluster to offline and functions in independent mode.
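The "three consecutive sync messages" rule can be pictured as a counter that resets whenever a sync message arrives. The following is a minimal illustrative sketch, not Netvisor ONE's implementation: the `PeerMonitor` class, its `record` method, and the `MISSED_SYNC_LIMIT` name are hypothetical, and only the threshold of three comes from the text above.

```python
from dataclasses import dataclass

# Threshold taken from the text: three consecutive sync messages.
MISSED_SYNC_LIMIT = 3


@dataclass
class PeerMonitor:
    """Hypothetical helper that counts consecutive sync messages missed from the peer."""

    missed: int = 0

    def record(self, sync_received: bool) -> bool:
        """Record one sync interval; return True once the peer counts as unreachable."""
        # Any received sync message resets the run of consecutive misses.
        self.missed = 0 if sync_received else self.missed + 1
        return self.missed >= MISSED_SYNC_LIMIT


monitor = PeerMonitor()
# Two misses, one success (which resets the counter), then three misses in a row.
results = [monitor.record(ok) for ok in (False, False, True, False, False, False)]
print(results)  # [False, False, False, False, False, True]
```

Note that isolated misses do not accumulate: only an unbroken run of three triggers the transition in this sketch.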

This transition is appropriate when an entire node breaks down, because the sole surviving node has to take on full responsibility for providing continuity of service on its own, without load splitting.

On the other hand, when it is the communication between the two nodes that breaks down (rather than an actual device) for an extended period of time, both nodes remain alive but isolated, and the cluster would have to be set to the offline state. This condition is called split brain and is undesirable, because the two halves of the cluster work independently instead of in sync with each other, potentially causing network issues.

To avoid ending up with a split brain, that is, with a split cluster, Netvisor ONE can be configured to use a secondary communication channel to verify the health of both nodes. Then, if both nodes are found to be healthy but the communication between them remains broken, the cluster is flagged as being affected by the split brain condition and is moved to a special state to prevent potential network issues.

To detect split brain, instead of going offline right away when communication on the primary channel breaks, Netvisor ONE first sets the state of each node to remote-down while it uses a secondary communication channel to reach the peer node. Many network designs include a dedicated management network, which can be leveraged as the secondary communication channel. If each node can reach its peer through this network (which establishes that the nodes are indeed healthy but isolated), the master node then transitions to a new state called split-active, while the slave node is set to the split-suspended state.

When a slave is in split-suspended state, it brings down its vLAG members to avoid split brain issues on the vLAGs. It also brings down any Layer 3 forwarding function to avoid incorrect routing issues. On the other hand, the master in split-active state will bring up any standby ports that are members of an active-standby vLAG to guarantee continuity of service of that vLAG.
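The effect of the two split states on ports and routing can be summarized in a short sketch. This is only an illustration of the behavior described above under simplifying assumptions: the `VlagPort` class, its fields, and the `apply_split_state` function are hypothetical names, and Layer 3 forwarding is reduced to a single boolean.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VlagPort:
    """Hypothetical model of one local vLAG member port."""

    name: str      # illustrative label only
    standby: bool  # True if this is the standby member of an active-standby vLAG
    up: bool       # current link state


def apply_split_state(state: str, ports: List[VlagPort], l3_forwarding: bool) -> bool:
    """Apply the port actions for a split state; return the resulting L3 forwarding flag."""
    if state == "split-suspended":
        # Slave: bring down all vLAG members and stop Layer 3 forwarding.
        for port in ports:
            port.up = False
        return False
    if state == "split-active":
        # Master: bring up standby members so active-standby vLAGs keep forwarding.
        for port in ports:
            if port.standby:
                port.up = True
    # Any other state leaves ports and forwarding unchanged in this sketch.
    return l3_forwarding


master_ports = [VlagPort("p1", standby=False, up=True), VlagPort("p2", standby=True, up=False)]
slave_ports = [VlagPort("p1", standby=False, up=True), VlagPort("p2", standby=True, up=False)]

master_l3 = apply_split_state("split-active", master_ports, l3_forwarding=True)
slave_l3 = apply_split_state("split-suspended", slave_ports, l3_forwarding=True)

print([p.up for p in master_ports], master_l3)  # [True, True] True
print([p.up for p in slave_ports], slave_l3)    # [False, False] False
```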

When communication recovers on the primary communication channel, the two nodes try to re-sync by periodically sending messages to each other. As soon as the nodes verify mutual reachability, the state of the cluster is set to online.

On the other hand, if a node cannot reach its peer over either the primary or the secondary communication channel, the cluster state is set to offline.
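The outcomes described in this section can be condensed into one decision function. The sketch below is a simplified model rather than the actual Netvisor ONE logic: the `State` enum and `resolve_state` function are hypothetical, node and cluster states are folded into a single enum for brevity, and a secondary-channel result of `None` is assumed to mean that the probe has not completed yet.

```python
from enum import Enum
from typing import Optional


class State(Enum):
    """State names as used in the text; the enum itself is illustrative."""

    ONLINE = "online"
    REMOTE_DOWN = "remote-down"
    SPLIT_ACTIVE = "split-active"
    SPLIT_SUSPENDED = "split-suspended"
    OFFLINE = "offline"


def resolve_state(primary_ok: bool, secondary_ok: Optional[bool], is_master: bool) -> State:
    """Map channel reachability and node role to the resulting state."""
    if primary_ok:
        # Mutual reachability on the primary channel: the cluster is online.
        return State.ONLINE
    if secondary_ok is None:
        # Primary channel is down and the secondary probe is still pending.
        return State.REMOTE_DOWN
    if secondary_ok:
        # Peer is healthy but isolated: split brain, so the role decides the state.
        return State.SPLIT_ACTIVE if is_master else State.SPLIT_SUSPENDED
    # Peer is unreachable on both channels.
    return State.OFFLINE


print(resolve_state(False, None, True).value)    # remote-down
print(resolve_state(False, True, True).value)    # split-active
print(resolve_state(False, True, False).value)   # split-suspended
print(resolve_state(False, False, True).value)   # offline
print(resolve_state(True, None, False).value)    # online
```

Keeping the pending probe as a distinct input makes the transitional remote-down state explicit, instead of treating an unanswered probe as a failure.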