Fast Failover for STP and Cluster

Previously, cluster STP operation did not support fast failover because NetVisor OS did not share STP state between the two nodes. As a result, both when the master failed and when it came back online, the slave had to recompute the STP state from scratch. This caused two topology changes, and traffic was lost until STP converged.

NetVisor OS now supports fast failover by default. In cluster-STP mode, the node that has been up longer is elected as the master. Cluster syncs (keep-alives) are used to detect that the peer node is online and to negotiate the initial cluster state. The nodes exchange uptime values in these cluster syncs to determine which one has been up longer.
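
The election rule can be sketched as follows. This is a minimal illustration, not the NetVisor OS implementation: the ClusterSync type, its fields, and the elect_master function are hypothetical, and the tie-breaker on node name is an assumption because the documentation does not say how equal uptimes are resolved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterSync:
    """Hypothetical view of the uptime carried in a cluster sync (keep-alive)."""
    node_name: str
    uptime_seconds: int


def elect_master(local: ClusterSync, peer: ClusterSync) -> str:
    """Return the name of the node that should act as the cluster-STP master.

    The node that has been up longer wins. Breaking a tie on the lower
    node name is an ASSUMPTION made only so the function is deterministic.
    """
    if local.uptime_seconds != peer.uptime_seconds:
        longer = local if local.uptime_seconds > peer.uptime_seconds else peer
        return longer.node_name
    return min(local.node_name, peer.node_name)


if __name__ == "__main__":
    node1 = ClusterSync("node1", uptime_seconds=86_400)  # up for a day
    node2 = ClusterSync("node2", uptime_seconds=3_600)   # just rebooted
    print(elect_master(node1, node2))  # node1
```

Because both nodes see the same pair of uptime values, each one reaches the same result independently, which is what lets them agree on the master without a further exchange.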

The master runs the state machine for both nodes and sends the STP and port states to the slave. The slave in turn maintains the state as reported by the master. The slave generates its own BPDUs based on the synchronized state and forwards any BPDUs that it receives to the master. When the cluster goes offline, each node (master or slave) keeps using the same bridge ID and priority and consistent port IDs in its BPDUs, and continues from the existing synchronized state. The STP state machine state is therefore never lost.
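
The division of work between master and slave can be modeled roughly as below. This is a simplified, hypothetical model for illustration only: ClusterNode, PortState, push_state, relay_bpdu, and cluster_offline are invented names, and the BPDU bytes are a placeholder. It shows the three points from the paragraph above: the master computes and mirrors state, the slave relays received BPDUs to the master, and going offline keeps the synchronized state instead of recomputing it.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class PortState:
    role: str   # e.g. "Designated", "Root", "Alternate"
    state: str  # e.g. "Forwarding", "Discarding"


@dataclass
class ClusterNode:
    name: str
    # True on the master, and on every node once the cluster is offline.
    runs_state_machine: bool = False
    # Synchronized STP state for the ports of BOTH nodes, keyed by (node, port).
    ports: dict[tuple[str, int], PortState] = field(default_factory=dict)
    # BPDUs relayed by the slave; only filled in while acting as master.
    relayed_bpdus: list[tuple[str, int, bytes]] = field(default_factory=list)

    def push_state(self, slave: ClusterNode, key: tuple[str, int], new: PortState) -> None:
        """Master: record a state-machine result and mirror it to the slave."""
        if not self.runs_state_machine:
            raise RuntimeError(f"{self.name} is not running the STP state machine")
        self.ports[key] = new
        slave.ports[key] = new

    def relay_bpdu(self, master: ClusterNode, port: int, bpdu: bytes) -> None:
        """Slave: hand a received BPDU to the master instead of processing it."""
        master.relayed_bpdus.append((self.name, port, bpdu))

    def cluster_offline(self) -> None:
        """Cluster offline: keep the synchronized state and run STP from it."""
        self.runs_state_machine = True  # nothing is reset or recomputed


if __name__ == "__main__":
    master = ClusterNode("node1", runs_state_machine=True)
    slave = ClusterNode("node2")
    master.push_state(slave, ("node2", 17), PortState("Designated", "Forwarding"))
    slave.relay_bpdu(master, 17, b"\x00\x00\x02")  # placeholder BPDU bytes
    slave.cluster_offline()
    print(slave.ports[("node2", 17)])  # the synchronized state survives
```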

Because the nodes synchronize their internal state, use a consistent bridge ID, priority, and port IDs whether the cluster is online or offline, and handle active-active vLAGs specially, an end node detects no topology change when the cluster nodes go offline or come back online.

When a cluster is created, the STP configuration of the two cluster nodes is checked and synchronized. The following guidelines apply regardless of whether the cluster is online or offline, and whether the peer node is online or offline:

  • Both nodes use the same bridge ID and priority.
  • When node1 sends a BPDU, the port ID inside the packet is 1-256, except for active-active vLAGs.
  • When node2 sends a BPDU, the port ID inside the packet is 257-512, except for active-active vLAGs.
  • When either node sends a BPDU on an active-active vLAG, the port ID inside the packet is node1's port number.
  • Configuration changes (STP mode, MST instances, bridge ID, etc.) are mirrored on both nodes through cluster transactions.
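
The port ID rules can be expressed as a small function. This is a sketch under stated assumptions: the function name bpdu_port_id and its parameters are hypothetical, and the offset mapping (node2's local port N is advertised as 256 + N) is an assumption, since the documentation gives only the ranges 1-256 and 257-512. The value modeled here is the port number portion of the port ID, as in the list above.

```python
from typing import Optional

PORTS_PER_NODE = 256  # node1 -> port IDs 1-256, node2 -> port IDs 257-512


def bpdu_port_id(node: int, local_port: int, vlag_node1_port: Optional[int] = None) -> int:
    """Port ID that a cluster node writes into an outgoing BPDU (hypothetical helper).

    node:            1 or 2 (cluster node1 or node2)
    local_port:      the sending node's own port number, 1-256
    vlag_node1_port: node1's port number, if the port belongs to an active-active vLAG
    """
    if vlag_node1_port is not None:
        # Both nodes advertise node1's port number, so the vLAG looks like one port.
        return vlag_node1_port
    if node not in (1, 2):
        raise ValueError("node must be 1 or 2")
    if not 1 <= local_port <= PORTS_PER_NODE:
        raise ValueError(f"local port must be in 1-{PORTS_PER_NODE}")
    # ASSUMED offset mapping: node2's port N is advertised as 256 + N.
    return (node - 1) * PORTS_PER_NODE + local_port


if __name__ == "__main__":
    print(bpdu_port_id(1, 17))  # 17
    print(bpdu_port_id(2, 17))  # 273
    # vLAG member: node1 uses its port 40, node2 its port 12; both send 40.
    print(bpdu_port_id(1, 40, vlag_node1_port=40) == bpdu_port_id(2, 12, vlag_node1_port=40))  # True
```

The last line illustrates why a third-party switch cannot tell which cluster node sent a BPDU on an active-active vLAG: with the same bridge ID, priority, and port ID, the identifying fields are identical.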

Because of the above guidelines, a BPDU sent on an active-active vLAG appears exactly the same to a third-party receiver, whether that packet came from cluster node1 or node2.

NetVisor OS provides two show commands to view the details of this functionality: stp-state-show and stp-port-state-show.

For example:

CLI (network-admin@Leaf1) > stp-state-show
switch:           Leaf-1
vlan:             1
ports:            none
instance-id:      1
name:             stg-default
bridge-id:        66:0e:94:d5:b0:cc
bridge-priority:  32769
root-id:          66:0e:94:35:c2:ce
root-priority:    32769
root-port:        128
hello-time:       2
forwarding-delay: 15
max-age:          20
disabled:         none
learning:         none
forwarding:       none
discarding:       none
edge:             none
designated:       none
alternate:        none
backup:           none
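
Since both cluster nodes are expected to use the same bridge ID and priority, one way to spot-check this is to save the stp-state-show output from each node and compare the two fields. The sketch below is hypothetical: parse_show_output and same_stp_identity are invented helpers, it assumes the output has been captured as text, and it assumes each node's stp-state-show reports the shared cluster bridge ID. The second node's output in the usage example is made up for illustration.

```python
from __future__ import annotations


def parse_show_output(text: str) -> dict[str, str]:
    """Turn 'field: value' lines of saved CLI show output into a dict."""
    fields: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("CLI ("):
            continue  # skip blank lines and the prompt/command line
        key, sep, value = line.partition(":")  # split on the FIRST colon only,
        if sep:                                # so MAC addresses stay intact
            fields[key.strip()] = value.strip()
    return fields


def same_stp_identity(node1_output: str, node2_output: str) -> bool:
    """True if both nodes report the same, non-empty bridge ID and bridge priority."""
    a, b = parse_show_output(node1_output), parse_show_output(node2_output)
    return all(a.get(k) is not None and a.get(k) == b.get(k)
               for k in ("bridge-id", "bridge-priority"))


if __name__ == "__main__":
    leaf1 = """CLI (network-admin@Leaf1) > stp-state-show
switch:           Leaf-1
bridge-id:        66:0e:94:d5:b0:cc
bridge-priority:  32769
"""
    leaf2 = leaf1.replace("Leaf1", "Leaf2").replace("Leaf-1", "Leaf-2")  # hypothetical peer
    print(same_stp_identity(leaf1, leaf2))  # True
```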

CLI (network-admin@Switch2) > stp-port-state-show port 17
switch:              Switch2
vlan:                1
port:                17
stp-state:           Forwarding
role:                Designated
selected-role:       Designated
state:               agreed,learn,learning,forward,forwarding,selected,send-rstp,synced,online,requested-online
designated-priority: 32769-66:0e:94:38:39:80,100,32769-66:0e:94:b7:65:91,32785
port-priority:       32769-66:0e:94:38:39:80,100,32769-66:0e:94:b7:65:91,32785
message-priority:    0-00:00:00:00:00:00,0,0-00:00:00:00:00:00,0
info-is:             mine
hello-timer:         2
root-guard-timer:    0
sm-table-bits:       0xfaedee
sm-table:            prx=discard*,bdm=not-edge*,ptx=idle*,pim=current*,prt-disabled=disable*,prt-root=root*,prt-desg=designated*,prt-alt-bk=block*,pst=forwarding*,tcm=active*
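
The designated-priority, port-priority, and message-priority fields appear to be STP priority vectors. The decoder below assumes they follow the IEEE 802.1D-2004 layout (root bridge ID, root path cost, designated bridge ID, designated port ID), with bridge IDs printed as priority-MAC, a bridge priority whose upper 4 bits are the configured priority and lower 12 bits are the extended system ID, and a port ID whose upper 4 bits are the port priority and lower 12 bits are the port number. This interpretation is an assumption, not taken from NetVisor OS documentation, and all names in the code are hypothetical. It is consistent with the output above: 32785 (0x8011) decodes to port priority 128 and port number 17, the port that was queried, and 32769 decodes to priority 32768 with a system ID extension of 1.

```python
from typing import NamedTuple


class BridgeId(NamedTuple):
    priority: int       # configured priority, a multiple of 4096
    system_id_ext: int  # extended system ID (lower 12 bits)
    mac: str


class PortId(NamedTuple):
    priority: int  # port priority, a multiple of 16
    number: int    # port number (lower 12 bits)


class PriorityVector(NamedTuple):
    root: BridgeId
    root_path_cost: int
    designated_bridge: BridgeId
    designated_port: PortId


def parse_bridge_id(text: str) -> BridgeId:
    """Decode 'PRIORITY-MAC', for example '32769-66:0e:94:38:39:80'."""
    prio_str, mac = text.split("-", 1)
    prio = int(prio_str)
    return BridgeId(priority=prio & 0xF000, system_id_ext=prio & 0x0FFF, mac=mac)


def parse_port_id(value: int) -> PortId:
    """Split a 16-bit port ID into a 4-bit priority (x16) and a 12-bit port number."""
    return PortId(priority=(value >> 12) * 16, number=value & 0x0FFF)


def parse_priority_vector(text: str) -> PriorityVector:
    """Decode 'root,cost,designated-bridge,designated-port' (ASSUMED field order)."""
    root, cost, desg_bridge, desg_port = text.split(",")
    return PriorityVector(parse_bridge_id(root), int(cost),
                          parse_bridge_id(desg_bridge), parse_port_id(int(desg_port)))


if __name__ == "__main__":
    vec = parse_priority_vector("32769-66:0e:94:38:39:80,100,32769-66:0e:94:b7:65:91,32785")
    print(vec.root)             # BridgeId(priority=32768, system_id_ext=1, mac='66:0e:94:38:39:80')
    print(vec.root_path_cost)   # 100
    print(vec.designated_port)  # PortId(priority=128, number=17)
```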