How to Simulate and Fix Cluster Transaction Divergence

Consider a scenario in which the same sequence of commands is executed on both cluster nodes in a two-node fabric, but the transactions diverge because the cluster change IDs differ.

Follow the steps below to build a sample configuration that artificially causes transaction divergence between two cluster nodes in a fabric:

Note: To simulate transaction divergence, both selected nodes must be in the same fabric.

First, view the fabric node details by using the command:

CLI (network-admin@node_1) > fabric-node-show format name,fab-name,state,fab-tid

name   fab-name state  fab-tid
------ -------- ------ -------
node_1 fab_1    online 3
node_2 fab_1    online 3
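The precondition in the note above can also be checked from a script. The following is a minimal sketch, assuming the fabric-node-show output has been captured as plain text; the helper names parse_table and ready_for_cluster are hypothetical and not part of Netvisor ONE.

```python
def parse_table(text):
    """Parse whitespace-separated CLI table output into a list of dicts."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    rows = []
    for ln in lines[1:]:
        if set(ln.strip()) <= {"-", " "}:
            continue  # skip the dashed separator row
        rows.append(dict(zip(header, ln.split())))
    return rows


def ready_for_cluster(rows, node_a, node_b):
    """True if both nodes are listed, online, and in the same fabric."""
    by_name = {row["name"]: row for row in rows}
    a, b = by_name.get(node_a), by_name.get(node_b)
    if a is None or b is None:
        return False
    same_fabric = a["fab-name"] == b["fab-name"]
    both_online = a["state"] == "online" and b["state"] == "online"
    return same_fabric and both_online


# Sample text copied from the output shown above.
SAMPLE = """\
name   fab-name state  fab-tid
------ -------- ------ -------
node_1 fab_1    online 3
node_2 fab_1    online 3
"""

print(ready_for_cluster(parse_table(SAMPLE), "node_1", "node_2"))
```

The parser splits on whitespace, so it only suits columns whose values contain no spaces, which holds for the four columns requested here.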

Create a cluster (cluster1-2) between two nodes, node_1 and node_2, by using the command:

CLI (network-admin@node_1) > cluster-create name <cluster_name> cluster-node-1 <node_1> cluster-node-2 <node_2>

For example:

CLI (network-admin@node_1*) > cluster-create name cluster1-2 cluster-node-1 node_1 cluster-node-2 node_2

CLI (network-admin@node_1*) > transaction-node-show layout vertical

local-tid: 3

fabric-tid: 3

cluster-tid: 0

fabric-changeid: b07616bab667af27b8ac3e5c0af0162faaf8408bac68b84e61679e824bb82276

cluster-changeid: 0000000000000000000000000000000000000000000000000000000000000000

Note: The asterisk (*) in the CLI indicates a local switch on which the configurations are executed.

Create a cluster-scoped VLAN (VLAN 10 in this example) on node_1, then verify that cluster-tid and cluster-changeid have changed and are in sync across both cluster nodes by using the commands:

CLI (network-admin@node_1*) > vlan-create id 10 scope cluster

CLI (network-admin@node_1*) > transaction-node-show layout vertical

local-tid: 3

fabric-tid: 3

cluster-tid: 1 <--- tid is increased

fabric-changeid: b07616bab667af27b8ac3e5c0af0162faaf8408bac68b84e61679e824bb82276

cluster-changeid: ce27610495ae1c1e799dfcc566f9f8ad910879fba130c07025c86809a7873b1b

CLI (network-admin@node_2*) > transaction-node-show layout vertical

local-tid: 3

fabric-tid: 3

cluster-tid: 1

fabric-changeid: b07616bab667af27b8ac3e5c0af0162faaf8408bac68b84e61679e824bb82276

cluster-changeid: ce27610495ae1c1e799dfcc566f9f8ad910879fba130c07025c86809a7873b1b

Note: As expected, the show outputs display synchronized data for both cluster nodes: the cluster-tid and cluster-changeid values are the same on node_1 and node_2.
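The comparison described in the note can be automated. Below is a minimal sketch, assuming the transaction-node-show layout vertical output of each node has been saved as text; parse_vertical and cluster_in_sync are hypothetical helper names introduced only for illustration.

```python
def parse_vertical(text):
    """Turn 'key: value' lines into a dict, dropping '<---' annotations."""
    fields = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.split("<")[0].strip()
    return fields


def cluster_in_sync(out_a, out_b):
    """True if both outputs report the same cluster-tid and cluster-changeid."""
    a, b = parse_vertical(out_a), parse_vertical(out_b)
    keys = ("cluster-tid", "cluster-changeid")
    return all(k in a and a[k] == b.get(k) for k in keys)


# Values copied from the outputs shown above.
NODE_1_OUT = """\
fabric-tid: 3
cluster-tid: 1 <--- tid is increased
cluster-changeid: ce27610495ae1c1e799dfcc566f9f8ad910879fba130c07025c86809a7873b1b
"""
NODE_2_OUT = """\
fabric-tid: 3
cluster-tid: 1
cluster-changeid: ce27610495ae1c1e799dfcc566f9f8ad910879fba130c07025c86809a7873b1b
"""

print(cluster_in_sync(NODE_1_OUT, NODE_2_OUT))
```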

Now, to cause a divergence, ensure that the allow-offline-cluster-nodes parameter is enabled on both nodes (node_1 and node_2) by using the transaction-settings-modify command. This option is enabled by default on Netvisor ONE.

CLI (network-admin@node_1*) > transaction-settings-modify allow-offline-cluster-nodes

CLI (network-admin@node_1*) > transaction-settings-show

allow-offline-cluster-nodes: on

auto-recover: on

auto-recover-retry-time: 5m

reserve-retry-maximum: 10

reserve-retry-interval-maximum(s): 8

CLI (network-admin@node_2*) > transaction-settings-modify allow-offline-cluster-nodes

CLI (network-admin@node_2*) > transaction-settings-show

allow-offline-cluster-nodes: on

auto-recover: on

auto-recover-retry-time: 5m

reserve-retry-maximum: 10

reserve-retry-interval-maximum(s): 8

Next, to artificially cause divergence, stop nvOSd on node_2 by using the service svc-nvOSd stop command, and verify that node_2 is now in the offline state:

root@node_2: $ service svc-nvOSd stop

CLI (network-admin@node_2*) > fabric-node-show format name,fab-name,state,fab-tid

name   fab-name state   fab-tid
------ -------- ------- -------
node_1 fab_1    online  3
node_2 fab_1    offline 3 <-- node_2 is offline

Now create cluster-scoped VLAN 11 on node_1 by using the following command, and view the details:

CLI (network-admin@node_1*) > vlan-create id 11 scope cluster

Warning: cluster node node_2 not reachable, continuing anyway

Vlans 11 created

CLI (network-admin@node_1*) > transaction-node-show layout vertical

local-tid: 3

fabric-tid: 3

cluster-tid: 2 <--- tid is increased

fabric-changeid: b07616bab667af27b8ac3e5c0af0162faaf8408bac68b84e61679e824bb82276

cluster-changeid: dc0082d1d86937ca79a0357d51f139e31c51b8f6a221f23d20dc4b2b5bb2dc26 <-- changed

Now, bring down nvOSd on node_1 by using the service svc-nvOSd stop command and view the details:

root@node_1: $ service svc-nvOSd stop

CLI (network-admin@node_1) > fabric-node-show format name,fab-name,state,fab-tid

name   fab-name state   fab-tid
------ -------- ------- -------
node_1 fab_1    offline 3
node_2 fab_1    offline 3 <--- Both cluster nodes are offline now

Bring up nvOSd on node_2 by using the service svc-nvOSd start command and view the details:

root@node_2: $ service svc-nvOSd start

CLI (network-admin@node_2) > fabric-node-show format name,fab-name,state,fab-tid

name   fab-name state   fab-tid
------ -------- ------- -------
node_1 fab_1    offline 3
node_2 fab_1    online  3

CLI (network-admin@node_2*) > transaction-node-show layout vertical

local-tid: 3

fabric-tid: 3

cluster-tid: 1

fabric-changeid: b07616bab667af27b8ac3e5c0af0162faaf8408bac68b84e61679e824bb82276

cluster-changeid: ce27610495ae1c1e799dfcc566f9f8ad910879fba130c07025c86809a7873b1b

Note: In the above show output, cluster-tid and cluster-changeid still display the earlier values (those seen after VLAN 10 was created, not the latest values from node_1). Although auto-recovery is enabled on node_2, transaction synchronization did not occur because the cluster peer (node_1) is down.

Create the same cluster-scoped VLAN (VLAN 11, as created earlier on node_1) on node_2:

CLI (network-admin@node_2*) > vlan-create id 11 scope cluster

Warning: cluster node node_1 not reachable, continuing anyway

Vlans 11 created

CLI (network-admin@node_2*) > transaction-node-show layout vertical

local-tid: 3

fabric-tid: 3

cluster-tid: 2

fabric-changeid: b07616bab667af27b8ac3e5c0af0162faaf8408bac68b84e61679e824bb82276

cluster-changeid: 78fed8e45256feb96da8313d244c601409043f188e18c3d58e0e8fb3dd4950df <-- different
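The source does not describe how Netvisor ONE derives a change ID, so the following is only a toy model of why the same command can produce different cluster-changeid values on two nodes. It assumes, purely for illustration, that each change ID is a SHA-256 digest chained over the previous change ID, the originating node, and the command text; next_change_id is a hypothetical function, and the digests it produces will not match the real values shown above.

```python
import hashlib

ZERO_ID = "0" * 64  # initial cluster-changeid, as shown right after cluster-create


def next_change_id(prev_id, origin, command):
    """Hypothetical chaining rule: hash the previous ID, origin node, and command."""
    payload = "\n".join((prev_id, origin, command)).encode()
    return hashlib.sha256(payload).hexdigest()


# Shared history: VLAN 10 was created once and synced to both nodes.
base = next_change_id(ZERO_ID, "node_1", "vlan-create id 10 scope cluster")

# VLAN 11 was then created independently on each node while its peer was offline.
on_node_1 = next_change_id(base, "node_1", "vlan-create id 11 scope cluster")
on_node_2 = next_change_id(base, "node_2", "vlan-create id 11 scope cluster")

# Same command, same cluster-tid (2), but the IDs no longer match.
print(on_node_1 != on_node_2)
```

Under this assumed rule, any per-node input (origin, timestamp, and so on) is enough to make two independently committed transactions incompatible even though the resulting configuration is identical.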

Bring up nvOSd on node_1 by using the service svc-nvOSd start command and verify that both nodes are back online:

root@node_1: $ service svc-nvOSd start

CLI (network-admin@node_1*) > fabric-node-show format name,fab-name,state,fab-tid

name   fab-name state  fab-tid
------ -------- ------ -------
node_1 fab_1    online 3
node_2 fab_1    online 3

Try to create another cluster-scoped VLAN on node_1. The command fails with the error: fabric error: node_2 transactions have diverged

CLI (network-admin@node_1) > vlan-create id 12 scope cluster

vlan-create: fabric error: node_2 transactions have diverged <--- the cluster-changeid values have diverged, as shown below

CLI (network-admin@node_1) > transaction-node-show

switch name   local-tid fabric-tid cluster-tid fabric-changeid                                                  cluster-changeid
------ ------ --------- ---------- ----------- ---------------------------------------------------------------- ----------------------------------------------------------------
node_1 node_1 3         3          2           b07616bab667af27b8ac3e5c0af0162faaf8408bac68b84e61679e824bb82276 dc0082d1d86937ca79a0357d51f139e31c51b8f6a221f23d20dc4b2b5bb2dc26
node_2 node_2 3         3          2           b07616bab667af27b8ac3e5c0af0162faaf8408bac68b84e61679e824bb82276 78fed8e45256feb96da8313d244c601409043f188e18c3d58e0e8fb3dd4950df
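The states seen during this walkthrough can be told apart with a simple rule. The sketch below is an interpretation of the outputs above rather than a documented Netvisor ONE algorithm: it treats different cluster-tid values as one node lagging behind, and equal cluster-tid values with different cluster-changeid values as divergence. The function name classify and the labels it returns are hypothetical.

```python
def classify(node_a, node_b):
    """Compare two nodes' cluster transaction state (dicts of CLI field values)."""
    if node_a["cluster-tid"] != node_b["cluster-tid"]:
        return "behind"      # one node has not yet seen the latest transaction
    if node_a["cluster-changeid"] != node_b["cluster-changeid"]:
        return "diverged"    # same tid, but different transaction content
    return "in sync"


# Change IDs copied from the outputs in this walkthrough.
ID_VLAN10 = "ce27610495ae1c1e799dfcc566f9f8ad910879fba130c07025c86809a7873b1b"
ID_NODE_1 = "dc0082d1d86937ca79a0357d51f139e31c51b8f6a221f23d20dc4b2b5bb2dc26"
ID_NODE_2 = "78fed8e45256feb96da8313d244c601409043f188e18c3d58e0e8fb3dd4950df"

node_1_final = {"cluster-tid": 2, "cluster-changeid": ID_NODE_1}
node_2_restarted = {"cluster-tid": 1, "cluster-changeid": ID_VLAN10}
node_2_final = {"cluster-tid": 2, "cluster-changeid": ID_NODE_2}

print(classify(node_1_final, node_2_restarted))  # node_2 just after restart
print(classify(node_1_final, node_2_final))      # after VLAN 11 on both nodes
```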

CLI (network-admin@node_1) > cluster-show

switch name       state  cluster-node-1 cluster-node-2 tid mode   ports remote-ports cluster-sync-timeout(ms) cluster-sync-offline-count
------ ---------- ------ -------------- -------------- --- ------ ----- ------------ ------------------------ --------------------------
node_1 cluster1-2 online node_1         node_2         2   slave  33    33           2000                     3
node_2 cluster1-2 online node_1         node_2         2   master 33    33           2000                     3

Fix the transaction divergence issue by using the command:

CLI (network-admin@node_1) > transaction-cluster-divergence-fix

Warning: this will download config from cluster peer and rollback to tid before it is diverged. It is recommended to take config backup before running this command.

Please confirm y/n (Default: n):y
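The warning above says that the fix rolls back to the tid before the divergence. As a rough illustration of what that rollback point means, the sketch below finds the highest cluster-tid at which both nodes still agree. It assumes a per-node history mapping each cluster-tid to its cluster-changeid, which is a hypothetical data structure built here from the values observed in this walkthrough; the source does not show how Netvisor ONE stores or compares this history.

```python
def last_common_tid(history_a, history_b):
    """Return the highest tid up to which both histories have matching change IDs."""
    last = None
    for tid in sorted(set(history_a) & set(history_b)):
        if history_a[tid] != history_b[tid]:
            break  # first point of disagreement: everything after is diverged
        last = tid
    return last


ZERO_ID = "0" * 64
ID_VLAN10 = "ce27610495ae1c1e799dfcc566f9f8ad910879fba130c07025c86809a7873b1b"

# cluster-tid -> cluster-changeid, as observed on each node in the steps above.
node_1_history = {
    0: ZERO_ID,
    1: ID_VLAN10,
    2: "dc0082d1d86937ca79a0357d51f139e31c51b8f6a221f23d20dc4b2b5bb2dc26",
}
node_2_history = {
    0: ZERO_ID,
    1: ID_VLAN10,
    2: "78fed8e45256feb96da8313d244c601409043f188e18c3d58e0e8fb3dd4950df",
}

print(last_common_tid(node_1_history, node_2_history))
```

For the values in this walkthrough the two nodes agree through cluster-tid 1 (the VLAN 10 transaction), so that is the point a rollback would need to return to before the peer's configuration is applied.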