How to Simulate and Fix Cluster Transaction Divergence
Consider a scenario in which the same sequence of commands is executed on both cluster nodes in a two-node fabric, but transaction divergence occurs because the cluster change IDs differ.
Follow the steps below to build a sample configuration that artificially causes transaction divergence between two cluster nodes in a fabric:
Note: To simulate transaction divergence, both selected nodes must be in the same cluster fabric.
- First, view the fabric node details by using the command:
CLI (network-admin@node_1) > fabric-node-show format name,fab-name,state,fab-tid
name   fab-name state  fab-tid
------ -------- ------ -------
node_1 fab_1    online 3
node_2 fab_1    online 3
- Create a cluster (cluster1-2) between two nodes, node_1 and node_2, by using the command:
CLI (network-admin@node_1) > cluster-create name <cluster_name> cluster-node-1 <node_1> cluster-node-2 <node_2>
For example:
CLI (network-admin@node_1*) > cluster-create name cluster1-2 cluster-node-1 node_1 cluster-node-2 node_2
CLI (network-admin@node_1*) > transaction-node-show layout vertical
name: node_1
local-tid: 3
fabric-tid: 3
cluster-tid: 0
fabric-changeid: b07616bab667af27b8ac3e5c0af0162faaf8408bac68b84e61679e824bb82276
cluster-changeid: 0000000000000000000000000000000000000000000000000000000000000000
Note: The asterisk (*) in the CLI prompt indicates the local switch on which the commands are executed.
- Create a cluster-scoped VLAN, for instance VLAN 10, on node_1 and verify that the cluster-tid and cluster-changeid have changed and are synchronized across the cluster nodes by using the commands:
CLI (network-admin@node_1*) > vlan-create id 10 scope cluster
CLI (network-admin@node_1*) > transaction-node-show layout vertical
name: node_1
local-tid: 3
fabric-tid: 3
cluster-tid: 1 <--- tid is increased
fabric-changeid: b07616bab667af27b8ac3e5c0af0162faaf8408bac68b84e61679e824bb82276
cluster-changeid: ce27610495ae1c1e799dfcc566f9f8ad910879fba130c07025c86809a7873b1b
CLI (network-admin@node_2*) > transaction-node-show layout vertical
name: node_2
local-tid: 3
fabric-tid: 3
cluster-tid: 1
fabric-changeid: b07616bab667af27b8ac3e5c0af0162faaf8408bac68b84e61679e824bb82276
cluster-changeid: ce27610495ae1c1e799dfcc566f9f8ad910879fba130c07025c86809a7873b1b
Note: As expected, the show outputs display synchronized data for both cluster nodes: the cluster-tid and cluster-changeid values are the same.
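The comparison described in the note above can also be done programmatically. The following is a minimal Python sketch, assuming the vertical-layout output of transaction-node-show has been captured as text (for example, copied from a terminal session). The helper names parse_vertical and in_sync are hypothetical and are not part of Netvisor ONE.

```python
def parse_vertical(output):
    """Parse 'key: value' lines from a vertical-layout show output.

    Trailing annotations such as '<--- tid is increased' are dropped.
    Lines without a colon (for example, the CLI prompt line) are ignored.
    """
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.split("<--")[0].strip()
    return fields


def in_sync(a, b):
    """Two nodes agree when both cluster-tid and cluster-changeid match."""
    return (a["cluster-tid"] == b["cluster-tid"]
            and a["cluster-changeid"] == b["cluster-changeid"])


# Sample values taken from the outputs shown in this step.
NODE_1 = """\
name: node_1
cluster-tid: 1 <--- tid is increased
cluster-changeid: ce27610495ae1c1e799dfcc566f9f8ad910879fba130c07025c86809a7873b1b
"""
NODE_2 = """\
name: node_2
cluster-tid: 1
cluster-changeid: ce27610495ae1c1e799dfcc566f9f8ad910879fba130c07025c86809a7873b1b
"""

print(in_sync(parse_vertical(NODE_1), parse_vertical(NODE_2)))
```

Comparing the change ID in addition to the tid matters because, as the later steps show, two nodes can report the same cluster-tid while holding different change IDs.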
- Now, to cause a divergence, ensure that the allow-offline-cluster-nodes parameter is enabled on both nodes (node_1 and node_2) by using the transaction-settings-modify command. This option is enabled by default on Netvisor ONE.
CLI (network-admin@node_1*) > transaction-settings-modify allow-offline-cluster-nodes
CLI (network-admin@node_1*) > transaction-settings-show
allow-offline-cluster-nodes: on
auto-recover: on
auto-recover-retry-time: 5m
reserve-retry-maximum: 10
reserve-retry-interval-maximum(s): 8
CLI (network-admin@node_2*) > transaction-settings-modify allow-offline-cluster-nodes
CLI (network-admin@node_2*) > transaction-settings-show
allow-offline-cluster-nodes: on
auto-recover: on
auto-recover-retry-time: 5m
reserve-retry-maximum: 10
reserve-retry-interval-maximum(s): 8
- Next, to artificially cause divergence, stop the nvOSd service on node_2 by using the service svc-nvOSd stop command and verify that node_2 is now in the offline state:
root@node_2: $ service svc-nvOSd stop
CLI (network-admin@node_2*) > fabric-node-show format name,fab-name,state,fab-tid
name   fab-name  state   fab-tid
------ --------- ------- -------
node_1 node_1    online  3
node_2 node_1    offline 3 <-- node_2 is offline
- Now create a cluster-scoped VLAN (VLAN 11) on node_1 by using the following command and view the transaction details:
CLI (network-admin@node_1*) > vlan-create id 11 scope cluster
Warning: cluster node node_2 not reachable, continuing anyway
Vlans 11 created
CLI (network-admin@node_1*) > transaction-node-show layout vertical
name: node_1
local-tid: 3
fabric-tid: 3
cluster-tid: 2 <--- tid is increased
fabric-changeid: b07616bab667af27b8ac3e5c0af0162faaf8408bac68b84e61679e824bb82276
cluster-changeid: dc0082d1d86937ca79a0357d51f139e31c51b8f6a221f23d20dc4b2b5bb2dc26 <-- changed
- Now, stop nvOSd on node_1 by using the service svc-nvOSd stop command and view the details:
root@node_1: $ service svc-nvOSd stop
CLI (network-admin@node_1) > fabric-node-show format name,fab-name,state,fab-tid
name   fab-name  state   fab-tid
------ --------- ------- -------
node_1 node_1    offline 3
node_2 node_1    offline 3 <--- Both cluster nodes are offline now
- Start nvOSd on node_2 by using the service svc-nvOSd start command and view the details:
root@node_2: $ service svc-nvOSd start
CLI (network-admin@node_2) > fabric-node-show format name,fab-name,state,fab-tid
name   fab-name  state   fab-tid
------ --------- ------- -------
node_1 node_1    offline 3
node_2 node_1    online  3
CLI (network-admin@node_2*) > transaction-node-show layout vertical
name: node_2
local-tid: 3
fabric-tid: 3
cluster-tid: 1
fabric-changeid: b07616bab667af27b8ac3e5c0af0162faaf8408bac68b84e61679e824bb82276
cluster-changeid: ce27610495ae1c1e799dfcc566f9f8ad910879fba130c07025c86809a7873b1b
Note: In the above show output, the cluster-tid and cluster-changeid display the earlier values (those shown after VLAN 10 was created, not the latest values from node_1) because transaction synchronization could not occur while the cluster peer (node_1) is down, even though auto-recovery is enabled on node_2.
- Run the same cluster-scoped command that was run earlier on node_1 (vlan-create id 11 scope cluster) on node_2:
CLI (network-admin@node_2*) > vlan-create id 11 scope cluster
Warning: cluster node node_1 not reachable, continuing anyway
Vlans 11 created
CLI (network-admin@node_2*) > transaction-node-show layout vertical
name: node_2
local-tid: 3
fabric-tid: 3
cluster-tid: 2
fabric-changeid: b07616bab667af27b8ac3e5c0af0162faaf8408bac68b84e61679e824bb82276
cluster-changeid: 78fed8e45256feb96da8313d244c601409043f188e18c3d58e0e8fb3dd4950df <-- different
- Start nvOSd on node_1 by using the service svc-nvOSd start command and verify that both nodes are back online:
root@node_1: $ service svc-nvOSd start
CLI (network-admin@node_1*) > fabric-node-show format name,fab-name,state,fab-tid
name   fab-name   state  fab-tid
------ ---------- ------ -------
node_1 node_1     online 3
node_2 node_1     online 3
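The online/offline checks used throughout these steps can be scripted against captured fabric-node-show output. The following is a minimal Python sketch; parse_table and offline_nodes are hypothetical helper names, and the parser assumes that no table cell contains a space, which holds for the columns requested in this procedure.

```python
def parse_table(output):
    """Turn a whitespace-separated show table into a list of row dicts."""
    lines = [ln for ln in output.splitlines() if ln.strip()]
    header = lines[0].split()
    rows = []
    for ln in lines[1:]:
        # Skip the separator row made only of dashes and spaces.
        if set(ln.replace(" ", "")) == {"-"}:
            continue
        # zip() stops at the header length, so trailing annotations
        # such as '<-- node_2 is offline' are ignored.
        rows.append(dict(zip(header, ln.split())))
    return rows


def offline_nodes(rows):
    """Return the names of all nodes whose state is not 'online'."""
    return [row["name"] for row in rows if row["state"] != "online"]


# Sample output taken from the step in which only node_2 was restarted.
SAMPLE = """\
name   fab-name  state   fab-tid
------ --------- ------- -------
node_1 node_1    offline 3
node_2 node_1    online  3
"""

print(offline_nodes(parse_table(SAMPLE)))  # prints ['node_1']
```

An empty list from offline_nodes corresponds to the state expected at this point in the procedure, with both cluster nodes back online.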
- Try running a cluster-scoped command on node_1. The command fails with the error: fabric error: node_2 transactions have diverged
CLI (network-admin@node_1) > vlan-create id 12 scope cluster
vlan-create: fabric error: node_2 transactions have diverged <--- as shown below cluster changeid is diverged
CLI (network-admin@node_1) > transaction-node-show
switch name   local-tid fabric-tid cluster-tid fabric-changeid                                                  cluster-changeid
------ ------ --------- ---------- ----------- ---------------------------------------------------------------- ----------------------------------------------------------------
node_1 node_1 3         3          2           b07616bab667af27b8ac3e5c0af0162faaf8408bac68b84e61679e824bb82276 dc0082d1d86937ca79a0357d51f139e31c51b8f6a221f23d20dc4b2b5bb2dc26
node_2 node_2 3         3          2           b07616bab667af27b8ac3e5c0af0162faaf8408bac68b84e61679e824bb82276 78fed8e45256feb96da8313d244c601409043f188e18c3d58e0e8fb3dd4950df
CLI (network-admin@node_1) > cluster-show
switch name       state  cluster-node-1 cluster-node-2 tid mode   ports remote-ports cluster-sync-timeout(ms) cluster-sync-offline-count
------ ---------- ------ -------------- -------------- --- ------ ----- ------------ ------------------------ --------------------------
node_1 cluster1-2 online node_1         node_2         2   slave  33    33           2000                     3
node_2 cluster1-2 online node_1         node_2         2   master 33    33           2000                     3
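The condition visible in the transaction-node-show table above (equal cluster-tid, different cluster-changeid) can be detected from captured output. The following Python sketch is a simplification under stated assumptions: it only compares the latest IDs of two nodes, the function names cluster_state and classify are hypothetical, and the labels "lagging", "diverged", and "in-sync" are illustrative terms rather than Netvisor ONE states.

```python
def cluster_state(output):
    """Map each switch to its (cluster-tid, cluster-changeid) pair."""
    lines = [ln.split() for ln in output.splitlines() if ln.strip()]
    header = lines[0]
    tid_col = header.index("cluster-tid")
    id_col = header.index("cluster-changeid")
    state = {}
    for cells in lines[1:]:
        # Skip the separator row, where every cell consists of dashes.
        if all(set(cell) == {"-"} for cell in cells):
            continue
        state[cells[0]] = (cells[tid_col], cells[id_col])
    return state


def classify(a, b):
    """Compare two (cluster-tid, cluster-changeid) pairs."""
    if a[0] != b[0]:
        return "lagging"   # tids differ: one node has not seen the latest change
    if a[1] != b[1]:
        return "diverged"  # same tid, but the change IDs disagree
    return "in-sync"


# Sample values taken from the table shown in this step.
SAMPLE = """\
switch name local-tid fabric-tid cluster-tid fabric-changeid cluster-changeid
------ ------ --------- ---------- ----------- --------------- ----------------
node_1 node_1 3 3 2 b07616bab667af27b8ac3e5c0af0162faaf8408bac68b84e61679e824bb82276 dc0082d1d86937ca79a0357d51f139e31c51b8f6a221f23d20dc4b2b5bb2dc26
node_2 node_2 3 3 2 b07616bab667af27b8ac3e5c0af0162faaf8408bac68b84e61679e824bb82276 78fed8e45256feb96da8313d244c601409043f188e18c3d58e0e8fb3dd4950df
"""

state = cluster_state(SAMPLE)
print(classify(state["node_1"], state["node_2"]))  # prints diverged
```

Columns are located by header name rather than by fixed position so that the sketch keeps working if the captured table uses a different column order.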
- Fix the transaction divergence issue by using the command:
CLI (network-admin@node_1) > transaction-cluster-divergence-fix
Warning: this will download config from cluster peer and rollback to tid before it is diverged. It is recommended to take config backup before running this command.
Please confirm y/n (Default: n):y