How to Fix the Cluster Transaction Divergence Issue


When a cluster transaction divergence occurs, you cannot execute any further cluster-scoped configuration commands on either of the cluster nodes. Starting with Netvisor ONE version 5.1.2, a command is available to fix the cluster transaction divergence issue.


To ensure high availability, Netvisor ONE allows cluster-scoped transactions to proceed during configuration changes even when one of the peer nodes in a cluster is offline. This behavior is enabled by default through the allow-offline-cluster-nodes parameter of the transaction-settings-modify command, and you can disable it if required. However, it can cause a transaction divergence, as illustrated in Figure 6-10.


For example, consider a cluster with two nodes, node1 and node2. Both nodes are online, configuration commands are executed on both, and the transactions stay synchronized up to a point where the transaction ID (TID) is T. After some time, node2 goes offline, but transactions continue on node1 and the TID on node1 changes to T1. Next, both nodes are offline, and then node2 comes back online and transactions continue on node2, where the TID changes to T2. When node1 also comes back online, the transactions on the two nodes have diverged.


Figure 6-10: Cluster Transaction Divergence


Figure 6-10 illustrates two possible transaction divergence cases:


  • Case 1: Even if the X transactions and the Y transactions are the same and are executed in the same sequence (T1 = T2), a transaction divergence still occurs because the cluster-change IDs differ.
  • Case 2: If the X and Y transactions are different sets of commands, or are executed in a different sequence, a transaction divergence also occurs.
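

The two cases can be told apart by comparing what each node recorded after the last common transaction. The following Python sketch is illustrative only: the Txn record, its fields, and the placeholder command strings are assumptions made for this example and do not reflect how Netvisor ONE stores transactions internally.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Txn:
    """Hypothetical transaction record used only for this illustration."""
    tid: int          # transaction ID
    change_id: str    # cluster-change ID
    command: str      # placeholder for the configuration command


def common_prefix_len(a: List[Txn], b: List[Txn]) -> int:
    """Count the leading transactions that are identical on both nodes."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def classify(node1: List[Txn], node2: List[Txn]) -> str:
    """Return 'in-sync', 'behind', 'case-1', or 'case-2'."""
    k = common_prefix_len(node1, node2)
    tail1, tail2 = node1[k:], node2[k:]
    if not tail1 and not tail2:
        return "in-sync"
    if not tail1 or not tail2:
        # One node only missed transactions: no divergence, roll forward.
        return "behind"
    # Compare the work done after the common point, ignoring change IDs.
    work1 = [(t.tid, t.command) for t in tail1]
    work2 = [(t.tid, t.command) for t in tail2]
    return "case-1" if work1 == work2 else "case-2"


shared = [Txn(1, "c1", "cmd-A"), Txn(2, "c2", "cmd-B")]
node1 = shared + [Txn(3, "n1-c3", "cmd-C")]
same_commands = shared + [Txn(3, "n2-c3", "cmd-C")]       # only change ID differs
different_commands = shared + [Txn(3, "n2-c3", "cmd-D")]  # different command

print(classify(node1, same_commands))
print(classify(node1, different_commands))
```

In this model, Case 1 is the situation where the diverged tails carry the same commands in the same order and differ only in their change IDs, while Case 2 covers any other difference in the tails.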


The auto-recovery feature, which is enabled by default on Netvisor ONE, automatically rolls the transactions forward to match the peer when an offline cluster node comes back online, provided that there is no transaction divergence. If there is a transaction divergence, you must manually roll back to a common transaction point before the auto-recovery feature can synchronize the nodes (for details, see the previous sections of Troubleshooting the Fabric). This manual process is error-prone: if you mistakenly roll back to a different transaction point, you have to repeat the process until you reach the common transaction point.


To mitigate this issue, Netvisor ONE introduces a new command: transaction-cluster-divergence-fix. You can run this local-scoped command on any of the cluster nodes. However, Pluribus recommends running it on a slave node where the divergence occurred.


In case 1 above, when you run the transaction-cluster-divergence-fix command on a node, Netvisor ONE updates the change IDs of all diverged transactions to match those on the peer node.


In case 2 above, when you run the command on a node, Netvisor ONE fetches the transaction details from the cluster peer and rolls back the configuration to a common transaction ID on the node where the command is executed. The node then re-synchronizes its configuration with the cluster peer, provided that the auto-recovery function (auto-recover), which automatically re-synchronizes transactions, is set to ON (it is enabled by default). For details on the auto-recovery function, see the Keeping Transactions in Sync with Auto-Recovery section.
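

The behavior described for the two cases can be pictured with a small simulation. This is a minimal sketch under stated assumptions, not the Netvisor ONE implementation: transactions are modeled as hypothetical (tid, change_id, command) tuples, fix_divergence is an invented helper name, and the auto_recover argument stands in for the auto-recover setting.

```python
from typing import List, Tuple

# Hypothetical model of a transaction: (tid, change_id, command)
Txn = Tuple[int, str, str]


def fix_divergence(
    local: List[Txn], peer: List[Txn], auto_recover: bool = True
) -> Tuple[str, List[Txn]]:
    """Simulate the fix on the local node; return (action, new local log)."""
    # Find the last point at which both logs agree (the common transaction).
    k = 0
    while k < len(local) and k < len(peer) and local[k] == peer[k]:
        k += 1
    local_tail, peer_tail = local[k:], peer[k:]

    if not local_tail or not peer_tail:
        # Nothing diverged: one side is merely behind, or both are in sync.
        return "no-divergence", list(local)

    local_work = [(tid, cmd) for tid, _, cmd in local_tail]
    peer_work = [(tid, cmd) for tid, _, cmd in peer_tail]
    if local_work == peer_work:
        # Case 1: keep the transactions, adopt the peer's change IDs.
        rewritten = [
            (tid, peer_change_id, cmd)
            for (tid, _, cmd), (_, peer_change_id, _) in zip(local_tail, peer_tail)
        ]
        return "rewrite-change-ids", local[:k] + rewritten

    # Case 2: roll back to the common transaction, then let auto-recovery
    # replay the peer's transactions if it is enabled.
    rolled_back = local[:k]
    if auto_recover:
        return "rollback", rolled_back + list(peer_tail)
    return "rollback", rolled_back


shared = [(1, "a1", "cmd-A"), (2, "a2", "cmd-B")]
peer = shared + [(3, "p3", "cmd-C")]
case1_local = shared + [(3, "l3", "cmd-C")]
case2_local = shared + [(3, "l3", "cmd-D"), (4, "l4", "cmd-E")]

print(fix_divergence(case1_local, peer))
print(fix_divergence(case2_local, peer))
print(fix_divergence(case2_local, peer, auto_recover=False))
```

In this model, the local log matches the peer after the fix in both cases when auto_recover is on. With auto_recover off, the case 2 result stops at the common transaction, which mirrors the dependency on the auto-recovery function described above.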


Note: Pluribus recommends backing up the current configuration before issuing the transaction-cluster-divergence-fix command. Also ensure that fabric communication between the nodes in the cluster is enabled, because this functionality depends on it.


To fix the cluster transaction divergence issue, run the transaction-cluster-divergence-fix command on the failed cluster node (running it on a slave node is recommended):


CLI (network-admin@node_slave) > transaction-cluster-divergence-fix

Warning: this will download config from cluster peer and rollback to tid before it is diverged. It is recommended to take config backup before running this command.

Please confirm y/n (Default: n):y


Related Commands


CLI (network-admin@pn-test-2) > transaction-settings-modify 


Specify one or more of the following options:


allow-offline-cluster-nodes|no-allow-offline-cluster-nodes

Specify to allow transactions to proceed even if a cluster node is offline. By default, the allow-offline-cluster-nodes parameter is enabled on nvOS.

auto-recover|no-auto-recover

Specify to automatically recover missed transactions.

auto-recover-retry-time duration: #d#h#m#s

Specify the retry time for transaction auto-recovery.

reserve-retry-maximum reserve-retry-maximum-number

Specify the maximum number of retries for transaction reservation; 0 for infinite.

reserve-retry-interval-maximum reserve-retry-interval-maximum-number (s)

Specify the maximum number of seconds to wait between transaction reservation retries.
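

When scripting around these settings, it can help to convert a duration written in the #d#h#m#s form used by auto-recover-retry-time into seconds. The Python sketch below is a hypothetical helper, not part of Netvisor ONE. It assumes that each component is optional, that components appear in the order days, hours, minutes, seconds, and that at least one component is present.

```python
import re

# Each component is optional but must appear in d, h, m, s order (assumption).
_DURATION = re.compile(
    r"^(?:(?P<d>\d+)d)?(?:(?P<h>\d+)h)?(?:(?P<m>\d+)m)?(?:(?P<s>\d+)s)?$"
)


def duration_to_seconds(text: str) -> int:
    """Convert a '#d#h#m#s' duration string to a number of seconds."""
    match = _DURATION.match(text)
    if not text or match is None:
        raise ValueError(f"not a #d#h#m#s duration: {text!r}")
    # Missing components default to zero.
    parts = {name: int(value) for name, value in match.groupdict(default="0").items()}
    return ((parts["d"] * 24 + parts["h"]) * 60 + parts["m"]) * 60 + parts["s"]


print(duration_to_seconds("5m"))        # 300
print(duration_to_seconds("1d2h3m4s"))  # 93784
```

Strings that do not follow the assumed form, such as an empty string or "10", raise ValueError instead of being silently treated as zero.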


