Disablement of Front-Panel Ports During nvOSd Failure

When a switch fails, it is important both to recover quickly and to collect information that helps identify the root cause, so that the failure can be avoided or addressed in the future. These two objectives often interfere with each other: an expedited recovery may leave insufficient data to identify the root cause, while the data collection process may expose the customer to the failure for a longer period of time.

In the case of an nvOSd failure, NetVisor features data collection scripts that capture the critical forensic data required to troubleshoot offline after the switch has been recovered. The collection scripts can run for more than five minutes, during which peer devices may continue to direct traffic toward the failed switch. Depending on the nature of the switch failure, this may result in traffic being mis-routed or black-holed.

Starting with NetVisor OS release 7.1.0, whenever a failure of the nvOSd service is detected, NetVisor automatically brings down the front-panel ports before running the data collection scripts and eventually restarting the service. This change in behavior enables existing traffic to be redirected through redundant paths, so data capture proceeds as normal without putting existing traffic flows at risk.
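
The ordering described above can be illustrated with a minimal Python sketch. This is not NetVisor's implementation: the function name handle_service_failure and the three callbacks are hypothetical placeholders, used only to show that the ports are brought down first, data collection runs second, and the service restart comes last.

```python
from typing import Callable, List


def handle_service_failure(
    disable_ports: Callable[[], None],
    collect_diagnostics: Callable[[], None],
    restart_service: Callable[[], None],
) -> None:
    """Run the recovery steps in the order the text describes.

    All three callbacks are hypothetical stand-ins, not NetVisor APIs.
    """
    # 1. Bring down the front-panel ports so peers stop sending traffic here.
    disable_ports()
    # 2. Collect forensic data; this may take several minutes.
    collect_diagnostics()
    # 3. Restart the failed service only after collection has finished.
    restart_service()


# Record the order in which the placeholder steps are invoked.
events: List[str] = []
handle_service_failure(
    lambda: events.append("ports_down"),
    lambda: events.append("collect"),
    lambda: events.append("restart"),
)
print(events)
```

Because the ports go down before the long-running collection step, the length of the collection no longer determines how long peers keep forwarding traffic to the failed switch.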

This functionality is implemented for the following failure scenarios:

  • When the nvOSd service keep-alive times out (that is, when the number of missed keep-alive requests from the nvOS_mon service exceeds the configured limit).
  • When the nvOSd service encounters a segmentation fault that results in a process crash.
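
The keep-alive condition in the first scenario can be sketched as a simple counter. The class below is a hypothetical illustration, not the nvOS_mon logic: the name KeepAliveMonitor and the parameter missed_limit are invented for this example, and it assumes that the count tracks consecutive misses and resets on a successful reply, which the text above does not specify.

```python
class KeepAliveMonitor:
    """Hypothetical miss counter; not the actual nvOS_mon implementation."""

    def __init__(self, missed_limit: int) -> None:
        self.missed_limit = missed_limit  # configured limit of missed keep-alives
        self.missed = 0                   # current count of missed keep-alives

    def record(self, replied: bool) -> bool:
        """Record one keep-alive result.

        Returns True once the number of misses exceeds the limit.
        """
        if replied:
            # Assumption: a successful reply resets the miss count.
            self.missed = 0
        else:
            self.missed += 1
        return self.missed > self.missed_limit


monitor = KeepAliveMonitor(missed_limit=3)
# One successful reply followed by four misses in a row.
results = [monitor.record(r) for r in [True, False, False, False, False]]
print(results)  # [False, False, False, False, True]
```

With a limit of 3, the failure is declared on the fourth miss, matching the wording "exceeds the configured limit" rather than "reaches" it.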

This feature is available on the following platforms:

  • NRU02
  • NRU03
  • NRU-S0301 

Note: This functionality does not require any configuration changes. The behavior change takes effect upon upgrade to NetVisor OS version 7.1.0.
