When NSX 6.2.3 was released earlier this year, a sentence in the release notes about failover caught my attention.
NSX Edge — On Demand Failover: Enables users to initiate on-demand failover when needed.
Well, since that moment, NSX 6.2.3 has been replaced by NSX 6.2.4 (sort of), but this statement is still true.
About NSX Edge High Availability
Before jumping straight into the subject, I would like to revisit the HA feature itself. High Availability ensures that the services provided by NSX Edge appliances remain available even when a hardware or software failure renders a single appliance unavailable. Please keep in mind that NSX Edge HA is not a fault-tolerant solution, but it helps to minimize failover downtime.
The high availability provided is stateful, meaning that NSX Edge HA synchronizes the connection tracker of the stateful firewall or the stateful information held by the load balancer.
Primary and secondary NSX Edge appliances are respectively in active and standby states, and all services run on the active appliance. The primary appliance maintains a heartbeat with the standby appliance and sends service updates through an internal interface.
If a heartbeat is not received from the primary appliance within the specified time (default value is 15 seconds), the primary appliance is declared dead. The standby appliance moves to the active state, takes over the interface configuration of the primary appliance, and starts the NSX Edge services that were running on the primary appliance.
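To make the dead-time rule concrete, here is a minimal sketch of the check a standby node could perform. It is not NSX's actual implementation: the class and method names are hypothetical, and treating the peer as dead exactly when the dead time has elapsed (rather than strictly after it) is an assumption. Only the 15-second default comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Hypothetical helper that tracks the last heartbeat seen from the peer."""
    deadtime: float = 15.0   # seconds; default dead time mentioned above
    last_seen: float = 0.0   # timestamp of the most recent heartbeat

    def record_heartbeat(self, now: float) -> None:
        # Called every time a heartbeat arrives from the peer appliance.
        self.last_seen = now

    def peer_is_dead(self, now: float) -> bool:
        # Assumption: the peer is declared dead once the full dead time
        # has elapsed without any heartbeat being recorded.
        return (now - self.last_seen) >= self.deadtime


monitor = HeartbeatMonitor()
monitor.record_heartbeat(now=100.0)
print(monitor.peer_is_dead(now=110.0))  # False: only 10 s of silence
print(monitor.peer_is_dead(now=115.0))  # True: 15 s of silence
```

In this model, the standby appliance would move to the active state as soon as `peer_is_dead` returns True.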
How to trigger NSX Edge failover?
Starting with NSX 6.2.3 / 6.2.4, you can trigger a high availability failover on the active NSX Edge appliance by setting the value of haAdminState to down. The haAdminState determines whether or not an NSX Edge appliance is participating in high availability. Both appliances in an NSX Edge high availability configuration normally have an haAdminState of up. When you set the haAdminState of the active appliance to down, it stops participating in high availability and informs the standby appliance of its status. The standby appliance then becomes active immediately.
To start, I will check which appliance is the active one in the vSphere Web Client.
Note: you can also check the status from the CLI with the show service highavailability set of commands.
NSX-edge-2-0> show service highavailability
Highavailability Service:
Highavailability Status: Active
Highavailability State since: 2016-09-13 14:27:38.071
Highavailability Unit Id: 0
Highavailability Unit State: Up
Highavailability Admin State: Up
Highavailability Running Nodes: 0, 1
Unit Poll Policy:
   Frequency: 3.75 seconds
   Deadtime: 15 seconds
Highavailability Services Status:
   Healthcheck Config Channel: Up
   Healthcheck Status Channel: Up
Highavailability Healthcheck Status:
   This unit : Up Active: 1
   Peer unit : Up Active: 0
      Session via vNic_1: 169.254.1.5:169.254.1.6 Up
Config Engine:
   HA Configuration: Enabled
   HA Admin State: Up
   Config Engine Status: Active
Highavailability Stateful Logical Status:
   File-Sync running
   Connection-Sync running
      xmit xerr rcv rerr
      990228 0 1948976 0
Notice the following parameters above:
- Highavailability Status: Active
- HA Admin State: Up
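If you want to check these two fields from a script rather than by eye, a small parser is enough. The sketch below assumes you have already captured the CLI output as a string (for example from an SSH session); the function name `parse_ha_status` is hypothetical, and the sample is a shortened copy of the output above.

```python
import re


def parse_ha_status(cli_output: str) -> dict:
    """Extract the two fields of interest from `show service highavailability`."""
    fields = {}
    for key in ("Highavailability Status", "HA Admin State"):
        # Match the exact label, then capture the first token after the colon.
        match = re.search(rf"{re.escape(key)}:\s*(\S+)", cli_output)
        fields[key] = match.group(1) if match else None
    return fields


sample = """\
Highavailability Status: Active
Highavailability Admin State: Up
Config Engine:
   HA Configuration: Enabled
   HA Admin State: Up
"""

print(parse_ha_status(sample))
# {'Highavailability Status': 'Active', 'HA Admin State': 'Up'}
```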
Now, if I want to manually force an HA failover, the steps are quite simple. First, I need to get the highAvailabilityIndex for each appliance with the following request:
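A minimal sketch of such a request in Python, under several assumptions: the endpoint path `/api/4.0/edges/{edgeId}/appliances` is what I recall from the NSX for vSphere API guide and should be verified against your version; the manager hostname, the edge ID `edge-1`, the `<appliances>` wrapper element, and the second appliance's `vm-63` are hypothetical placeholders. Authentication and certificate handling are deliberately left out, and the request is only built, not sent.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_get_appliances_request(manager: str, edge_id: str) -> urllib.request.Request:
    # Assumed endpoint; check the API guide for your NSX version.
    url = f"https://{manager}/api/4.0/edges/{edge_id}/appliances"
    return urllib.request.Request(url, method="GET",
                                  headers={"Accept": "application/xml"})


def list_ha_indexes(appliances_xml: str) -> list:
    """Return (highAvailabilityIndex, vmId) pairs found in the response body."""
    root = ET.fromstring(appliances_xml)
    return [
        (appliance.findtext("highAvailabilityIndex"), appliance.findtext("vmId"))
        for appliance in root.iter("appliance")
    ]


# Hypothetical, heavily trimmed response body used only for illustration.
SAMPLE_RESPONSE = """
<appliances>
  <appliance>
    <highAvailabilityIndex>0</highAvailabilityIndex>
    <vmId>vm-62</vmId>
  </appliance>
  <appliance>
    <highAvailabilityIndex>1</highAvailabilityIndex>
    <vmId>vm-63</vmId>
  </appliance>
</appliances>
"""

request = build_get_appliances_request("nsxmgr.example.local", "edge-1")
print(request.get_method(), request.full_url)
print(list_ha_indexes(SAMPLE_RESPONSE))
```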
Secondly (and lastly), the following API call will trigger the failover by taking down the active edge (defined by haIndex).
<appliance>
  <highAvailabilityIndex>0</highAvailabilityIndex>
  <vcUuid>503133bd-9e10-e606-e99b-4398608d7eaf</vcUuid>
  <vmId>vm-62</vmId>
  <haAdminState>down</haAdminState>
  <resourcePoolId>domain-c11</resourcePoolId>
  ...
</appliance>
Notice the <haAdminState>down</haAdminState> in the body?
Important note: pay attention to the other values in the body, as they might influence the placement of the edge appliance (ESXi host, resource pool, and so on).
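One way to respect that warning is to start from the appliance definition the API already returned and change nothing but haAdminState. The sketch below does this with Python's standard library. The endpoint path `/api/4.0/edges/{edgeId}/appliances/{haIndex}` is an assumption based on my recollection of the NSX API guide, and the hostname, edge ID, and function names are hypothetical. The request is built but not sent, and authentication is omitted.

```python
import urllib.request
import xml.etree.ElementTree as ET


def set_ha_admin_state(appliance_xml: str, state: str) -> bytes:
    """Return the appliance body with only haAdminState changed."""
    if state not in ("up", "down"):
        raise ValueError("state must be 'up' or 'down'")
    appliance = ET.fromstring(appliance_xml)
    node = appliance.find("haAdminState")
    if node is None:
        # Add the element if the original body did not carry it.
        node = ET.SubElement(appliance, "haAdminState")
    node.text = state
    return ET.tostring(appliance)


def build_put_request(manager: str, edge_id: str, ha_index: str,
                      body: bytes) -> urllib.request.Request:
    # Assumed endpoint; check the API guide for your NSX version.
    url = f"https://{manager}/api/4.0/edges/{edge_id}/appliances/{ha_index}"
    return urllib.request.Request(url, data=body, method="PUT",
                                  headers={"Content-Type": "application/xml"})


# Trimmed copy of the body shown above, with the admin state still up.
CURRENT_BODY = """
<appliance>
  <highAvailabilityIndex>0</highAvailabilityIndex>
  <vcUuid>503133bd-9e10-e606-e99b-4398608d7eaf</vcUuid>
  <vmId>vm-62</vmId>
  <haAdminState>up</haAdminState>
  <resourcePoolId>domain-c11</resourcePoolId>
</appliance>
"""

body = set_ha_admin_state(CURRENT_BODY, "down")
request = build_put_request("nsxmgr.example.local", "edge-1", "0", body)
print(request.get_method(), request.full_url)
print(body.decode())
```

Calling `set_ha_admin_state` with "up" on the same body produces the payload for bringing the appliance back into the HA pair.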
Note: I had a ping running during the failover; the 10.10.10.1 IP is a logical router (DLR) interface behind the Edge.
Using the vSphere Web Client, I can confirm that my second appliance is now the active one.
When the switchover takes place, a system event is displayed in the System Events tab.
A simple PUT operation with a body that sets haAdminState to up will make the appliance participate in high availability again.
Load Balancer and VPN services need to re-establish their TCP connections with the NSX Edge, so these services are disrupted for a short while. Logical switch connections and firewall sessions are synced between the primary and standby appliances, so they see no service disruption during the switchover.