Unable to reach/ping Cluster role VIP
- Windows cluster with two nodes VM01 and VM02
- Two SQL Server 2016 FCIs are installed
- Each node has two NICs, one for the LAN and management network, and one for the heartbeat network
- The cluster has three network resources: a cluster IP address and two SQL instance addresses, which float between the two nodes depending on which one is active.
- Checked the Windows logs: nothing clear or related to the issue
- Checked the SQL logs: nothing related to the issue
- Patched Windows and SQL to the latest update: still can't ping
- Disabled the Symantec EP firewall: still can't ping
- Ran Windows failover cluster validation: all tests passed
Meanwhile, I asked the customer to fail over the File Server role to the second node, and the file server IP suddenly became unreachable. I then realized that the issue was affecting every Windows failover cluster role at the customer site.
My colleague, a senior network engineer, started checking the network switches and firewalls. He found that when we failed the role over from VM01 to VM02, the MAC address associated with the cluster IP addresses did not change to the MAC address of node VM02, as it should after a failover.
Commands he used during his troubleshooting:
- show ip arp 10.10.2.x ("SQL Cluster IP")
- clear ip arp 10.10.2.x ("SQL Cluster IP")
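The comparison behind these commands can be sketched in Python. This is only an illustration: the sample text imitates the usual column layout of Cisco IOS "show ip arp" output, and the IP address, MAC addresses and function names are hypothetical placeholders, not values from the incident.

```python
import re

# Hypothetical sample of "show ip arp" output; the IP and MAC are placeholders.
SAMPLE_ARP_OUTPUT = """\
Protocol  Address          Age (min)  Hardware Addr   Type   Interface
Internet  192.0.2.10              12  0050.56aa.0001  ARPA   Vlan10
"""


def normalize_mac(mac):
    """Reduce any common MAC notation to 12 lowercase hex digits."""
    return re.sub(r"[^0-9a-f]", "", mac.lower())


def mac_for_ip(arp_output, ip):
    """Return the normalized MAC the switch holds for ip, or None if absent."""
    for line in arp_output.splitlines():
        fields = line.split()
        # Data rows look like: Internet <ip> <age> <mac> ARPA <interface>
        if len(fields) >= 4 and fields[0] == "Internet" and fields[1] == ip:
            return normalize_mac(fields[3])
    return None


def arp_entry_is_stale(arp_output, ip, active_node_mac):
    """True when the switch still maps ip to a MAC other than the active node's."""
    learned = mac_for_ip(arp_output, ip)
    return learned is not None and learned != normalize_mac(active_node_mac)
```

If arp_entry_is_stale returns True right after a failover, the switch is still pointing the cluster IP at the previous node, which matches the symptom described above.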
On the Windows side, the number of gratuitous ARP (GARP) packets sent is set by the REG_DWORD value ArpRetryCount under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters:
0: don't send GARP
1: send GARP once only
2: send GARP twice
3: send GARP three times (the default value)
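A minimal Python sketch for checking this value on a node follows. It uses the standard winreg module, so the read function only works on Windows. The function names are hypothetical, and treating a missing value as the default of 3 is an assumption based on the list above.

```python
# Meaning of each ArpRetryCount value, as listed above.
ARP_RETRY_MEANING = {
    0: "no gratuitous ARP is sent",
    1: "gratuitous ARP is sent once",
    2: "gratuitous ARP is sent twice",
    3: "gratuitous ARP is sent three times (default)",
}

TCPIP_PARAMS = r"SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"


def describe_arp_retry_count(value):
    """Translate an ArpRetryCount value (or None if unset) into plain words."""
    if value is None:
        # Assumption: an absent value means the default of 3 applies.
        return ARP_RETRY_MEANING[3] + " [value not set]"
    return ARP_RETRY_MEANING.get(value, f"unexpected value {value}")


def read_arp_retry_count():
    """Windows only: return ArpRetryCount, or None if the value is absent."""
    import winreg  # imported here so the module still loads on other systems

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, TCPIP_PARAMS) as key:
        try:
            value, _value_type = winreg.QueryValueEx(key, "ArpRetryCount")
        except FileNotFoundError:
            return None
    return value
```

On a cluster node, print(describe_arp_retry_count(read_arp_retry_count())) shows whether GARP has been turned off on the server side.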
On the network side, make sure gratuitous ARP replies are enabled:
To enable them on Juniper EX and SRX platforms, use the following command:
set interfaces interface_name/number gratuitous-arp-reply
The interface can be a physical interface, logical interface, interface group, SVI or IRB.
To enable GARP on Cisco IOS, use the ip gratuitous-arps command.
Note: This is for troubleshooting purposes only. Normally we disable GARP on the server side. In a VMware environment (virtual machines hosted on ESXi), GARP must be disabled if you have active-active or active-passive sites, so that L2 packets are sent to the core switches.
Originally Posted @ Microsoft Wiki