Virtual Machine Reset By HA-VMware Tools Heartbeat failure


Today while investigating one production issue, I came across an incident where a virtual machine restart was performed by HA.

Navigating through task and events for the VM in vCenter, I was seeing message similar to

This virtual machine reset by HA. Reason: VMware Tools heartbeat failure. A screenshot is saved at /vmfs/volumes/4c4850e4-0dcce710-28d9-00215-a5d36b8/XYZ/xyz-0.png

On investigating further, I came to find out that VMware tools running on VM was out of date.

vm-1

I checked for HA settings related to VM monitoring on the cluster  and did not see any aggressive value set there which might have caused the vm restart. The VM monitoring settings were set at 30 secs for failure interval which means that if for continuous 30 seconds VM is unable to send any heartbeat to HA, a restart of VM will be performed.

Keep in mind that this is not related to host failure where all VM running on failed host is restarted on remaining nodes of the cluster. In case of single VM failure, HA will restart the VM on same host.

vm-monitoring-settings

I started my investigation with checking hostd.log file of the Esxi host where VM was running and observed following messages which hints on VM restart

2016-11-06T06:00:52.281Z info hostd[50A40B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/521a8820-77938204-1461-002590c50caa/XYZ (4c4850e4-0dcce710-28d9-00215-a5d36b8)/XYZ (4c4850e4-0dcce710-28d9-00215-a5d36b8).vmx] State Transition (VM_STATE_ON -> VM_STATE_CREATE_SCREENSHOT)

2016-11-06T06:00:52.353Z info hostd[5114CB70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/521a8820-77938204-1461-002590c50caa/XYZ (4c4850e4-0dcce710-28d9-00215-a5d36b8)/XYZ (4c4850e4-0dcce710-28d9-00215-a5d36b8).vmx] State Transition (VM_STATE_CREATE_SCREENSHOT -> VM_STATE_ON)

2016-11-05T00:26:30.020Z info hostd[50F44B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/521a8820-77938204-1461-002590c50caa/XYZ (4c4850e4-0dcce710-28d9-00215-a5d36b8)/XYZ (4c4850e4-0dcce710-28d9-00215-a5d36b8).vmx] Send config update invoked

2016-11-05T00:26:30.049Z info hostd[502C6B70] [Originator@6876 sub=Snmpsvc] VmConfigListener: vm state change received, queueing reload request

2016-11-05T01:58:44.821Z info hostd[50507B70] [Originator@6876 sub=Snmpsvc] VmConfigListener: vm state change received, queueing reload request

2016-11-05T01:58:49.097Z info hostd[50244B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/521a8820-77938204-1461-002590c50caa/XYZ (4c4850e4-0dcce710-28d9-00215-a5d36b8)/XYZ (4c4850e4-0dcce710-28d9-00215-a5d36b8).vmx] Upgrade is required for virtual machine, version: 10

Next was to check vmware.log for VM XYZ. I observed following log entries for VM reset

2016-11-01T18:13:53.321Z| vmx| I120: Vix: [43327 vmxCommands.c:685]: VMAutomation_Reset. Trying hard reset
2016-11-01T18:13:53.321Z| vmx| W110:
2016-11-01T18:13:53.321Z| vmx| W110+
2016-11-01T18:13:53.321Z| vmx| W110+ VMXRequestReset
2016-11-01T18:13:53.321Z| vmx| I120: Vigor_Reset: Attaching to reset.
2016-11-01T18:13:53.321Z| vmx| I120: Stopping VCPU threads...

On digging vmware.log more, I found the messages regarding VMware tools version out of date

2016-10-27T04:25:34.389Z| vmx| I120: VMXVmdb_SetToolsVersionStatus: status value set to 'oldTools', 'supportedOld', install possible

2016-10-27T04:25:37.656Z| vcpu-3| I120: ToolsSetVersionWork did nothing; new tools version (9344) matches old Tools version

2016-10-27T04:25:37.832Z| Worker#1| I120: ToolsVersion: Status is supported old because this is status of monolithic version, and vsock_win2k3 and 25 more components are missing from the guest.

2016-10-27T04:25:37.832Z| Worker#1| I120: ToolsVersionGetStatusWorkerThread: Tools status 2 derived from environment

2016-10-27T04:25:37.832Z| vmx| I120: ToolsUpdateManifestInfoWorkerThreadDone: Compared tools manifest from host and from the guest. Status = 2.

2016-10-27T04:25:37.832Z| vmx| I120: ToolsUpdateManifestInfoWorkerThreadDone: Updating the manifest info.

2016-10-27T04:25:37.832Z| vmx| I120: VMXVmdb_SetToolsVersionStatus: status value set to 'oldTools', 'supportedOld', install possible

2016-10-27T04:25:37.832Z| vmx| I120: TOOLS installed legacy version 9344, available legacy version 9541

2016-10-27T04:25:37.832Z| vmx| I120: TOOLS manifest update status is 2

2016-10-27T04:25:37.832Z| vmx| I120: TOOLS can be autoupgraded.

2016-10-27T04:25:37.832Z| vmx| I120: TOOLS VM tools upgrade policy 1

2016-10-27T04:25:37.832Z| vmx| I120: TOOLS need not be autoupgraded according to upgrade policy.

2016-10-27T04:25:37.832Z| vmx| I120: TOOLS Resumed after power-on autoupgrade check.

2016-10-27T04:25:37.832Z| vmx| I120: TOOLS Setting autoupgrade-checked TRUE.

Also found VMware tools heartbeat timing out which eventually forced HA to reset the VM

2016-10-27T04:33:10.583Z| vcpu-0| I120: Tools: Tools heartbeat timeout.

2016-10-27T04:34:26.229Z| vcpu-3| I120: GuestMsg: Channel 0, Protocol error, state: 0

2016-10-27T04:34:26.230Z| vcpu-3| I120: GuestMsg: Cannot close channel 0: it is not opened

2016-10-27T04:34:28.597Z| vcpu-0| I120: Guest: toolbox: Version: build-1280544

2016-10-27T04:34:28.597Z| vcpu-0| W110: GuestRpc: application toolbox, changing channel 65535 -> 0

2016-10-27T04:34:28.597Z| vcpu-0| I120: GuestRpc: Channel 0, guest application toolbox.

2016-10-27T04:34:29.924Z| vcpu-1| I120: TOOLS autoupgrade protocol version 2

From the above logs it was evident that vmware tools were unable to send heartbeat to HA and which caused HA to perform a VM reset.

VMware KB-1027734 was very helpful to me in carrying out my investigation.

Conclusion: From the above I can conclude that Vmware tools update is needed on the VM so that HA should not try to reset VM again.

I hope this post is informational to you. Feel free to share this on social media if it is worth sharing. Be sociable 🙂

About Alex Hunt

Hi All I am Manish Kumar Jha aka Alex Hunt. I am currently working in VMware Software India Pvt Ltd as Operations System Engineer (vCloud Air Operations). I have around 5 Years of IT experience and have exposure on VMware vSphere, vCloud Director, RHEL and modern data center technologies like Cisco UCS and Cisco Nexus 1000v and NSX. If you find any post informational to you please press like and share it across social media and leave your comments if you want to discuss further on any post. Disclaimer: All the information on this website is published in good faith and for general information purpose only. I don’t make any warranties about the completeness, reliability and accuracy of this information. Any action you take upon the information you find on this blog is strictly at your own risk. The Views and opinions published on this blog are my own and not the opinions of my employer or any of the vendors of the product discussed.
This entry was posted in Vmware. Bookmark the permalink.

One Response to Virtual Machine Reset By HA-VMware Tools Heartbeat failure

  1. zsoltesz says:

    As I know, the ESXi performs other checks in case of Tools heartbeat time out. For example checks the disk IO.

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s