Yesterday, one of the hosts in our recovery site PSODed, and that caused all kinds of errors in Zerto, primarily related to VPGs. In our case, this particular host had both inbound and outbound VPGs attached to its VRA, and we were unable to edit any of them to recover from the host failure (the Edit button in the VPG view was grayed out, as was the “Edit VPG” link when clicking into the VPG). Previously when this happened, we would simply delete the affected VPG(s) and recreate them, preserving the disk files as pre-seeded data.
When you have a few of these to re-do, it’s not a big deal; when you have 10 or more, however, it quickly becomes a problem.
One thing I discovered that I didn’t know was in the product: if you click into the VRA associated with the failed host and go to the MORE link, there’s an option there called “Change VM Recovery VRA.” It lets you tell Zerto that anything related to this VRA should now be pointed at another host. Once I did that, I was able to edit the VPGs. I needed to edit the outbound VPGs because they were actually reverse-protected workloads that were missing some configuration details (NIC settings and/or the journal datastore).
- Log on to the Zerto UI.
- Once logged on, click on the Setup tab.
- In the “VRA Name” column, locate the VRA associated with the failed host, and then click the link (name of VRA) to open the VRA in a new tab in the UI.
- Click on the tab at the top that contains VRA: Z-VRA-[hostName].
- Once you’re looking at the VRA page, click on the MORE link.
- From the MORE menu, click Change VM Recovery VRA.
- In the Change VM Recovery VRA dialog, check the box beside the VPG/VM, then select a replacement host. Once all VPGs have been updated, click Save.
Once you’ve saved your settings, validate that the VPG can be edited, and/or is once again replicating.
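As an illustration of that selection step, here’s a minimal Python sketch that picks out the VPGs still pointed at a failed VRA. The record shape (`VpgName` / `RecoveryVraName` keys) and the inventory below are hypothetical — Zerto does expose VRA and VPG inventory over its REST API, but verify the actual field names against the API reference for your version:

```python
# Hedged sketch: find every VPG whose recovery VRA is the failed one.
# The dict keys here are assumptions, not confirmed Zerto API fields.
def vpgs_on_failed_vra(vpgs, failed_vra_name):
    """Return the names of VPGs whose recovery VRA matches failed_vra_name."""
    return [v["VpgName"] for v in vpgs
            if v.get("RecoveryVraName") == failed_vra_name]

# Hypothetical inventory for a failed host "esx02":
inventory = [
    {"VpgName": "VPG-App01", "RecoveryVraName": "Z-VRA-esx02"},
    {"VpgName": "VPG-App02", "RecoveryVraName": "Z-VRA-esx03"},
    {"VpgName": "VPG-Db01",  "RecoveryVraName": "Z-VRA-esx02"},
]
print(vpgs_on_failed_vra(inventory, "Z-VRA-esx02"))  # → ['VPG-App01', 'VPG-Db01']
```

Everything that helper returns is a VPG you’d expect to become editable again after the Change VM Recovery VRA step.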
Following an upgrade to ESXi 6.0 U2, this particular issue has popped up a few times, and while we still have a case open with VMware support in an attempt to understand the root cause, we have found a workaround that doesn’t require any downtime for the running workloads or the host in question. The issue doesn’t discriminate between iSCSI and Fibre Channel storage; we’ve seen it with both (SolidFire – iSCSI, IBM SVC – FC). One common theme is that it happens in clusters with 10 or more hosts and many datastores. It may also be helpful to know that we have two datastores shared between multiple clusters, used for syslogs and ISOs/templates.
Note: In order to perform the steps in this how-to, you will need SSH already running and available on the host, or access to the DCUI.
- Following a host or cluster storage rescan, an ESXi host (or hosts) stops responding in vCenter while still running VMs (host isolation).
- Attempts to reconnect the host via vCenter fail.
- A direct (thick client) connection to the host fails as well.
- Running services.sh restart from the CLI hangs after “running sfcbd-watchdog stop”; the last thing on the screen is “Exclusive access granted.”
- At this point, /var/log/vmkernel.log displays: “Alert: hostd detected to be non-responsive”
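If you’re triaging more than one host, the log check can be scripted. A minimal sketch, assuming only that the alert string appears verbatim in /var/log/vmkernel.log:

```python
# The marker string is the vmkernel.log alert quoted above.
MARKER = "hostd detected to be non-responsive"

def hostd_unresponsive(log_lines):
    """Return True if any log line contains the hostd non-responsive alert."""
    return any(MARKER in line for line in log_lines)
```

On the host itself you could feed it the file directly, e.g. `hostd_unresponsive(open("/var/log/vmkernel.log"))`.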
The following troubleshooting steps were obtained from VMware KB article 1003409:
- Verify the host is powered on.
- Attempt to reconnect the host in vCenter.
- Verify that the ESXi host is able to respond back to vCenter at the correct IP address, and vice versa.
- Verify that network connectivity exists from vCenter to the ESXi host’s management IP or FQDN.
- Verify that port 903 TCP/UDP is open between vCenter and the ESXi host.
- Try to restart the ESXi management agents via the DCUI or SSH to see if that resolves the issue.
- Verify if the hostd process has stopped responding on the affected host.
- Verify if the vpxa agent has stopped responding on the affected host.
- Verify if the host has experienced a PSOD (Purple Screen of Death).
- Verify if there is an underlying storage connectivity (or other storage-related) issue.
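The connectivity checks in steps 4 and 5 are easy to script from the vCenter side. A minimal Python sketch of the TCP reachability test — the host name below is a placeholder, and note this only covers the TCP half of the port 903 check:

```python
# Hedged sketch: TCP reachability check, as in KB step 5 (TCP only, not UDP).
import socket

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Usage would look like `port_open("esxi01.example.com", 903)` — hypothetical host name, run from the vCenter server.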
Working through these troubleshooting steps got me to step 7, where I was able to determine that hostd was not responding on the host. The vmkernel.log alert above further supports this observation.
These are the steps I’ve taken to remedy the problem without having to take the VMs down or reboot the host:
- Since the hostd service is not responding, the first thing to do is run /etc/init.d/hostd restart from a second SSH session (leaving the first session, with the hung services.sh restart process, alone).
- While the hostd restart command runs, the hung session will update and print a message.
- When you see that message, press Enter to be returned to the shell prompt.
- Now run /etc/init.d/vpxa restart, which is the vCenter Agent on the host.
- After that completes, re-run services.sh restart; this time it should run all the way through successfully.
- Once all the services have restarted, return to the vSphere Web Client and refresh the screen. The host should now be back to being managed, and no longer disconnected.
- At this point, you can either leave the host running as-is, or put it into maintenance mode (vMotion all VMs off). Export the log bundle if you’d like VMware support to help analyze root cause.
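The restart sequence above can be sketched as a small script. The init-script paths are the standard ESXi ones from the steps; the runner is injectable so the ordering logic can be exercised off-host — a sketch, not a drop-in tool:

```python
# Hedged sketch of the remedy sequence: hostd first, then vpxa,
# then a full services.sh pass, stopping at the first failure.
import subprocess

SEQUENCE = [
    ["/etc/init.d/hostd", "restart"],  # host agent first
    ["/etc/init.d/vpxa", "restart"],   # then the vCenter agent
    ["services.sh", "restart"],        # full pass should now complete
]

def restart_agents(runner=subprocess.call):
    """Run each restart in order; return (ok, first_failed_command)."""
    for cmd in SEQUENCE:
        if runner(cmd) != 0:
            return False, cmd
    return True, None
```

On a real host you’d call `restart_agents()` over SSH; passing a stub runner lets you verify the ordering without touching the host.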
I hope you find this useful, and if you do, please comment and share!