Zerto Virtual Manager Outage, Replication, and Self-Healing

I decided to explore what happens when a ZVM (Zerto Virtual Manager) in either the protected site or the recovery site is down for a period of time, what happens when it comes back into service, and, most importantly, how an outage of either ZVM affects replication, journal history, and the ability to recover a workload.

Before getting into it, I have to admit that I was happy to see how resilient the platform is through this test, and how the ability to self-heal is a built-in “feature” that rarely gets talked about.

Questions:

  • Does ZVR still replicate when a ZVM goes down?
  • How does a ZVM being down affect checkpoint creation?
  • What can be recovered while the ZVM is down?
  • What happens when the ZVM is returned to service?
  • What happens if the ZVM is down longer than the configured Journal History setting?

Acronym Decoder & Explanations

ZVM - Zerto Virtual Manager
ZVR - Zerto Virtual Replication
VRA - Virtual Replication Appliance
VPG - Virtual Protection Group
RPO - Recovery Point Objective
RTO - Recovery Time Objective
BCDR - Business Continuity/Disaster Recovery
CSP - Cloud Service Provider
FOT - Failover Test
FOL - Failover Live

Does ZVR still replicate when a ZVM goes down?

The quick answer is yes.  Once a VPG is created, the VRAs handle all replication.    The ZVM takes care of inserting and tracking checkpoints in the journal, as well as automation and orchestration of Virtual Protection Groups (VPGs), whether it be for DR, workload mobility, or cloud adoption.

In the protected site, I took the ZVM down for over an hour via power-off to simulate a failure.  Prior to that, I made note of the last checkpoint created.  As the ZVM went down, within a few seconds the dashboard on the recovery site ZVM reported RPO as 0 (zero), VPG health went red, and I received an alert stating “The Zerto Virtual Manager is not connected to site Prod_Site…”

The Zerto Virtual Manager is not connected to site Prod_Site

 

Great, so the protected site ZVM is down now and the recovery site ZVM noticed.  The next step for me was to verify that despite the ZVM being down, the VRA continued to replicate my workload.  To prove this, I opened the file server and copied the fonts folder (C:\Windows\Fonts) to C:\Temp (total size of data ~500MB).
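For reference, the copy itself was nothing fancy.  The PowerShell one-liner below, run inside the protected VM, does roughly what I did in Explorer (the C:\Temp\Fonts destination is just my example path):

    # Copy the Fonts folder (~500MB) into C:\Temp to generate changed blocks
    # for the VRA to replicate while the ZVM is offline.
    Copy-Item -Path 'C:\Windows\Fonts' -Destination 'C:\Temp\Fonts' -Recurse -Force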

As the copy completed, I opened the performance tab of the sending VRA to see whether the network transmit rate went up, indicating data being sent:

VRA performance in vSphere, showing data being transmitted to the remote VRA in the recovery site.

Following that, I opened the performance monitor on the receiving VRA and looked at two stats: Data receive rate, and Disk write rate, both indicating activity at the same timeframe as the sending VRA stats above:

Data receive rate (Network) on the receiving/recovery VRA; Disk write rate on the receiving/recovery VRA

As you can see, despite the ZVM being down, replication continues, though with caveats you need to be aware of:

  • No new checkpoints are being created in the journal
  • Existing checkpoints up to the last one created are all still recoverable, meaning you can still recover VMs (VPGs), Sites, or files.

Even though replication is still taking place, you will only be able to recover to the latest (last recorded) checkpoint from before the ZVM went down.  When the ZVM returns, checkpoints are once again created; however, you will not see checkpoints for the time the ZVM was unavailable.  In my testing, the same was true if the recovery site ZVM went down while the protected site ZVM was still up.

How does the ZVM being down affect checkpoint creation?

If I take a look at the journal history for the target workload (file server), I can see that since the ZVM went away, no new checkpoints have been created.  So, while replication continues, no new checkpoints are tracked while the ZVM is down, since one of its jobs is to track checkpoints.

Last checkpoint created over 30 minutes ago, right before the ZVM was powered off.

 

What can be recovered while the ZVM is down?

Despite no new checkpoints being created, FOT, FOL, VPG Clone, Move, and File Restore operations are still available using the existing journal checkpoints.  Given that this was something I had never tested before, this was really impressive.

One thing to keep in mind though is that this will all depend on how long your Journal history is configured for, and how long that ZVM is down.  I provide more information about this specific topic further down in this article.

What happens when the ZVM is returned to service?

So now that I’ve shown what is going on when the ZVM is down, let’s see what happens when it is back in service.  To do this, I just need to power it back up, and allow the services to start, then see what is reported in the ZVM UI on either site.

As soon as all services were back up on the protected site ZVM, the recovery site ZVM alerted that a Synchronization with site Prod_Site was initiated:

Synchronizing with site Prod_Site

Recovery site ZVM Dashboard during site synchronization.

The next step here is to see what our checkpoint history looks like.  Taking a look at the image below, we can see when the ZVM went down, and that there is a noticeable gap in checkpoints.  However, as soon as the ZVM was back in service, checkpoint creation resumed, with only the time during the outage being unavailable.

Checkpoints resume

 

What happens if the ZVM is down longer than the configured Journal History setting?

In my lab, for the above testing, I set the VPG history to 1 hour.  That said, if you take a look at the last screenshot, older checkpoints are still available (showing 405 checkpoints).  When I first tried to run a failover test after this experiment, I was presented with checkpoints that went beyond an hour.  When I selected the oldest checkpoint in the list, a failover test would not start, even though the “Next” button in the FOT wizard was not grayed out.  This led me to believe that it may take a minute or two for the journal to be cleaned up.

Because I was not able to move forward with a failover test (FOT), I went back in to select another checkpoint, and this time the older checkpoints (from over an hour ago) were gone.  Selecting the oldest checkpoint at this point allowed me to run a successful FOT because it was within range of the journal history setting.  Lesson learned here (note to self): give Zerto a minute to figure things out, you just disconnected the brain from the spine!

Updated Checkpoints within Journal History Setting

Running a failover test to validate successful usage of checkpoints after ZVM outage:

File Server FOT in progress, validating fonts folder made it over to recovery site.

And… a recovery report to prove it:

Recovery Report - Successful FOT

 

Summary and Next Steps

So in summary, Zerto is self-healing and can recover from a ZVM being down for a period of time.  That said, there are some things to watch out for, which include knowing what your configured journal history setting is, and how a ZVM being down longer than the configured history setting affects your ability to recover.

You can still recover; however, you will start losing older checkpoints as time goes on while the ZVM is down.  This is because of the first-in-first-out (FIFO) nature of how the journal works.  You still have the replica disks, and the journal continues committing to them as time goes on, so losing history doesn’t mean you’re lost; you will just end up breaching your SLA for history, which will rebuild over time once the ZVM is back up.

As a best practice, it is recommended that you have a ZVM in each of your protected sites and in each of your recovery sites for full resilience; after all, if you lose one of the ZVMs, you will need at least the protected or recovery site ZVM available to perform a recovery.  The case is different if you have a single ZVM.  If you must have a single ZVM, put it in the recovery site, not the protected site, because chances are your protected site is what you’re accounting for going down in any planned or unplanned event.

In the next article, I’ll be exploring this very example of a single ZVM and how losing it affects your resiliency.  I’ll also be testing some ways to potentially protect that single ZVM in the event it is lost.

Thanks for reading!  Please comment and share, because I’d like to hear your thoughts, and am also interested in hearing how other solutions handle similar outages.


Zerto Automation with PowerShell and REST APIs

Zerto is simple to install and simple to use, but it gets better with automation!  While performing tasks within the UI can quickly become second nature, you can find yourself spending a lot of time repeating the same tasks over and over again.  I get it, repetition builds memory, but it gets old.  As your environment grows, so does the amount of time it takes to do things manually.  Why do things manually when there are better ways to spend your time?

Zerto provides great documentation for automation via PowerShell and REST APIs, along with Zerto cmdlets that you can download and install as a PowerShell add-on to do more from the CLI.  One of my favorite things is that the team has provided functional sample scripts that are pretty much ready to go, so you don’t have to develop them yourself for common tasks (a short example follows the list below), including:

  • Querying and Reporting
  • Automating Deployment
  • Automating VM Protection (including vRealize Orchestrator)
  • Bulk Edits to VPGs or even NIC settings, including Re-IP and PortGroup changes
  • Offsite Cloning
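To give you an idea of what the querying and reporting piece can look like, here is a minimal PowerShell sketch that authenticates to the ZVM’s REST API and lists the VPGs it knows about.  Treat it as a starting point only: the /v1/session/add and /v1/vpgs paths, the x-zerto-session header, and the property names in the output are based on my reading of the API and should be verified against the REST API Reference Guide listed below (zvm.example.com is a placeholder, and a self-signed ZVM certificate may need extra handling):

    # Sketch: authenticate to the ZVM REST API and list VPGs.
    # Verify endpoint paths, headers, and property names against the REST API Reference Guide.
    $zvm  = 'https://zvm.example.com:9669'   # placeholder ZVM address
    $cred = Get-Credential                   # credentials the ZVM accepts (typically your vCenter credentials)

    # Build a Basic auth header for the session request
    $pair  = '{0}:{1}' -f $cred.UserName, $cred.GetNetworkCredential().Password
    $basic = [Convert]::ToBase64String([Text.Encoding]::UTF8.GetBytes($pair))

    # Step 1: open a session; the token comes back in the x-zerto-session response header
    $resp    = Invoke-WebRequest -Uri "$zvm/v1/session/add" -Method Post `
                 -Headers @{ Authorization = "Basic $basic" } -ContentType 'application/json'
    $session = $resp.Headers['x-zerto-session']

    # Step 2: query the VPG collection and show a few fields
    $vpgs = Invoke-RestMethod -Uri "$zvm/v1/vpgs" -Headers @{ 'x-zerto-session' = $session }
    $vpgs | Format-Table VpgName, VmsCount, Status   # adjust property names if they differ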

For automated failover testing, Zerto includes an Orchestrator for vSphere, which I will cover in a separate set of posts.

To get started with PowerShell and the RESTful APIs, head over to the Technical Documentation section of My Zerto and download the Zerto PowerShell Cmdlets (requires a MyZerto login) along with the following guides.  Stay tuned for future posts where I try these scripts out, offer a little insight into how to run them, and share how I’ve used them!

  • Rest APIs Online Help – Zerto Virtual Replication
    • The REST APIs provide a way to automate many DR related tasks without having to use the Zerto UI.
  • REST API Reference Guide – Zerto Virtual Replication
    • This guide will help you understand how to use the ZVR RESTful APIs.
  • REST API Reference Guide – Zerto Cloud Manager
    • This guide explains how to use the ZCM RESTful APIs.
  • PowerShell Cmdlets Guide – Zerto Virtual Replication
    • Installation and use guide for the ZVR Windows PowerShell cmdlets.
  • White Paper – Automating Zerto Virtual Replication with PowerShell and REST APIs
    • This document includes an overview of how to use ZVR REST APIs with PowerShell to automate your virtual infrastructure.  This is the document that also includes several functional scripts that take the hard work out of everyday tasks.

If you’ve automated ZVR using PowerShell or REST APIs, I’d like to hear how you’re using it and how it’s changed your overall BCDR strategy.

I myself am still getting started with automating ZVR, but am really excited to share my experiences, and hopefully, help others along the way!  In fact, I’ve already been working with bulk VRA deployment, so check back or follow me on twitter @EugeneJTorres for updates!


Changing a VM’s Recovery VRA When a Host Crashes

Yesterday, we had one host in our recovery site PSOD, and that caused all kinds of errors in Zerto, primarily related to VPGs.  In our case, this particular host had both inbound and outbound VPGs attached to its VRA, and we were unable to edit any of them to recover from the host failure (the edit button in the VPG view was grayed out, along with the “Edit VPG” link when clicking into the VPG).  Previously when this would happen, we would just delete the VPG(s) and recreate them, preserving the disk files as pre-seeded data.

When you have a few of these to re-do, it’s not a big deal, however, when you have 10 or more, it quickly becomes a problem.

One thing I discovered that I didn’t know was in the product is that if you click into the VRA associated with the failed host and go to the MORE link, there’s an option in there to “Change VM Recovery VRA.”  This option allows you to tell Zerto that anything related to this VRA should now be pointed at another VRA.  Once I did that, I was able to edit the VPGs.  I needed to edit the VPGs that were outbound, because they were actually reverse-protected workloads that were missing some configuration details (NIC settings and/or journal datastore).

Here’s how:

  1. Log on to the Zerto UI.
  2. Once logged on, click on the Setup tab.

  3. In the “VRA Name” column, locate the VRA associated with the failed host, and then click the link (name of VRA) to open the VRA in a new tab in the UI.

  4. Click on the tab at the top that contains VRA: Z-VRA-[hostName].
  5. Once you’re looking at the VRA page, click on the MORE link.

  6. From the MORE menu, click Change VM Recovery VRA.

  7. In the Change VM Recovery VRA dialog, check the box beside the VPG/VM, then select a replacement host. Once all VPGs have been updated, click Save.

Once you’ve saved your settings, validate that the VPG can be edited, and/or is once again replicating.

 


ESXi 6.0 U2 Host Isolation Following Storage Rescan

Following an upgrade to ESXi 6.0 U2, this particular issue has popped up a few times, and while we still have a case open with VMware support in an attempt to understand root cause, we have found a successful workaround that doesn’t require any downtime for the running workloads or the host in question.  This issue doesn’t discriminate between iSCSI or Fibre Channel storage, as we’ve seen it in both instances (SolidFire – iSCSI, IBM SVC – FC).  One common theme with where we are seeing this problem is that it is happening in clusters with 10 or more hosts, and many datastores.  It may also be helpful to know that we have two datastores that are shared between multiple clusters.  These datastores are for syslogs and ISOs/Templates.

 

Note: In order to perform the steps in this how-to, you will need to already
have SSH running and available on the host, or access to the DCUI.

Observations

  • Following a host or cluster storage rescan, one or more ESXi hosts stop responding in vCenter while still running VMs (host isolation)
  • Attempts to reconnect the host via vCenter don’t work
  • Direct client connection (thick client) to the host doesn’t work
  • Attempts to run services.sh from the CLI cause the script to hang after “running sfcbd-watchdog stop“.  The last thing on the screen is “Exclusive access granted.”
  • The /var/log/vmkernel.log displays the following at this point: “Alert: hostd detected to be non-responsive”

Troubleshooting

The following troubleshooting steps were obtained from VMware KB Article 1003409

  1. Verify the host is powered on.
  2. Attempt to reconnect the host in vCenter.
  3. Verify that the ESXi host is able to respond back to vCenter at the correct IP address and vice versa.
  4. Verify that network connectivity exists from vCenter to the ESXi host’s management IP or FQDN.
  5. Verify that port 903 TCP/UDP is open between the vCenter and the ESXi host.
  6. Try to restart the ESXi management agents via DCUI or SSH to see if it resolves the issue.
  7. Verify if the hostd process has stopped responding on the affected host.
  8. Verify if the vpxa agent has stopped responding on the affected host.
  9. Verify if the host has experienced a PSOD (Purple Screen of Death).
  10. Verify if there is an underlying storage connectivity (or other storage-related) issue.

Following these troubleshooting steps left me at step 7, where I was able to determine that hostd was not responding on the host.  The vmkernel.log entry further supports this observation.

Resolution/Workaround Steps

These are the steps I’ve taken to remedy the problem without having to take the VMs down or reboot the host:

  1. Since the hostd service is not responding, the first thing to do is run /etc/init.d/hostd restart from a second SSH session window (leaving the first one with the hung services.sh restart script process).
  2. While running the hostd restart command, the hung session will update, and produce the following:

  3. When you see that message, press enter to be returned to the shell prompt.
  4. Now run /etc/init.d/vpxa restart, which is the vCenter Agent on the host.
  5. After that completes, re-run services.sh restart and this time it should run all the way through successfully.
  6. Once services are all restarted, return to the vSphere Web Client and refresh the screen.  You should now see the host is back to being managed, and is no longer disconnected.
  7. At this point, you can either leave the host running as-is, or put it into maintenance mode (vMotion all VMs off).  Export the log bundle if you’d like VMware support to help analyze root cause.

 

I hope you find this useful, and if you do, please comment and share!


Zerto: Dual NIC ZVM

Something I recently ran into with Zerto (and this can happen with anything else) was the dilemma of protecting remote sites that happen to have IP addresses that are identical in both the protected and recovery sites (it doesn’t happen often).  And no, this wasn’t planned for; it was discovered during my Zerto deployment in what we’ll call the protected sites.

Luckily, our network team had provisioned two new networks that are isolated, and connected to these protected sites via MPLS.  Those two new networks do not have the ability to talk back to our existing enterprise network without firewalls getting involved, and this is by design since we are basically consolidating data centers while absorbing assets and virtual workloads from a recently acquired company.

When I originally installed the ZVM in my site (which we’ll call the recovery site), I had used IP addresses for the ZVM and VRAs that were part of our production network, and not the isolated network set aside for this consolidation.  Note: I installed the Zerto infrastructure in the recovery site ahead of time, before the isolated networks were brought up for discussion.  So, because I needed to get this onto the isolated network in order to replicate data from the protected sites to the recovery site, I set out to re-IP the ZVM and the VRAs.  Before I could do that, I needed to provide justification for firewall exceptions in order for the ZVM in the recovery site to link to the vCenter, communicate with ESXi hosts for VRA deployment, and authenticate the computer, user, and service accounts in use on the ZVM.  Oh, and I also needed DNS and time services.

The network and security teams asked if they could NAT the traffic, and my answer was “no” because Zerto doesn’t support replication using NAT.  That was easy, and now the network team had to create firewall exceptions for the ports I needed.

Well,  as expected, they delivered what I needed.  To make a long story short, it all worked, and then about 12 hours before we were scheduled to perform our first VPG move, it all stopped working, and no one knew why.  At this point, it was getting really close to us pulling the plug on the migration the following day, but I was determined to get this going and prevent another delay in the project.

When looking for answers, I contacted my Zerto SE, reached out on Twitter, and also contacted Zerto Support.  By the time I was on the phone with support, we couldn’t do anything because communication to the resources I needed was not working.  We couldn’t perform a Zerto reconfigure to re-connect to the vCenter, and at this point, I had about 24 VPGs that were reporting they were in sync (lucky!), but ZVM-to-ZVM communication wasn’t working, and the recovery site ZVM was not able to communicate with vCenter, so I wouldn’t have been able to perform the cutover.  Since support couldn’t help me in that instance, I scoured the Zerto KB looking for an alternate way of configuring this where I could get the best of both worlds and still stay isolated as needed.

I eventually found this KB article that explained that not only is it supported, but it’s also considered a best practice in CSP or large environments to dual-NIC the ZVM to separate management from replication traffic.  I figured I was all out of ideas, and the back-and-forth with firewall admins wasn’t getting us anywhere, so I might as well give this a go.  While the KB article offers the solution, it doesn’t tell you exactly how to do it, outside of adding a second vNIC to the ZVM.  There were some steps missing, which I figured out within a few minutes of completing the configuration.  Oh, and part of this required me to re-IP the original NIC back to the original IP I used, which was on our production network.  Doing this re-opened the lines of communication to vCenter, ESXi hosts, AD, DNS, SMTP, etc.  Now I had to focus on the vNIC that was to be used for all ZVM-to-ZVM and replication traffic.  In a few short minutes, I was able to get communication going the way I needed it, so the final thing I needed to do was re-configure Zerto to use the new vNIC for its replication-related activities.  I did that, and while I was able to re-establish the production network communications I needed, I now wasn’t able to access the remote sites (ZVM to ZVM) or the recovery site VRAs.

It turns out, what I needed here were some static, persistent routes to the remote networks, configured to use the specific interface I created for it.

Here’s how:

The steps I took are below the image.  If the image is too small, consider downloading the PDF here.

zerto_dual_nic_diagram

 

On the ZVM:

  1. Power it down, add a second vNIC, and set its network to the isolated network.  Set the primary vNIC to the production network.
  2. Power it on.  When it’s booted up, log in to Windows, and re-configure the IP address for the primary vNIC.  Reboot to make sure everything comes up successfully now that it is on the correct production network.
  3. After the reboot, edit the IP configuration of the second vNIC (the one on the isolated network).  DO NOT configure a default gateway for it.
  4. Open the Zerto Diagnostics Utility on the ZVM. You’ll find this by opening the start menu and looking for the Zerto Diagnostics Utility.  If you’re on Windows Server 2008 or 2012, you can search for it by clicking the start menu and starting to type “Zerto.”
    zerto_dual_nic_1_4
  5. Once the Zerto Diagnostics Utility loads, select “Reconfigure Zerto Virtual Manager” and click Next.
    zerto_dual_nic_1_5
  6. On the vCenter Server Connectivity screen, make any necessary changes you need to and click Next.  (Note: We’re only after changing the IP address the ZVM uses for replication and ZVM-to-ZVM communication, so in most cases, you can just click Next on this screen.)
  7. On the vCloud Director (vCD) Connectivity screen, make any necessary changes you need to and click Next. (Note: same note in step 6)
  8. On the Zerto Virtual Manager Site Details screen, make any necessary changes you need to  and click Next. (Note: same as note in step 6)
  9. On the Zerto Virtual Manager Communication screen, the only thing to change here is the “IP/Host Name Used by the Zerto User Interface.”  Change this to the IP Address of your vNIC on the isolated Network, then click Next.zerto_dual_nic_1_9
  10. Continue to accept any defaults on following screens, and after validation completes, click Finish, and your changes will be saved.
  11. Once the above step has completed, you will now need to add a persistent, static route to the Windows routing table.  This will tell the ZVM that for any traffic destined for the protected site(s), it will need to send that traffic over the vNIC that is configured for the isolated network.
  12. Use the following route statement from the Windows CLI to create those static routes:
    route ADD [Destination IP] MASK [SubnetMask] [LocalGatewayIP] IF [InterfaceNumberforIsolatedNetworkNIC] -p
    Example:
    route ADD 192.168.100.0 MASK 255.255.255.0 10.10.10.1 IF 2 -p
    route ADD 192.168.200.0 MASK 255.255.255.0 10.10.10.1 IF 2 -p
    
    Note: To find out what the interface number is for your isolated network vNIC, run route print from the Windows CLI.  It will be listed at the top of what is returned.
    

 

zerto_dual_nic_1_10

Once you’ve configured your route(s), you can test by sending pings to remote site IP addresses that you would normally not be able to see.
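If you prefer PowerShell over the legacy route.exe syntax, the NetTCPIP cmdlets (Windows Server 2012 and later) can create the same persistent routes.  The prefixes, next hop, and interface index below are just the example values from step 12; use Get-NetAdapter to find the correct index for your isolated-network vNIC:

    # Find the interface index of the isolated-network vNIC
    Get-NetAdapter | Select-Object Name, InterfaceIndex, InterfaceDescription

    # Persistent routes to the remote protected-site networks via the isolated-network gateway
    New-NetRoute -DestinationPrefix '192.168.100.0/24' -NextHop '10.10.10.1' -InterfaceIndex 2
    New-NetRoute -DestinationPrefix '192.168.200.0/24' -NextHop '10.10.10.1' -InterfaceIndex 2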

After performing all of these steps, my ZVMs are now communicating without issue and replications are all taking place.  A huge difference from hours before when everything looked like it was broken.  The next day, we were able to successfully move our VPGs from protected sites to recovery sites without issue, and reverse protect (which we’re doing for now as a failback option until we can guarantee everything is working as expected).

If this is helpful or you have any questions/suggestions, please comment, and please share! Thanks for reading!

 


Protecting a VM with vSphere Replication

Continuing on from the previous blog about configuring array-based replication with SRM, in this blog post we’ll be going through configuring protection of a VM using vSphere Replication.  The reason I’m doing this instead of jumping right into creating the protection groups and recovery plans is because vSphere Replication can function on its own without SRM.  That said, we’ll go through the steps to protect a virtual workload using vSphere Replication, and follow this up with creating protection groups and recovery plans, which come into play in either situation (ABR vs vR) when we get to the orchestration functionality that SRM brings to the table.

vSphere Replication is included with VMware Essentials Plus and above, so chances are you have this feature available to you should you decide to use it to protect VMs using hypervisor-based replication.  In my experience, vSphere Replication works great and can be used to either migrate or protect virtual workloads; however, as stated above, it can be limited.  See this previous post for the details of what vSphere Replication can and can’t do without Site Recovery Manager.

 

Procedure

In this walkthrough for protecting a VM using vSphere Replication, I will be performing the steps using a decently sized Windows VM as the asset that needs protection.  This VM is a plain installation of Windows; however, I use fsutil to generate files of different sizes to simulate data change.
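In case it helps, the fsutil usage is along these lines; it creates a file of whatever size you specify (1GB in this hypothetical example) in a path of your choosing, which is a quick way to generate change on the protected VM without installing anything:

    # Create a 1GB (1073741824 bytes) test file to simulate data change on the protected VM
    fsutil file createnew C:\Temp\change-test-1gb.bin 1073741824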

    1. In your vSphere Web Client, locate a VM that you wish to protect via hypervisor-based replication.
    2. Right-click on the VM and go to All vSphere Replication Actions > Configure Replication.how-to_vspherereplication_1_2
    3. When the wizard loads, the first screen asks for the replication type.  Select Replicate to a vCenter Server, and click Next.how-to_vspherereplication_1_3
    4. Select the Target Site and click Next.how-to_vspherereplication_1_4
    5. Select the remote vSphere Replication server (or if you only have 1, then select auto-assign), wait for validation, then click Next.how-to_vspherereplication_1_5
    6. On the target location screen, there are several options to configure, so we’ll go through them one by one:
      - Expand the settings by clicking the arrow next to the VM, or click the info link.how-to_vspherereplication_1_6_a
      - Click edit in the area labeled Target VM Location, select the target datastore and location for the recovery VM, then click OK to be returned to the previous screen.how-to_vspherereplication_1_6_b
      - Typically, the previous step would be enough, however, if you want to place VMDKs in specific datastores, edit their format (thick vs. thin provisioned), or assign a policy, use the edit links beside each hard disk.  Once all your settings are how you want them, click Next.

      how-to_vspherereplication_1_6_c

    7. Specify your replication options, then click Next.
      Notes:
      - Enable quiescing if your guest OS supports it, however, keep in mind
        that quiescing may affect your RPO times.
      - Enable network compression to reduce required bandwidth and free up
        buffer memory on the vSphere Replication server, however, higher CPU
        usage may result, so it is best to test with both options to see what
        works best in your environment.
      

      how-to_vspherereplication_1_7

    8. Configure RPO to meet customer requirements, enable point in time instances (snapshots in time as recovery points – maximum of 24) if needed, then click Next.
    9. Review your configuration summary, make changes if necessary, but when you’re done, click Finish.  As soon as you finish, a full sync will be initiated.

There you have it: configuring vSphere Replication for a VM.  The next post will cover creating protection groups and recovery plans, which we will then tie into what we’ve just performed here and into the array-based replication post.


VMware SRM 6.1 – Configure Array-Based Replication

Introduction

 

This how-to will walk through the installation and configuration of array-based replication features for VMware Site Recovery Manager 6.1.

Before configuring array-based replication for use with VMware SRM, there are some pre-requisites.  First of all, you’re going to need to visit the VMware Compatibility Guide, which will help you determine if your specific array vendor is supported for use with SRM.  Second, there are steps to take to configure array based replication on the storage side, and that portion is out-of-scope for this blog, as I did not have access to do so.

vmware_hcl_example

There are several ways to search the compatibility guide, but to be specific, you can select entries from the areas highlighted above.  The bottom section that is highlighted will be your results once you click “Update and View Results.”  The reason why I wanted to point this step out is because if you assume your array vendor is supported, and don’t verify first, you could end up wasting your time planning and designing.

For this example, we are using SRM 6.1 with the Fibre Channel protocol on IBM SVC-fronted DS8Ks in both sites.  I wanted to point that out because when I first set out to find the SRAs for use with our solution, I attempted to use the “IBM DS8000 Storage Replication Adapter”, only to find out later that it wasn’t the correct one.  The correct SRA for my environment is the “IBM Storwize Family Storage Replication Adapter”, so there may be a little bit of trial and error with this; however, if you do it up front during testing, you’ll save yourself some time later when deploying to production.

That all said, once you’ve verified your storage is supported, and what version of the SRA to download, you can get it by visiting the VMware downloads (you will need to login).  Be sure to also verify that the version of the SRA you are downloading is compatible with the version of array manager code you’re running.

 

Installing the SRA

Before you Begin – Prior to installing the SRA on the SRM server in each site (protected and recovery), you should have already paired the sites successfully.  Also, if you haven’t installed SRM yet, you will need to, otherwise the SRA installer will fail once it discovers that SRM is not installed.

Installing the SRA should be straightforward and painless, as there are not many options to configure during installation.  Once the installation is completed on both the protected and recovery SRM servers, proceed.

 

Verify That SRM Has Registered the SRAs

  1. Once you’ve installed the SRA on each site’s SRM server, log into the vSphere Web Client, and go to Site Recovery > Sites and select a site.site_recovery_sites_sra_monitor
    From this view, you can see what SRA has been installed, its status, and compatibility information.
  2. Click the rescan button to ensure the connection is valid and there are no errors.srm_sra_rescan_button

Configure Array Managers

After pairing the protected and recovery sites, you will need to configure the respective array managers so SRM can discover replicated devices, compute datastore groups, and initiate storage operations.  You typically only need to do this once, however, if array access credentials change, or you want to use a different set of arrays, you can edit the connections to update accordingly.

Pre-Requisites

  • Sites have been paired and are connected
  • SRAs have been installed at both sites and verified

Procedure

  1. In the vSphere Web Client, go to Site Recovery > Array Based Replication.srm_abr_settings_1_1
  2. On the Objects tab in the right window pane, click the icon to add an array manager.srm_abr_settings_1_2
  3. Select from one of two options for adding array managers (pair or single), then click Next.srm_abr_settings_1_3
  4. Select a pair of sites for the array manager(s), and click Next.srm_abr_settings_1_4
  5. Enter a name for the array in the Display Name field, and click Next.srm_abr_settings_1_5
  6. Provide the required information for the type of SRA you selected, and click Next.srm_abr_settings_1_6
  7. If you chose to add a pair of array managers, enter the paired array manager information, then click Next.srm_abr_settings_1_7
  8. Click-to-enable the checkbox beside the array pair you just configured, and click Next.srm_abr_settings_1_8
  9. Review your configuration, then click Finish when ready.srm_abr_settings_1_9

 

Rescan Arrays to Detect Configuration Changes

SRM performs an automatic rescan every 24 hours by default to detect any changes made to the array configurations.  It is recommended to perform a manual rescan following any changes to either site by way of reconfiguration or adding/removing devices to recompute the datastore groups.  If you need to change the default interval at which SRM performs a rescan, you can do this in the advanced settings for each site, editing the storage.minDsGroupComputationInterval advanced setting:

srm_abr_settings_1_11

To perform a manual rescan after making any configuration changes:

  1. Go to Site Recovery  > Array Based Replication
  2. Select an array for either site
  3. On the Manage tab of the selected array, click the Array Pairs sub tab
  4. Click the rescan button to perform a manual rescan.srm_abr_settings_1_10

 

Once you’ve got all of the above configured, you can begin setting up your protection groups and recovery plans.


Zerto: Perform a VPG Move (VM Migration)

In a situation where a workload needs to be migrated from a protected site to a recovery site (or site A to site B) in an effort to change where the production workload runs from, you can perform a VPG move.

From what I’ve seen of VPG Move versus Failover, the Failover option assumes that the protected site has failed, so systems may not automatically be cleaned up on the protected site.  When performing a Move, the protected site is cleaned up as soon as that move is completed and committed, unless you select to re-protect the workload in the other direction (the commit can be automatic or manual; the maximum time you have to do it is 24 hours, and that is configurable).

One recommendation I have here is that before you perform these steps, perform a recovery test on the VPG you’d like to move to ensure that recovery steps are completed as expected, and that the system is usable at least in a testing capacity.

  1. Log in to the Zerto UI
  2. From the dashboard screen, go to Actions > Move VPG.zerto_perform_vpg_move_1_2
  3. Select (tick the checkbox) for the VPG you want to move, and click Next.zerto_perform_vpg_move_1_3
  4. Select your options for the Execution Parameters, and click Next.  For this example, I will select “none” for the commit policy, to demonstrate where to commit the migration task when you are ready to.zerto_perform_vpg_move_1_4
    > Commit Policy: Auto-Commit - you can delay up to 24 hours (specified in minutes), or select 0 
    to automatically commit immediately when the migration process is completed.
    > Commit Policy: Auto-Rollback - You can delay up to 24 hours (specified in minutes), default 
    delay is 10 minutes
    > Commit Policy: None - You must manually select whether or not to commit or rollback, based 
    on your results.
    > Force Shutdown - Use this in the event VMware Tools isn't running and an automatic graceful 
    shutdown isn't possible. Force shutdown will first attempt to gracefully shut the VM down, and if that doesn't work, 
    it will power off the VM on the protected site.
    > Reverse Protection - This will automatically sync changes from the recovery site back to the 
    protected site in case you want to be able to re-protect a system after a migration. This eliminates the need 
    to have to re-initialize synchronization in the other direction. If reverse protection is selected, a delta 
    sync will take place to re-protect after the migration is completed. Caveat - You cannot 
    re-protect if you select "NONE" as the commit policy.
    > Boot Order -(Defined in VPG Configuration, but displayed here)
    > Scripts - (Defined in VPG configuration, but displayed here)
    
  5. Review the summary, and when ready, click Start Move.
    During promotion of data, you cannot move a VM to another host.  If the host is rebooted
    during promotion, make sure the VRA on the host is running and communicating with the ZVM before 
    starting up the recovered VMs.

    zerto_perform_vpg_move_1_5

  6. Since we have selected a commit policy of “none”, once the migration is ready for completion, the Zerto UI will alert you letting you know there is a task awaiting input.  Click on the area highlighted below.zerto_perform_vpg_move_1_6_a

    Select to either Commit (checkmark) or Rollback (undo arrow):

    zerto_perform_vpg_move_1_6_b

  7. At this point, you can also choose whether or not to reverse-protect.  Make your selection and click Commit.zerto_perform_vpg_move_1_7_a

    The task will update as seen below:

    zerto_perform_vpg_move_1_7_b

    Once you commit the move, the data in the protected site is then deleted, thus completing the migration.


Zerto: Create a Virtual Protection Group (VPG)

This blog is the next step following the creation/deployment of the VRAs.

To begin protecting virtual machines, you will need to configure virtual protection groups (VPGs).  A virtual protection group is an affinity grouping of VMs that make up an application.  VPGs can contain one or more virtual machines, and contain all the required protection settings, which include:

  • Boot Order
  • re-IP settings for testing and recovery
  • Resource mappings
  • Offsite backup
  • Journaling
  • Re-protection settings

Once a VPG is configured, initial synchronization of the protected virtual machines takes place, and once synced, they are continuously protected.

Important:

When performing failover, ALL VMs in the VPG will be failed over, and you are not able to select 
specific VMs within the group to be recovered.

Tips

  • For granular protection and failover capabilities, VPGs can be set up containing single VMs, if your migration/failover plan requires being able to pick and choose systems to recover in an order you specify, when not all involved VMs need to be migrated or failed over.
  • Do not group ALL virtual machines into 1 VPG, as performing a recovery will attempt to recover everything contained within the VPG and in some cases, that’s not the best idea.
  • Whenever possible, group servers that depend on each other or make up an application together. This will allow you to make use of boot options, order, or delay to bring them up in the correct order. This will also prevent missing crucial application servers during recovery or migration.
  • Make use of the test feature for DR testing by setting up an isolated VLAN/portgroup which will allow live testing without impacting production.
  • Make use of the re-IP feature to automate any IP address change that needs to happen either on the test network or recovery network.

VPG Creation

  1. Log in to the Zerto UI
  2. Go to the VPGs tab, and click New VPG.create_vpg_1_2
  3. Specify a name for the VPG and set the priority, then click Next.
    In VPGs with different priorities, updates for the VPG(s) with the highest 
    priorities are transferred over the WAN before others.
    
    

    create_vpg_1_3

  4. Select the VM(s) you want to include in this VPG, press the right-arrow to move to selected VMs, then click Next.
    Using the search box in the "Available VMs" window will help you minimize the 
    number of VMs listed and focus only on the one(s) you're looking for.
    Zerto uses the SCSI protocol, so only VMs with disks that are configured/support 
    SCSI can be selected to be part of a VPG.
    
    

    create_vpg_1_4_a

    create_vpg_1_4_b

  5. Specify the recovery site and values to use for replication to the site, then click Next.create_vpg_1_5
  6. Specify the storage requirements for this VM and click Next.
    If you have pre-seeded the volumes, check the box beside the disks 
    and click the Edit Selected link.  Select Preseeded Volume, then browse to the VMDK 
    for that volume.  Repeat for any additional disks that you have pre-seeded.  This 
    is recommended if your VM is large, and has a high rate of change, or the WAN link 
    is shared and bandwidth is limited.

    create_vpg_1_6

  7. Specify the failover/move network (the network that the recovered VM will run on), the recovery folder, and any scripts, then click Next.
    Failover Test Network is optional, but recommended if you will be testing 
    failover prior to committing.
    

    create_vpg_1_7

  8. Enter the NIC details to use for the recovered VM, and click Next.
    In some cases, if you're replicating within the same vCenter or cluster, you 
    may end up with a duplicate MAC address warning when recovering, so to avoid this, you 
    can create a new MAC address on the recovery VM during recovery.  In any case, you 
    can also re-IP the VMs as part of the recovery procedure.  To view these 
    settings, check the box beside the VM(s) and click the Edit Selected link.

    create_vpg_1_8

  9. Select whether or not you want to create an offsite backup that can be stored for up to a year, then click Next.  If you don’t need to create a backup, leave this screen at the defaults, then click Next.
    For more information on backups with Zerto, refer to the help file 
    (click the ? button at the top right of this window), or see the Zerto Virtual 
    Manager Administration Guide.

    create_vpg_1_9

  10. Review the VPG settings summary, and if you don’t need to go back and make any changes, click Done.create_vpg_1_10

 


Zerto: Deploy Virtual Replication Appliances

If you’ve followed along with Zerto: ZVM Installation, this entry is a continuation and provides the steps for deploying the Zerto Virtual Replication Appliances (VRAs).

After installation has succeeded, open a browser, and connect to https://ZVMFQDN:9669/zvm.
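Before opening the browser, a quick way to confirm that the ZVM is listening and that any firewalls in the path allow the traffic is to test TCP 9669 from your workstation (the hostname below is a placeholder):

    # Confirm TCP 9669 to the ZVM is reachable from your local network
    Test-NetConnection -ComputerName zvm.example.com -Port 9669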

Notes:

  • If this VM lives in a protected network for management/utility servers, you might need to allow port 9669 from your local network to the network the ZVM lives in.  The Zerto Standalone UI, vCenter Web Client, and vCenter C# client all use port 9669 to access the ZVM.
  • Be sure to use a supported browser.  Chrome, Firefox, and IE 11+ are recommended by Zerto.
  1. Log on using your vCenter credentials.zerto_vra_deploy_1_1
  2. Enter a license key and click Start.

After entering the license key and clicking start, you’re taken to the dashboard; however, before you can start protecting VMs, the VRAs need to be installed on the hosts in each site, and the protected and recovery sites need to be paired.

Install the VRAs

The Zerto installation includes the OVF template for VRAs.  A VRA must be installed on every host that manages protected VMs in the protected site, and on every host that will manage VMs in the recovery site.

The VRA compresses data that is passed across the WAN from the protected to recovery site, and automatically adjusts the compression level according to the CPU usage, totally disabling it if required.

A VRA can manage a maximum of 1500 volumes, whether they are protected or not.

VRA Requirements

Each VRA must have:

  • 12.5GB datastore space
  • at least 1GB of reserved memory
  • Each host being installed to must be running at least ESX/ESXi 4.0 U1 and have ports 22 and 443 open for the duration of the installation.

If you are installing to ESXi 5.5 or higher, the VRA can connect to the host with user credentials; because the VRA is installed via a VIB on these versions, it is not necessary to enter the root password.  On older versions, the password for the host root account is required.

During VRA deployment, you should have IP addresses reserved, as it is not recommended to use DHCP; so be sure to also have the information for the subnet mask, and default gateway.

If you do not have SSH enabled on your hosts, the ZVM will attempt to enable and disable it during the installation of the VRA.

Important: Do not snapshot a VRA, as it will cause problems with replication!  I actually
forgot to exclude the VRAs from backups, and CommVault attempted to back them up after I had
configured my first VPG, and I ended up having to re-deploy the VRAs.  My advice is to create a
folder for the VRAs in your vCenter folder structure and have that folder excluded from backups
altogether.  Don't forget to move the VRAs into the folder as soon as they're deployed.

Installation

  1. Log in to the Zerto Manager UI
  2. Click on the Setup tab.zerto_vra_deploy_2_2
  3. Locate the host you want to deploy the VRA to, and check the box beside it.  Once you have selected the host, click New VRA.
    Note:  If you select multiple hosts, clicking the New VRA link
    will only install on the first host that you have selected.

    zerto_vra_deploy_2_3

  4. Specify the host, datastore, network, RAM, group, and enter the network details, then click Install.  Repeat the steps for each additional VRA you need to deploy (one per host).
    Note: When you deploy a VRA, Zerto will automatically reserve the amount of
    memory equal to what you specify in the VRA RAM settings.  This amount of RAM is the maximum buffer
    size for the VRA that is used to buffer IOs written by the protected virtual machines before the
    writes are sent over the network to the recovery VRA.  The recovery VRA also buffers incoming IOs
    until they are written to the journal.  If a buffer becomes full, a Bitmap Sync is performed after
    space is freed up in the buffer.
    The protecting VRA can use up to 90% of its buffer for IOs to send to the recovery VRA, which can
    use up to 75% of its buffer before it is full and requires a bitmap sync.  (A quick worked example of
    this buffer math follows after the steps.)

    zerto_vra_deploy_2_4

  5. After all VRA installations are completed, the setup tab will contain more information for each host that has a VRA installed.zerto_vra_deploy_2_5
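To put the buffer note from step 4 into perspective, here is the rough math for a hypothetical VRA configured with 3GB of RAM (3GB is just an example value; the 90% and 75% figures are the ones quoted in the note above):

    # Example: approximate buffer thresholds for a VRA with 3GB of reserved RAM
    $vraRamGB   = 3
    $sendBuffer = $vraRamGB * 0.90   # protecting VRA: up to ~90% of the buffer holds IOs waiting to be sent
    $recvBuffer = $vraRamGB * 0.75   # recovery VRA: ~75% full means a bitmap sync will be required
    '{0:N2} GB usable send buffer / {1:N2} GB usable receive buffer' -f $sendBuffer, $recvBuffer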

Once you’ve completed these steps for each host requiring a VRA, you can create Virtual Protection Groups and start protecting your workloads.
