Rapid Recovery, On Standby

In this blog post, I’ll approach some issues pertaining to Data Protection | Rapid Recovery virtual standby machines. It is a subject that requires some groundwork to be properly approached. So, without further ado, let’s begin!

Rapid Recovery is a solution that backs up volumes at the block level, meaning the backup results are volume images. These images can be mounted as physical disks for file restore on the Core (backup server) or on any other computer when using the Local Mount Utility application for Rapid Recovery.

Recovering full volumes is more interesting. Data volumes can be restored to spare volumes or raw disks attached to any machine that has the Rapid Recovery agent installed. If the system drive of a crashed protected server needs to be recovered, a bare-metal recovery (BMR) operation is needed. The machine first boots from a boot disk, which behaves like an already protected agent, and the system drive is then restored in the same way as a collection of data volumes.

To allow the restored machine to boot from the newly created system drive, some changes need to be made at both the driver and partition levels. In most cases, the system drive contains several volumes needed for the boot-up process. For clarification, a modern Windows server system disk hosts at least three dedicated partitions: the EFI System Partition, used by the Unified Extensible Firmware Interface (UEFI), which is the BIOS replacement; System Reserved, which contains essential boot information; and the System Volume, aka the C: drive. Considering that a Windows disk using a GUID partition table (GPT) also contains a hidden partition (the Microsoft Reserved Partition), you can see that system drives are pretty complex, and restoring such a drive, especially on hardware different from the original, may be a complex task as well.

Now, what happens if you need to have a server running, but it crashes due to faulty hardware and you don’t have a spare physical server ready for deployment (but you could squeeze in some storage and computing capacity)? That is simple: Export the desired recovery point to a virtual machine (VM). In practice, the process is similar to a BMR or even a data volume restore. The API for the hypervisor creates the VM, Rapid Recovery restores the system volume and any data volumes that may be needed, and then the process is finished by modifying the newly created virtual system drive to allow the VM to boot. Rapid Recovery makes the process simpler by supporting quite a few hypervisors: VMware ESX(i)/vCenter, VMware Workstation, Microsoft Hyper-V and Oracle VirtualBox.

The next logical step is to have a few critical protected servers exported in such a way that, if they fail, a VM (already created as shown above) is prepared to replace them. That is a virtual standby. To accomplish this, a new export is performed after each backup. If the newest backup snapshot was an incremental, meaning that only the data that differs from the previous backup was transferred over the wire, and this new backup is merged with the virtual standby, then the export process will not be resource intensive. If all goes well, the virtual standby is in sync with the newest backup.
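
To get a feel for why an incremental export is so much lighter than a full one, here is a rough back-of-the-envelope sketch. All figures (volume size, amount of changed data, sustained throughput) are assumptions chosen for illustration, not Rapid Recovery measurements, and `export_hours` is a hypothetical helper.

```python
def export_hours(data_gb: float, throughput_mb_s: float) -> float:
    """Hours needed to move data_gb gigabytes at throughput_mb_s MB per second."""
    seconds = (data_gb * 1024) / throughput_mb_s
    return seconds / 3600


# Hypothetical figures, for illustration only.
FULL_VOLUME_GB = 8 * 1024   # a full export of an 8TB volume
INCREMENTAL_GB = 20         # data changed since the previous backup
THROUGHPUT_MB_S = 100       # assumed sustained export rate

full = export_hours(FULL_VOLUME_GB, THROUGHPUT_MB_S)
incremental = export_hours(INCREMENTAL_GB, THROUGHPUT_MB_S)

print(f"full export:        {full:.1f} hours")
print(f"incremental export: {incremental * 60:.1f} minutes")
```

Under these assumptions, the full export takes roughly a day, while the incremental one finishes in a few minutes. Real numbers depend on the hypervisor, the storage and the network in between.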

It goes without saying that this is a generous, almost seductive concept. Paired with the fact that virtual standby machines can be spun up at the disaster recovery (DR) site (where data is replicated), it means that theoretically, in case of catastrophic failure, it’s possible to recreate the whole local area network at the DR site (or even in the cloud) and have users work remotely with minimal interruptions. Little wonder that virtual standbys are a cornerstone of every DR plan and, I say it based on facts, literally haunt the dreams of many backup administrators.

However, reality is what it is, and making it work in our favor requires a lot of planning ahead and sound resource management.

The first statement I feel bound to make before crossing the separation between fact and fiction in the realm of data recovery is that, indeed, virtual standbys are essential to any DR plan. The second (and last) statement is that virtual standby priorities are different from backup priorities. What is critical from a backup standpoint may be irrelevant from a virtual standby standpoint.

To understand why, let’s walk through some real-life situations.

The subject is an engineering company that hosts its essential data on an 8TB volume. This volume is hosted on a dedicated high-end storage area network (SAN) attached to a server running Windows Server 2012 R2. Losing that data means that the company would go out of business. There is no doubt that backing up this server and checking its recovery points is a mandatory operation, and that replicating this data either to a DR location or to the cloud should also be mandatory. Archiving the data at regular intervals is a good option, too. This costs a lot of money, as storage and high-bandwidth WAN pipes do not come cheap, but there’s no way around it. Should this machine be paired with a virtual standby, too? Also consider that there are 20–30 more servers of various sizes that need to be backed up as well.

Let’s take a look at several scenarios:

  1. Because a VM with an 8TB volume in standby mode needs a lot of storage space that can’t be used for anything else, there isn’t enough storage space left for other virtual standbys (domain controllers, DNS, DHCP). In addition, Hyper-V 2012 R2 or VMware 5.5 or later is needed to export volumes over 2TB. When disaster strikes, it turns out that the infrastructure servers are actually the essential machines. Thanks to the virtual standby, the machine with the engineering data is up and running in minutes, but no user can connect, and no workstation communicates with anything on the network. The best decision would have been to have virtual standbys ready for the infrastructure servers. At most, only the system drive of the engineering data machine should have been exported in standby mode. To minimize the downtime, the 8TB volume could have been restored using the Live Recovery feature of Rapid Recovery, which gives access to the data while the recovery operation is still running.
  2. The engineering server fails due to an unexpected hardware failure. The rest of the network is intact. The virtual standby is spun up. It turns out that the SAN hosting the data is OK and has current data, as opposed to the 8TB exported volume, which is about one hour behind. As such, it’s decided to attach the SAN to the newly spun-up virtual standby machine.
  3. The engineering server degrades slowly due to file system corruption. When a reboot is attempted, the server fails to boot. The virtual standby, which is about an hour behind the protected machine, fails to boot as well. The solution is to find a recovery point that hosts a bootable system disk image and attach the SAN or the newest working data image to it.
  4. The virtual standby export job fails due to a hypervisor issue. As a result, the next export is a full export (merging the incremental backup snapshot is not possible, so a full export has to be performed). For an 8TB volume, this may take a very long time, sometimes even a couple of days until the exports are back in sync. If disaster strikes, the virtual standby will be either nonfunctional or out of date. In both cases, exporting a recovery point will solve the issue but introduce delays.
  5. Despite all technical constraints, several virtual standbys are ready to get into action. Unfortunately, nobody took into consideration that virtual exports mean that new (virtual) hardware is created. This means that the MAC addresses of the virtual network interface cards (NICs) don’t match those of the original hardware, so the static IP settings of the original adapters are not applied, and the IP addresses of the virtual standbys are assigned via DHCP. If the DHCP server is not functional, no network connectivity is available. If it is, the correct IP addresses need to be assigned manually and, in the heat of the moment, errors are made. And that isn’t all. Due to the ARP cache, no machine on the network will connect to the newly powered-up servers until the ARP cache on each machine on the network is cleared or expires. There’s also the fact that the VM MAC addresses are only partially changeable. For instance, manually assigned VMware MAC addresses have the prefix “00:50:56”, and only the last 3 bytes can be changed. In this case, the system admin was aware of the issue, wrote down the MAC addresses of the virtual standby VMs and devised an ingenious DHCP operation that allowed him to import the information and convert it into IP reservations of the correct static IPs at the time the virtual standbys were turned on. Unfortunately, at the time of disaster, the DHCP server was not running. A better solution is to use a script that assigns IP addresses at boot time and to store it on the live machine, so that it is carried over to the virtual standby with each export. This way, all virtual standbys will boot up with the correct IP addresses. An example of such a script is shown in this knowledge base article.
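
To illustrate the boot-time approach from scenario 5, here is a minimal sketch. It is not the script from the knowledge base article. The host names, addresses, interface name and helper functions are all hypothetical placeholders. The idea is to key the settings on the host name, which survives the export, rather than on the MAC address, which does not. The sketch builds a standard Windows `netsh interface ipv4 set address` command and, by default, only returns it without running it.

```python
import socket
import subprocess

# Hypothetical mapping from host name to the static settings the server
# had before the failure. Replace with real values for your network.
STATIC_CONFIG = {
    "DC01": {"address": "192.168.1.10", "mask": "255.255.255.0", "gateway": "192.168.1.1"},
    "EXCH01": {"address": "192.168.1.20", "mask": "255.255.255.0", "gateway": "192.168.1.1"},
}


def build_netsh_command(interface: str, settings: dict) -> list:
    """Build the netsh command that sets a static IPv4 address on an interface."""
    return [
        "netsh", "interface", "ipv4", "set", "address",
        f"name={interface}",
        "source=static",
        f"address={settings['address']}",
        f"mask={settings['mask']}",
        f"gateway={settings['gateway']}",
    ]


def apply_static_ip(hostname: str, interface: str = "Ethernet", dry_run: bool = True):
    """Look up the host's settings and apply them; return the command, or None if unknown."""
    settings = STATIC_CONFIG.get(hostname.upper())
    if settings is None:
        return None  # unknown host: leave the DHCP configuration untouched
    command = build_netsh_command(interface, settings)
    if not dry_run:
        # Requires Windows and administrative rights.
        subprocess.run(command, check=True)
    return command


if __name__ == "__main__":
    # Dry run by default: prints the command that would be executed.
    print(apply_static_ip(socket.gethostname()))
```

Scheduled to run at startup with `dry_run=False`, a script along these lines would not depend on a working DHCP server. The interface name on the virtual NIC may differ from the one assumed here and should be checked on a test export first.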

The examples above have a common denominator. From a backup perspective, critical machines are the ones that contain valuable data. From a virtual standby perspective, the critical machines are those that keep the network infrastructure running. Consider a more complex example: creating a virtual standby for an Exchange server. It’s evident that such a machine is useless without an operational domain controller and Active Directory DNS. Creating virtual standbys of large machines can be counterproductive if it leaves no resources for the infrastructure servers. In the end, it’s the job of each organization to prepare the best DR strategy specific to its goals. Rapid Recovery offers a lot of flexibility in choosing the best combination of features that allow balancing both performance and costs. Among these, if used properly, virtual standbys have a place that should not be underestimated.
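
The dependency argument above can be made explicit with a small sketch that derives a safe power-on order for virtual standbys. The server names and the dependency map are hypothetical. The ordering itself comes from Python's standard `graphlib.TopologicalSorter` (Python 3.9 or later), which lists every machine after the machines it depends on.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each server lists the servers that must be
# running before it is useful.
DEPENDS_ON = {
    "domain-controller": set(),
    "dns": {"domain-controller"},
    "dhcp": {"domain-controller"},
    "exchange": {"domain-controller", "dns"},
    "engineering-data": {"domain-controller", "dns"},
}


def startup_order(depends_on: dict) -> list:
    """Return the servers in an order where dependencies always come first."""
    return list(TopologicalSorter(depends_on).static_order())


order = startup_order(DEPENDS_ON)
for position, server in enumerate(order, start=1):
    print(position, server)
```

The servers that end up at the top of such a list are the first candidates for a virtual standby, regardless of how little data they hold.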

For more information about the technical aspects of virtual standbys, read the Rapid Recovery User Guide.