How a Poorly-Designed Architecture for ESXi Will Undermine Replication

Given VMware’s stated direction of adopting the console-less ESXi architecture, many organizations are assessing the impact of an ESXi-based architecture on their data protection strategies, including backup, replication and recovery. When evaluating a data protection vendor, it is critically important to understand the architecture that is used to support the ESXi framework. While a few vendors support ESXi as a check box item, a poorly conceived architecture can result in limited scalability, significantly longer backup windows, and data vulnerable to loss and theft in the replication process.

Characteristics of a Poorly-Designed Replication Product for ESXi

As an example, the figure below shows the architecture of one poorly-designed replication product on the market today. This particular product uses its main server as the single shared engine for backup, replication and recovery. To operate, the product opens one session between the server and a VM on ESXi that is to be protected. It then uses the VMware vStorage API (appropriately) to mount the snapshots of that VM on the server engine itself. Finally, the product opens a corresponding new session, also using the vStorage API (inappropriately), between the server and the remote site to transfer the data.

Figure 1. Single-Threaded Protection Server Limits Usability, Scalability, and Performance, and Increases Network Costs

This limited architecture suffers from the problems detailed below:

  1. Limited Usability: Because the main server is used for every job, it cannot easily be scaled to run jobs simultaneously. This means that backup, restore and replication jobs are hard to run at the same time. In this case, the vendor advises customers to "get creative" in their job scheduling, such as by reducing the frequency of replication or the number of VMs being replicated. The vendor also points out that customers can deploy additional servers and engines to support more simultaneous jobs. However, none of these creative options really satisfies the primary requirement, which is the flexibility to schedule backup, recovery and replication when they are needed.
  2. Single Point of Failure: Because the backup engine is also used for replication, the engine itself must be available all the time. Any time the engine is overloaded, or the server has a hardware or software problem, neither the backup nor the replication process can run.
  3. Severe Performance Constraints: Because the entire data stream for replication must pass through the main server, replication performance is affected by factors such as server capacity and the memory and CPU available to each job. In particular, when customers must run intermittent backup and restore jobs along with replication on the same server, they must account for that additional peak load by deploying a server with much higher capacity than they otherwise would. This trade-off between performance and cost is a direct result of the poor architecture of the solution.
  4. Limited Practical Scalability: In tests of this particular replication product, CPU utilization on the server hit as high as 80% for a single replication job. Also, the limited scalability of the job engine forced replication of the VMs in the environment to take much longer than should have been necessary, delaying the time at which the backup jobs could begin. Scaling the product means adding more main servers - which increases management overhead and costs.
  5. Supportability on WAN Connections: This product uses the vStorage API to connect the remote storage volumes with the replication engine (using a technology that is also known as “hot-add”). The API is designed by VMware for use over local networks, which differ from wide-area networks (WANs) in characteristics such as speed, latency and packet loss. Because the vStorage API is designed for relatively error-free local environments and not for WAN links, its behavior in high-latency, high-packet-loss environments is suspect. When the API is extended in such a way, the results are unpredictable. Certainly, network utilization is not optimized, causing more cost and disruption on shared networks than is necessary.
  6. Lack of Compression: Compressed data is much preferred over uncompressed data when replicating, because it reduces the amount of network bandwidth needed to send the data. This solution does not support compression, which further limits its performance and scalability.
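
To make the compression point in item 6 concrete, the following minimal Python sketch compares the size and idealized transfer time of a block of data before and after compression. It is an illustration only: the all-zero 1 MiB block, the 10 Mbit/s link speed, and the helper names are assumptions chosen for the example, not details of any product discussed here. Real virtual disk data will compress far less than a block of zeros, and the time estimate ignores latency, packet loss and protocol overhead.

```python
import zlib


def compressed_size(payload: bytes, level: int = 6) -> int:
    """Return the number of bytes zlib produces for the payload."""
    return len(zlib.compress(payload, level))


def transfer_seconds(num_bytes: int, link_bits_per_second: float) -> float:
    """Idealized transfer time; ignores latency, loss and protocol overhead."""
    return num_bytes * 8 / link_bits_per_second


# Hypothetical 1 MiB block of zeros, standing in for unused virtual disk space.
block = bytes(1024 * 1024)
raw_size = len(block)
packed_size = compressed_size(block)

link = 10_000_000  # assumed 10 Mbit/s WAN link

print(f"raw:        {raw_size} bytes, {transfer_seconds(raw_size, link):.2f} s")
print(f"compressed: {packed_size} bytes, {transfer_seconds(packed_size, link):.4f} s")
```

The savings depend entirely on how compressible the replicated data is, so the ratio printed here should be read as a best case rather than a typical result.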

Vizioncore's Alternative: A Break-Through Virtual Appliance (VA)-Based Solution

Vizioncore is working on a new, ground-breaking architecture to avoid the pitfalls of the limited, single-threaded architecture which other vendors offer. Vizioncore's innovative architecture will take advantage of virtual appliance (VA) technology to back up, replicate and restore virtual machines on ESXi server platforms. The appliance is being designed to transfer data directly, similar to vRanger Pro’s unique Direct-to-Target architecture. Using the advanced capabilities of the VMware vStorage API, the VA will natively access snapshots, enabling fast and efficient data transfer.

The VA-based architecture separates the management of protection jobs from the job engine itself. Management is still performed through a single engine server. What is different is that the job engine is located in one or more virtual machine appliances distributed throughout the environment. This distributed but centrally managed architecture offers inherent improvements in scalability, reliability, and manageability as compared to the single-threaded alternative.

Scaling this type of architecture is simple: deploy as many virtual appliances as are needed to support the required backup, replication and recovery jobs, without sacrificing manageability or forcing over-provisioning of any single system. Because the vRanger server would not take part in the actual backup or replication operations, this solution distributes the workload while still offering the benefits of centralized management. Because the workload is distributed, the central backup server would not be a single point of failure. This will help make the solution much more reliable than any that are available today.
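
The scaling argument can be illustrated with a small Python sketch that estimates how long a fixed set of protection jobs takes when spread across one, two or three job engines. This is a simplified model, not a description of how vRanger or any other product schedules work: the job durations are hypothetical, the `makespan` function is a name invented for this example, and the greedy rule (each job goes to whichever engine becomes free first) is only one possible scheduling policy.

```python
import heapq


def makespan(job_minutes, engines):
    """Total elapsed minutes when each job goes to the first engine to free up."""
    if engines < 1:
        raise ValueError("at least one engine is required")
    free_at = [0.0] * engines  # min-heap of the time at which each engine is idle
    heapq.heapify(free_at)
    for minutes in job_minutes:
        start = heapq.heappop(free_at)        # earliest available engine
        heapq.heappush(free_at, start + minutes)
    return max(free_at)                       # time the last engine finishes


# Hypothetical backup and replication job durations, in minutes.
jobs = [30, 45, 20, 60, 15, 30]

for n in (1, 2, 3):
    print(f"{n} engine(s): {makespan(jobs, n)} minutes")
```

With a single engine the jobs simply run back to back, which corresponds to the single-server design described earlier; each added engine shortens the overall window until the longest individual job becomes the limit.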