How Rapid Recovery handles replication failures

Customers often ask Support how Rapid Recovery handles replication failures and cancellations, and how much data will be re-transmitted when a failure occurs.

Terminology

To understand how Rapid Recovery handles replication failures and cancellations, we first need to understand some key vocabulary.

Volume Image: A single backup of a single storage volume. It can be a base image (the first backup of that volume in a chain) or an incremental backup (capturing only the changes since the prior backup).

Volume Image Chain: A contiguous group of backups of a single volume, spread across multiple recovery points. Each volume image chain starts with one base image of that volume, and includes all incremental backups that rely on that base image. When a new base image is taken, a new chain is started. For example, all of the volume images of the C volume in a set of recovery points make up the volume image chain of C, assuming there is only one base image. 

Recovery Point: A group of volume images that were created at the same time as part of the same backup job.

How does Rapid Recovery replication handle failures and cancellations?

Rapid Recovery handles replication failures and cancellations identically. Regardless of whether you manually cancel a replication job or the job fails, the behavior of Rapid Recovery is the same.

Rapid Recovery replication performs almost every task at the volume image level. Replication starts on the source Core with the oldest recovery point. One volume image is replicated at a time. As each volume image for that recovery point is successfully replicated, the data is fully committed to the target repository, and becomes available for restore on the target Core. If there is a failure in a replication job after only one of the volume images in a recovery point is successfully replicated, that volume image will not have to be transmitted again. The recovery point appears as a partial recovery point on the target Core (only displaying the successfully replicated volume images) and will become a complete recovery point as soon as the replication job for that agent runs again and finishes transmitting the rest of the volume images for that recovery point.

As replication transmits data from the source Core to the target Core, the data is written to the repository staging area (a logical holding area on the target Core for volume image data that has not yet been committed). Until the entire volume image is replicated and committed to the target repository, the data written as part of the replication job remains in the staging area of the target and is considered "uncommitted" to the repository. Once the full data for the volume image successfully transmits, the data is marked as committed and becomes visible as a volume image in a recovery point. If replication does not complete (due to a canceled job or a failure), all data that has been written to the staging area remains there until it is cleared, or until replication for that agent runs again and is able to complete the transfer of the volume image.

The staging area is cleared any time a repository is mounted. This occurs when the Core service starts or when a repository check runs. When the staging area is cleared, all progress from failed or canceled replication jobs is lost.

When a replication job for an agent runs after a failure or cancellation, the source Core matches up the volume images that exist on the source and target Cores and creates a list of the volume images that still need to be replicated. After it has done this, it displays a size for the replication task. If the previous replication failed without transferring a full volume image, the size of the replication job will be the same as the previous replication job. If the previous replication job was able to replicate at least one volume image, then the new replication job will be the size of the original job minus the size of the volume images that were replicated (since it does not have to replicate already saved volume images).

After identifying which volume images need to be replicated, the source Core validates any data in the target Core repository staging area. Once it has validated the already replicated data, it resumes transmitting new data. In the UI, the validation process looks the same as replication of new data: the replication job runs quickly at the start (while it validates already transmitted data), and then slows down once it begins to replicate new data. Data validation requires communication between the two Cores, so as the Cores communicate and validate data, you can see bandwidth being used between the source and target Cores.
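
To pull the behavior described in the preceding paragraphs together, the following sketch (plain Python, for illustration only; the names and structure are not actual Rapid Recovery code) models recovery points replicating oldest first, one volume image at a time, with partial data held in a staging area and each image committed, and skipped on later runs, only once it has fully arrived:

    class TargetCore:
        def __init__(self):
            self.committed = set()   # volume images fully replicated and restorable
            self.staging = {}        # volume image -> GB received but not yet committed

        def clear_staging(self):
            # Happens when the repository is mounted (for example, on Core service
            # start or during a repository check); all partial progress is lost.
            self.staging.clear()

    def replicate(images_oldest_first, sizes_gb, target, gb_available):
        # Transfers volume images until gb_available runs out, simulating a job
        # that fails or is canceled partway through. Returns True when every image
        # in the list is committed on the target.
        for image in images_oldest_first:
            if image in target.committed:
                continue                                # never re-transmitted
            staged = target.staging.get(image, 0)       # validated, not re-sent
            needed = sizes_gb[image] - staged
            sent = min(needed, gb_available)
            gb_available -= sent
            target.staging[image] = staged + sent
            if sent < needed:
                return False                            # partial image stays staged
            target.committed.add(image)                 # image now restorable
            del target.staging[image]
        return True

    # Hypothetical sizes: a 50 GB C image and a 175 GB D image in one recovery point.
    target = TargetCore()
    sizes = {"C": 50, "D": 175}
    replicate(["C", "D"], sizes, target, gb_available=175)   # fails with 125 GB of D staged
    replicate(["C", "D"], sizes, target, gb_available=500)   # skips C, finishes D

If clear_staging() runs between the two jobs (a Core service restart, for example), the second job has to transmit all of the D image again.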

Example

Consider as an example a protected server with two volumes, C and D. The C volume has 50 GB in use, and the D volume has 500 GB in use.

  • When this server is protected, the base image backup shows a size of 550 GB (because Rapid Recovery shows all recovery points using their raw data size). If this server's data achieves roughly 60% compression savings overall, Rapid Recovery actually consumes only about 225 GB of space in the repository.
  • When you establish replication for this server, the base image replication job shows a size of 550 GB, even though the compressed recovery point actually requires only 225 GB of repository space.
  • Assume this server uses a 20 Mbps network connection between the source and target Cores. When the job starts, it uses all of your bandwidth to transfer data (assuming there is no data on the target Core). At 20 Mbps, the Core should show a replication speed of about 2.5 MB/s, and the total time to replicate the job works out to approximately 25.6 hours:

225 GB * 1024 = 230,400 MB

230,400 MB * 8 = 1,843,200 Mbits

1,843,200 Mbits / 20 Mbps = 92,160 seconds

92,160 seconds / 60 / 60 = 25.6 hours

NOTE: Pay attention to the difference in units. Bandwidth is generally measured in bits per second (lowercase "b"), while Rapid Recovery reports transfer speeds in bytes per second (uppercase "B").
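
For anyone who prefers to script the conversion, a short Python helper (illustrative only, not part of the product) reproduces the same arithmetic, including the bytes-to-bits conversion described in the note above:

    def replication_hours(data_gb, link_mbps):
        megabytes = data_gb * 1024      # GB to MB
        megabits = megabytes * 8        # bytes (B) to bits (b)
        seconds = megabits / link_mbps
        return seconds / 3600

    print(replication_hours(225, 20))   # prints 25.6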

To complete in 25.6 hours, the network connection must remain stable for the entire transfer, without becoming saturated or disconnecting.

If you experience a network drop after 20 hours (resulting in a failed replication job), then based on 20 Mbps of bandwidth, approximately 175 GB of data has transferred. The repository on the target Core shows a recovery point containing a C volume image. The C volume, which held only 50 GB of raw data, completed replication and committed successfully. However, you will not see a D volume image: the replication job failed to complete, and the partially transferred data for the D volume remains in the staging area.

Assuming no event clears the staging area, when the replication job runs again, it resumes from where it left off. In the Core UI, that job shows a size of 500 GB, since the 50 GB C volume does not need to be replicated. However, in terms of actual data to transmit, approximately 125 GB of the remaining 175 GB of compressed data for the D volume has already transferred to the staging area. The replication speed will vary, running faster while the already transmitted data on the target Core is validated, and then slower as the transfer resumes sending new data. Based on the initial calculation of 25.6 hours, and considering that the initial job ran for 20 hours, this second replication job should take approximately 5.6 hours, plus a small amount of additional time to validate previously transferred data.
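
The arithmetic behind those figures, using the same assumptions as the original calculation:

20 hours * 3,600 = 72,000 seconds of transfer at 2.5 MB/s = 180,000 MB, or approximately 175 GB already transmitted

225 GB - 175 GB = approximately 50 GB of compressed data left to send (the untransmitted portion of the D volume)

25.6 hours - 20 hours = approximately 5.6 hours of transfer time remaining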

Factors affecting resumption of a failed or canceled replication job

The two most common factors that lengthen the process of resuming replication after a failure or cancellation are the time it takes to validate the data in the staging area, and the staging area being cleared after a failed or canceled replication.

Validation time is a function of job size, total available bandwidth, and latency. The larger the amount of data to be validated, the longer validation takes. The smaller the available bandwidth, the longer validation takes. The higher the latency, the longer validation takes, since validation requires many short communications back and forth.

Latency has a surprisingly large impact on validation time. Even with a latency of only 100 milliseconds from one site to another, validation consumes nearly twice the time it would with a latency of 50 milliseconds, and almost 100 times the time it would with a latency of 1 millisecond. When the Cores exchange hundreds of thousands of individual packets, a few milliseconds per round trip add up to a significant amount of time lost.
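
As a rough illustration, assume (purely for the sake of the math) that validating a large staging area requires about 200,000 round trips between the Cores:

200,000 round trips * 1 ms = 200 seconds, or a little over 3 minutes of latency overhead

200,000 round trips * 50 ms = 10,000 seconds, or roughly 2.8 hours

200,000 round trips * 100 ms = 20,000 seconds, or roughly 5.6 hours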

If a replication job fails because the target Core server loses power or crashes, or if the Core service is restarted, then replication progress is lost: the data in the staging area is flushed when the repository is mounted after the Core service starts. The replication job must be repeated, and all data transfer must begin anew. Maintaining the stability of your target Core server is therefore critical to ensuring that replication jobs either do not fail or can be resumed after a failure or cancellation without losing transfer progress already made.

The effect of deduplication on replication speed and performance

In the example above, all calculations assumed the data was brand new to the target repository.

But what happens if you have a repository that is already in use, and you capture a second base image for a server that has already been replicated? One could assume that replicating deduplicated data would be significantly faster than replicating the first base image, but this may not be the case.

Think of the process as working much the same as if the data had already been written to the staging area: the Core must validate deduplicated data, just as it must after a replication failure or cancellation. This validation requires data transfer from the source to the target, and the larger the base image, the more validation must be performed.

Even if the Core is able to successfully deduplicate most of a base image, the replication process will not be correspondingly quick. According to initial baseline tests, replication of deduplicated data still requires about 10% of the total amount of data transmission in order to validate the data. Therefore, if the first base image involved transmitting 225 GB of data, the second base image (even if all of its blocks are duplicates) still requires about 22.5 GB of data transmission.

The key factor that can help or hinder replication of duplicate base images is the deduplication cache. Replication relies solely on the deduplication cache to reduce the amount of data that must be transmitted to the target Core. If you have multiple base images being replicated and you max out the deduplication cache, you will see a decrease in replication performance: a full deduplication cache no longer contains many of the deduplication hash values required, causing the remaining untransmitted blocks to be treated as if they are new to the target Core. As a result, the source Core must transfer that data over the network again. Proper deduplication cache sizing is therefore critical to ensuring optimum replication performance.
