Rapid Recovery is often compared with other backup products and deemed to be "not enterprise-capable." Many of these comparisons are made without a full understanding of Rapid Recovery's design and default configuration. This blog post attempts to dispel some of the myths in these comparisons. In doing so, it also gives advice to users who want to protect many machines with Rapid Recovery using fewer resources and longer, "enterprise-scale" recovery point objectives (RPO).
Proper sizing of a Rapid Recovery Core is critical to ensuring that the system can manage the amount of data and the jobs you intend to run. Some key factors which affect performance include:
- the number of machines you want to protect
- frequency of backups
- change rate
- post-backup processing (offsite replication, backup checks, VM export, archiving)
By understanding how each of these factors affects the performance of the Core, it is possible to scale Rapid Recovery effectively to protect hundreds or thousands of machines.
Change rate and frequency of backups directly impact the number of machines a single Rapid Recovery Core can protect effectively. In testing, we have found that a Rapid Recovery Core server can handle thousands of machines with small change rates and no more than one backup per day. As the change rate or the frequency of backups increases, the load on the Core increases correspondingly, and the number of machines that can be protected decreases.
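The relationship described above can be illustrated with a rough back-of-envelope calculation. The sketch below is not an official sizing formula: the function name, its parameters, and the sample figures are all hypothetical, and the model deliberately ignores deduplication, compression, and post-backup jobs. It only shows that the change rate drives the volume of data a Core ingests each day, while the backup frequency drives the number of jobs it must schedule.

```python
def estimate_daily_load(machines, avg_protected_gb, daily_change_pct, backups_per_day):
    """Rough daily load on a Core (illustrative model, not vendor guidance).

    machines          -- number of protected machines
    avg_protected_gb  -- average protected data per machine, in GB
    daily_change_pct  -- percentage of that data that changes per day
    backups_per_day   -- scheduled backups per machine per day
    """
    # Every scheduled backup is a job the Core has to run and track.
    jobs_per_day = machines * backups_per_day
    # Changed data is what incremental backups transfer (before dedup/compression).
    changed_gb_per_day = machines * avg_protected_gb * daily_change_pct / 100
    return {
        "jobs_per_day": jobs_per_day,
        "changed_gb_per_day": changed_gb_per_day,
        "avg_gb_per_job": changed_gb_per_day / jobs_per_day,
    }


# Hypothetical environment: 1,000 machines, 200 GB each, 2% daily change.
daily = estimate_daily_load(1000, 200, 2, backups_per_day=1)
hourly = estimate_daily_load(1000, 200, 2, backups_per_day=24)
print(daily)
print(hourly)
```

In this simplified model, moving from daily to hourly backups leaves the changed data per day unchanged but multiplies the job count by 24, which is one way to see why backup frequency limits how many machines a single Core can protect.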
Post-backup processing adds load on a Core server. Any job that checks recovery point data (SQL attachability, Exchange mountability, nightly integrity checks) adds load on the Core server. The more protected machines that have recovery point checks enabled, the fewer protected machines a Core can handle.
Replication, archiving, and virtual standby add unique loads to a Core server. When only backups are running, the majority of the load on the repository is writing data. Writes to a storage array are often highly efficient and can use caching to allow large quantities of data to be processed quickly. Replication, archiving, and virtual standby require reads from the repository in order to copy the data to either the remote Core, the target location for the archive, or the hypervisor on which the virtual standby machine is being updated. Reads are far less efficient and cannot be cached effectively by storage arrays. Deduplication also increases the likelihood that any read the Core software must do is randomly located, since data are often not stored sequentially. The more machines that have replication, archiving, or virtual standby configured, the fewer machines can be protected by a Core server efficiently.
When protecting hundreds of machines, there are some weaknesses you should be aware of:
- Deploying Rapid Recovery Agent to hundreds of machines simultaneously may take significant time.
- Agentless protection of hundreds of machines at once may take significant time.
- Navigating through the UI with many protected machines can be slow.
- Protection schedules and retention policy cannot be configured based on labels.
- The Mongo database (which provides the back-end for the Events page) may become very large and slow, as large quantities of events are stored.
When you protect hundreds of machines, the performance of the underlying disk array is critical to overall performance. The following practices may help mitigate the weaknesses listed above:
- Size the deduplication cache to appropriately use the available RAM for the attached repositories.
- Increase the number and size of retained AppRecovery.log files. This is critical to monitoring and troubleshooting.
- Tune the retention policy to avoid rollup of the base image every day.
- Use the Protect Multiple Machines wizard to batch-protect a group of machines that you want to back up at the same time and with the same backup interval. Doing so removes the need to change individual schedules manually.
- Protect machines in batches, staggering backup times. Observe the performance of the first group prior to adding the next batch.
Example: On Monday, protect one-eighth of your machines with a protection window that starts at 6 AM. On Thursday, protect another eighth of the total machines, with a protection window starting at 8 AM. Repeat this process on Mondays and Thursdays for the next three weeks, staggering start times by two hours, until all machines are protected. Monitor the Core's performance as you go, and adjust accordingly (for example, you may increase the stagger between start times to three hours).
- Ensure the OS volume of the Rapid Recovery server uses SSD. This improves the speed of the job that persists the deduplication cache, the responsiveness of the Mongo database which stores all events, and the speed of mounting recovery points.
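The staggered rollout in the example above (eight batches, onboarded on Mondays and Thursdays, with start times two hours apart) can be planned ahead of time with a short script. The following is a minimal sketch for planning purposes only: it does not call any Rapid Recovery API, and the function name, parameters, and machine names are hypothetical. The resulting plan would still be applied by hand, for instance through the Protect Multiple Machines wizard.

```python
from datetime import date, time, timedelta


def staggered_rollout(machines, first_monday, batches=8,
                      first_start_hour=6, stagger_hours=2):
    """Plan a batched rollout on alternating Mondays and Thursdays."""
    if first_monday.weekday() != 0:
        raise ValueError("first_monday must be a Monday")
    plan = []
    for i in range(batches):
        # Two batches per week: slot 0 is Monday, slot 1 is Thursday.
        week, slot = divmod(i, 2)
        onboard_on = first_monday + timedelta(weeks=week, days=3 * slot)
        # Each batch starts its protection window later than the previous one.
        start_hour = (first_start_hour + stagger_hours * i) % 24
        plan.append({
            "batch": i + 1,
            "onboard_on": onboard_on,
            "window_start": time(hour=start_hour),
            # Round-robin split keeps batch sizes within one machine of each other.
            "machines": machines[i::batches],
        })
    return plan


# Hypothetical inventory of 800 machines, starting on Monday 2024-01-01.
inventory = [f"srv-{n:03d}" for n in range(800)]
for entry in staggered_rollout(inventory, date(2024, 1, 1)):
    print(entry["batch"], entry["onboard_on"], entry["window_start"],
          len(entry["machines"]))
```

If monitoring shows the Core struggling after a batch, rerunning the plan with `stagger_hours=3` spreads the windows further apart; start hours wrap past midnight in that case.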
In summary, it is possible to use Rapid Recovery as an enterprise backup product, protecting several hundred machines with a single Core server. By understanding the design of Rapid Recovery and setting some key configurations, you can make it a highly effective backup product for large environments.