Understanding Advertised Deduplication Savings for Backup Workloads

Most backup solution materials for software and hardware deduplication appliances advertise measurements of efficiency such as “… up to 20:1 savings…” or “… experience 95% backup capacity savings…”

These advertised deduplication savings values can be misleading if not properly understood. The goal of this blog is to equip you with the elements needed to understand advertised measurements and to help make better deduplication decisions.

Deduplication simply removes redundant data. For example, your company name may exist on thousands of documents. During a general backup, the copy of your company’s data will include the thousands of redundant copies of your company’s name. By applying deduplication to backups, only one copy of your company name is preserved resulting in significant backup storage savings. As future backups are preserved, deduplication continues to discard redundant data when a copy is already stored.

Deduplication adds significant cost savings to current and new backup environments, such as 1) reduced backup storage hardware investments; 2) enhanced backup performance; 3) reduced data center space; 4) reduced power and cooling use; and 5) less to manage. The higher the deduplication savings, the greater these benefits.

Calculating Deduplication Savings
The first step to understanding advertised deduplication savings is the knowledge of how deduplication savings are calculated.

Measuring deduplication is simple:
   1. Measure the amount of data ingested into the deduplication device (X).
   2. Measure the amount of data the deduplication device preserves (Y).
   3. Express the deduplication savings as a ratio (X:Y) or a percentage.

Note: A deduplication device can be either a hardware or a software deduplication offering. For example, hardware-based backup-to-disk appliances such as the Quest DR Series, Dell EMC Data Domain or HP StoreOnce are popular, and many backup software products, such as CommVault, Veeam or Quest Rapid Recovery, include native deduplication options. Additionally, the content of this blog applies to deduplication within primary storage solutions, but the focus here is on backup workloads.

For example, if 20GB of backup data is ingested into the deduplication device (X=20) and only 1GB of unique backup data is preserved on the deduplication appliance disk (Y=1), a deduplication vendor may advertise a 20:1 savings ratio or 20X savings.

Deduplication savings may also be expressed as a percent. To determine the percentage value of a 20:1 savings ratio:

1. Find the deduplication savings amount. In this example, 20-1=19GB of data was saved.

2. Now divide the savings amount by the initial backup amount. In this example, this is 19/20 = 95%.

Thus, a 20:1 savings ratio is equivalent to a 95% savings percentage.
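
The calculation above can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function names are made up for this example and do not come from any product.

```python
def savings_ratio(ingested_gb: float, stored_gb: float) -> float:
    """Return the N in an N:1 savings ratio (X / Y)."""
    return ingested_gb / stored_gb


def savings_percent(ingested_gb: float, stored_gb: float) -> float:
    """Return the percentage of ingested data that did not need to be stored."""
    return (ingested_gb - stored_gb) * 100 / ingested_gb


x, y = 20, 1  # the 20GB ingested / 1GB preserved example from the text
print(f"{savings_ratio(x, y):g}:1 ratio, {savings_percent(x, y):g}% savings")
```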

Is 40:1 savings twice as good as 20:1 savings?
If one is not careful, storage savings ratios can be deceptive. Because of this, it is recommended to convert these ratios to their equivalent savings percentages for clarity.
   • 40:1 savings ratio is equivalent to 97.5% savings
   • 20:1 savings ratio is equivalent to 95.0% savings

As you can see, the difference between a 40:1 and a 20:1 savings ratio is only 2.5 percentage points of savings, not twice as much.
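
To see how quickly the percentages flatten out, here is a small sketch that converts a handful of ratios. The ratios other than 20:1 and 40:1 are arbitrary values picked for illustration.

```python
def ratio_to_percent(ratio: float) -> float:
    """Convert an N:1 savings ratio to a savings percentage."""
    return (1 - 1 / ratio) * 100


# Ratios chosen for illustration; only 20:1 and 40:1 appear in the text.
for ratio in (2, 5, 10, 20, 40):
    print(f"{ratio:>2}:1 -> {ratio_to_percent(ratio):.1f}% savings")
```

Each doubling of the ratio adds fewer percentage points than the one before it, which is why comparing percentages is usually clearer than comparing ratios.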

How many copies?
Here is where things get tricky: don’t expect 95% deduplication savings on the very first backup set ingested into a deduplication device, nor on the second or third backup. Why? When the very first backup is ingested into the deduplication solution, the data is mostly unique, resulting in low deduplication savings.

As a second backup is ingested, redundant data from the first backup is discovered, but don’t expect 95% storage savings here either. Why? Let’s break this down by examining deduplication’s best-case scenario: when no change occurs from the previous backup. In this case, all of the second backup data is redundant and can be deduplicated, resulting in total savings of 50% at best.

       Input: 2 backups
       Output: 1 backup
       Savings Total: 2:1 or 50%

*Note: Weekly backups typically include about 10% changed data. Since this changed data is predominantly new and unique, savings of ~45% instead of 50% would be expected:

       Input: 2 backups
       Output: 1.1 backups
       Savings Total: 2:1.1 (about 1.8:1) or 45%

When a third backup is ingested, the best case scenario savings total for a deduplication solution will be ~ 67%.

       Input: 3 backups
       Output: 1 backup
       Savings Total: 3:1 or 66.67%

As more backups are ingested into the deduplication device, better deduplication savings are achieved. It is easy to see that the number of backups greatly impacts the storage savings of a deduplication solution and must be considered when comparing deduplication solutions. It is important to remember: To accurately determine the effectiveness of a deduplication solution, the number of copies of data (or backups) to achieve the advertised savings must be understood.
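
The effect of the number of copies can be sketched with a simplified model. It assumes every backup is a full backup of the same size, that the first backup is entirely unique, and that each later backup contributes a fixed fraction of new data (`change_rate`). Real backup data will not behave this neatly, and the function name is hypothetical.

```python
def cumulative_savings(backups: int, change_rate: float = 0.0) -> float:
    """Savings percent after ingesting `backups` full backups of equal size.

    Assumes the first backup is entirely unique and every later backup
    contributes `change_rate` of a full backup as new, unique data.
    """
    ingested = backups
    stored = 1 + change_rate * (backups - 1)
    return (ingested - stored) * 100 / ingested


for n in (1, 2, 3, 10, 20):
    print(f"{n:>2} backups: best case {cumulative_savings(n):.2f}%, "
          f"10% change {cumulative_savings(n, 0.10):.2f}%")
```

Under these assumptions, even the best case needs 20 unchanged copies to reach 95% savings, and with a 10% change per backup the savings level off below 90% no matter how many copies are kept.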

Because deduplication solution vendors do not share the number of copies used to achieve their advertised savings, it is virtually impossible to compare deduplication effectiveness from one vendor to another. The same vendor could advertise the worst or the best deduplication savings simply by altering the number of copies used to achieve the desired result.

How to compare deduplication solutions for backup workloads?
I have two suggestions:

1. First, determine the kind of deduplication that is implemented. Many kinds of deduplication algorithms are used for different types of workloads, but variable-block sliding window deduplication engines are purpose-built for backup workloads and offer the best savings results.

Note: Image-based backup software solutions have recently become popular. Since these types of products work at the block level, a variable-block sliding window deduplication approach may not be an option. In these cases, a fixed block deduplication method would be the next preferred choice.

Note 2: Backup software solutions typically implement fixed block deduplication due to its simplicity and reduced need for computing resources, but the results are not as good as with variable block sliding window deduplication. Most backup target devices, such as the Quest DR Series appliances, implement variable block sliding window deduplication logic that uses resources efficiently, not just for the best storage savings but also for optimum performance.

2. The second and best way to evaluate deduplication savings is to conduct a test within your own environment. Because you are using your own data and any type of retention policy (number of copies), you can accurately determine what deduplication solution works best for you. Many backup deduplication vendors offer a full featured free trial virtual machine that can be easily downloaded and installed into an environment for testing and evaluation. These virtual machines provide the same deduplication savings as their hardware equivalents. Select the following link to test drive variable block sliding window deduplication using the Quest DR2000v – a backup target virtual machine.
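
As background for the first suggestion, the following toy sketch illustrates why variable-block approaches can find more redundancy than fixed-block ones when data shifts between backups. It is not any vendor's algorithm: the window size, the boundary mask, the use of CRC32 over the window in place of a true rolling hash, and the random test data are all assumptions made for illustration.

```python
import random
import zlib

WINDOW = 16   # bytes examined by the sliding window (illustrative value)
MASK = 0x3F   # cut when the low 6 hash bits are zero (~64-byte average chunk)


def fixed_chunks(data: bytes, size: int = 64) -> list[bytes]:
    """Cut data into fixed-size blocks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def variable_chunks(data: bytes) -> list[bytes]:
    """Cut data wherever a hash of the last WINDOW bytes matches a pattern."""
    chunks, start = [], 0
    for end in range(WINDOW, len(data)):
        if zlib.crc32(data[end - WINDOW:end]) & MASK == 0:
            chunks.append(data[start:end])
            start = end
    chunks.append(data[start:])
    return chunks


def shared_fraction(old: list[bytes], new: list[bytes]) -> float:
    """Fraction of the new backup's distinct chunks already stored from the old one."""
    return len(set(old) & set(new)) / len(set(new))


rng = random.Random(0)  # fixed seed so the run is repeatable
first_backup = bytes(rng.getrandbits(8) for _ in range(20_000))
second_backup = b"X" + first_backup  # one byte inserted at the front

fixed_shared = shared_fraction(fixed_chunks(first_backup), fixed_chunks(second_backup))
variable_shared = shared_fraction(variable_chunks(first_backup), variable_chunks(second_backup))
print(f"fixed block: {fixed_shared:.0%} of chunks reusable")
print(f"variable block: {variable_shared:.0%} of chunks reusable")
```

A single inserted byte shifts every fixed block so that none of them match the earlier backup, while the content-defined boundaries realign right after the insertion and nearly all chunks are found again.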

Compression is a different type of storage savings technology that reduces the number of bits needed to represent data. Deduplication and compression work well together. Typically, deduplication is first applied to remove redundant data followed by applying compression to the remaining deduplicated data to further improve savings.

Adding compression does not alter the deduplication savings themselves, which depend on the number of copies ingested as described above. Rather, compression adds to the overall storage savings of the deduplication solution.
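
As a rough sketch of how the two technologies combine, assume compression is applied to the data that remains after deduplication, so the two reduction factors multiply. The 2:1 compression ratio below is an assumed figure for illustration only, and the function name is hypothetical.

```python
def combined_savings_percent(dedup_ratio: float, compression_ratio: float) -> float:
    """Savings when compression is applied to data that is already deduplicated."""
    stored_fraction = 1 / (dedup_ratio * compression_ratio)
    return (1 - stored_fraction) * 100


dedup_only = combined_savings_percent(20, 1)        # 20:1 deduplication, no compression
with_compression = combined_savings_percent(20, 2)  # assumed 2:1 compression on top
print(f"{dedup_only:.1f}% -> {with_compression:.1f}%")
```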

To Sum It All Up:
When considering deduplication for backups, the advertised deduplication savings do not accurately reflect the storage savings for your environment. It is highly recommended to test a deduplication solution within your own backup environment, where you have a complete understanding of the type of data and the number of copies required to achieve a savings value. Comparing this way will ensure you have accurate information for making the best deduplication decisions for your backup needs. Looking for more? Download our free white paper, "What's the Right Deduplication Solution for Your Organization?"

Additional Deduplication Resources:
Download a free trial of the DR2000v virtual appliance
Deduplication and Backups; On Purpose for a Purpose