Part I of II
One of the more interesting storage improvements in recent years was the addition of deduplication to what was traditionally “just plain storage”. Deduplication makes better use of storage by reducing the amount of duplicate information stored on disk. Deduplication breaks up all files into chunks and looks for duplicate information. Instead of storing the same bits over and over, only unique bits are stored. Storage saved. This is completely transparent to the end user. There is generally a lot of duplicate information between files and it’s not uncommon to have deduplication reduce the storage footprint by 10-20X. That means that 20-40TB of data could fit on 2TB of actual storage.
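Conceptually, the chunk-and-hash scheme can be sketched in a few lines (the fixed-size chunks and in-memory dict below are simplifications for illustration; real appliances typically use variable-length chunking and on-disk indexes):

```python
import hashlib

CHUNK_SIZE = 4096  # illustrative fixed chunk size

def dedupe(files):
    """Store each unique chunk once; return (store, recipes), where each
    recipe is the ordered list of chunk hashes needed to rebuild a file."""
    store = {}    # hash -> chunk bytes (unique chunks only)
    recipes = {}  # filename -> ordered list of chunk hashes
    for name, data in files.items():
        hashes = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            h = hashlib.sha256(chunk).hexdigest()
            store.setdefault(h, chunk)  # duplicate chunks are stored only once
            hashes.append(h)
        recipes[name] = hashes
    return store, recipes

# Two 8 KB files that share their first 4 KB chunk:
files = {"a.bak": b"A" * 4096 + b"B" * 4096,
         "b.bak": b"A" * 4096 + b"C" * 4096}
store, recipes = dedupe(files)
print(len(store))  # 3 unique chunks stored instead of 4
```

Reading a file back is just a matter of walking its recipe and concatenating the stored chunks, which is the "rehydration" we discuss below.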
One thing that helps deduplication effectiveness is avoiding data compression during backup, and deduplication vendors drill this concept into their customers. Compression reduces the number of bytes in the original file, but it also randomizes the remaining bytes, and files with random bytes simply do not dedupe as well as their uncompressed counterparts. For a small text file or Access database that transfers to disk in a second or two, skipping compression in your backup software is fine. In those cases, letting the deduplication storage sort out the files has no significant impact on the end user. When those files are read, the deduplication appliance reassembles them quickly because they are small. But is this the same experience DBAs have with their database backups?
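You can see this effect with a toy experiment: build a text-like "backup", change a single byte (as a nightly backup of a mostly unchanged database would), and count how many fixed-size chunks the two versions share before and after zlib compression. The chunk size and synthetic data are illustrative assumptions, not a model of any particular appliance:

```python
import hashlib
import random
import zlib

CHUNK = 4096  # illustrative fixed chunk size

def chunk_hashes(data: bytes) -> set:
    """Hash each fixed-size chunk; shared hashes are what dedupe can exploit."""
    return {hashlib.sha256(data[i:i + CHUNK]).hexdigest()
            for i in range(0, len(data), CHUNK)}

# A compressible, text-like ~45 KB "backup", then the same data with one byte changed.
words = ["backup", "database", "restore", "chunk", "dedupe", "storage"]
random.seed(0)
v1 = " ".join(random.choice(words) for _ in range(6000)).encode()
v2 = b"X" + v1[1:]  # today's backup: a single changed byte

raw_shared = len(chunk_hashes(v1) & chunk_hashes(v2))
zip_shared = len(chunk_hashes(zlib.compress(v1)) & chunk_hashes(zlib.compress(v2)))
print(raw_shared, zip_shared)  # nearly all raw chunks shared; far fewer after zlib
```

Uncompressed, only the chunk containing the changed byte differs; once the streams are run through zlib, the change ripples through the compressed output and the versions share far fewer chunks for the appliance to deduplicate.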
Deduplication generally arrives at the corporate IT level for general storage and backup: Windows admins use the appliances for file servers, and backup admins use them as targets for backing up those file servers. Eventually, though, word comes down to the DBAs with the request that they start using this new storage for backups. And that request usually comes with the following: "You need to stop compressing your backups." For customers using a third-party product like LiteSpeed® for SQL Server, this usually results in a call to Support asking for guidance. DBAs know the benefits of compression, but what should they do with these competing demands? The truth is that backup compression and deduplication are more complementary than conflicting, and we tell customers that it's in their best interest to continue compressing their backups.
First, let's clear up one myth: compressed files do not break deduplication devices. You're not putting your new, expensive storage at risk; in fact, you might even be helping it. Now that we've got that out of the way, let's talk about what you get when you compress versus when you don't.
Deduplication vendors base their ROI on how effectively they deduplicate data. But ratios are just numbers and without some context, they do not reveal the complete story. DBAs have some clear goals as they relate to backup:
Improve Backup Speed
How do you improve backup speed? You write data faster. How do you write data faster? You reduce the number of bytes written to disk through source-side compression. Compression has the benefit of making your disk target look more impressive than it really is.
In most data centers, the disks used to host database data are designed for high performance. Disks are the slowest part of the server, so you achieve high IO rates and transaction throughput by using many disks in your arrays and implementing schemes like RAID 10. These arrays are often capable of reading and writing data extremely fast. Most backup locations, on the other hand, are far less powerful and cannot write nearly as fast as your data drives can read. They are designed for capacity and may implement schemes like RAID 5 to reduce cost; RAID 5 has far slower write speeds than RAID 10.
It's a least-common-denominator problem. Your backup drives cannot write as fast as you can read the data, so your backups are bottlenecked at the write speed of the target, and backups run slower. You overcome this problem with compression. If your backup drives are half the speed of your data drives, compressing the data by 50% completely overcomes the disparity. Backups that used to take 4 hours now run in 2.
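The arithmetic behind that claim is easy to sketch. The 400 MB/s read and 200 MB/s write speeds below are hypothetical numbers chosen to match the two-to-one disparity in the example:

```python
def backup_hours(data_gb, read_mb_s, write_mb_s, output_fraction):
    """Backup time is bounded by whichever is slower: reading the full
    data set, or writing the (possibly compressed) stream to the target.
    output_fraction = compressed size / original size (1.0 = no compression)."""
    read_h = data_gb * 1024 / read_mb_s / 3600
    write_h = data_gb * output_fraction * 1024 / write_mb_s / 3600
    return max(read_h, write_h)

# Hypothetical speeds: data array reads at 400 MB/s, backup target writes at 200 MB/s.
plain = backup_hours(1024, 400, 200, 1.0)   # 1 TB, uncompressed
packed = backup_hours(1024, 400, 200, 0.5)  # same data compressed by 50%
print(round(plain, 2), round(packed, 2))    # 1.46 0.73 -- compression halves the time
```

With 50% compression the write path no longer dominates, and the backup runs exactly as fast as the data can be read.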
Your new deduplication devices are no different. They have a maximum ingest rate, and it's unlikely they can match the read performance of your database data drives. That may be true even for a single backup, and the gap becomes more pronounced when many backup operations run simultaneously against the same deduplication target, which is a likely scenario. Not compressing your backups may make them take longer, and that's not something you want. It may mean you're operating outside your maintenance windows. It may also mean longer restores.
If you're concerned about CPU utilization on your database server to achieve the compression, don't be. Products like LiteSpeed® include compression levels specifically designed for very low CPU utilization and also include features like Adaptive Compression to ensure backups adapt to changing server loads.
Improve Restore Speed
How do you improve restore speed? You read data faster. How do you read data faster? You reduce the amount of data you need to read by compressing it first at the source.
When you read data from a deduplication device, the backup file must be rehydrated. Rehydration is the process of taking all the deduplicated parts and putting them back together for transmission to the client. Rehydration has overhead. The larger the backup file, the more overhead you have. And that can mean longer restore times. Placing less data on the deduplication device means that rehydration is faster as there are fewer parts to put back together. Your restores run faster as a result.
Again, if you're concerned about the CPU your database server needs to decompress backups during a restore, don't be. Decompression is computationally cheap and uses very little CPU.
Reduce Storage
How do you reduce storage? You store less data. How do you store less data? You compress and deduplicate. Remember that deduplication appliances will deduplicate compressed data. Most databases change little day to day, and even with compression there will be common bits between backups. Compressed backups should still see something like a 2:1 dedupe reduction, compared to closer to 10:1 for uncompressed backups.
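A back-of-the-envelope calculator makes the point. The 5:1 compression ratio and the 2:1 versus 10:1 dedupe ratios below are illustrative assumptions, not measurements:

```python
def footprint_gb(data_gb, compression_ratio, dedupe_ratio):
    """Return (bytes sent to the appliance, bytes actually stored after dedupe),
    both in GB, for a given source-side compression ratio and dedupe ratio."""
    sent = data_gb / compression_ratio
    stored = sent / dedupe_ratio
    return sent, stored

# Hypothetical ratios for a 1 TB (1024 GB) database:
raw = footprint_gb(1024, 1.0, 10.0)  # no compression, 10:1 dedupe
lzd = footprint_gb(1024, 5.0, 2.0)   # 5:1 source compression, 2:1 dedupe
print(raw)  # (1024.0, 102.4) -- a full 1 TB over the wire, ~102 GB stored
print(lzd)  # (204.8, 102.4)  -- only ~205 GB over the wire, the same ~102 GB stored
```

Under these assumed ratios the final footprint is the same either way, but the compressed backup moves a fifth of the data, which is exactly why the dedupe ratio alone is a misleading metric.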
But don't be fooled by thinking the deduplication ratio is the most important metric. The dedupe ratio only tells you how well the data you sent to the deduplication device has been reduced. It doesn’t tell you what the ultimate storage footprint looks like because it doesn’t consider what was done to the data before it was sent.
We'll review a full example in Part II of this post, but consider the following example of a 1TB database:
Backup Compression & Deduplication
The "do not compress" decree is misguided. It's not in the best interest of DBAs, as it will very likely increase backup and restore times. It's not in the best interest of the businesses that depend on the applications those databases support, as it may not meet their restore time objectives. It's not in the best interest of the storage administrator, because it may increase overall storage requirements. And it's not in the best interest of the deduplication devices themselves, as they must work harder to deduplicate far more data.
We encourage customers to test. Compare your backup speed with and without backup compression. Compare your restore speed. Calculate the actual storage used.
In the next post, I'll show an actual example using LiteSpeed for SQL Server on a TPC-E database with a deduplication appliance. The results should prove interesting.
Click here for Backup Compression and Deduplication: Good or Bad? Part II
Hi David, I know this is an old post, but I think you are missing one important thing here: compressed and deduplicated backups will 'consume' 75GB each time you take a backup, while the 100GB of uncompressed, deduplicated data will grow by only a very small margin with each subsequent backup, since the majority of the data is recognized as common and 'deduped'. In my small tests I see daily growth of under 1GB on a 43GB uncompressed SQL backup (Windows Server 2016 dedup).
I fully agree with the other points about network transfer volume, etc.
I've put more details here: