Backup Compression and Deduplication: Good or Bad? Part II

Part II of II


Click here for Backup Compression and Deduplication: Good or Bad? Part I


In Part I, I discussed how using backup compression with a deduplication device may actually be better for backup and restore performance than skipping backup compression and relying on the deduplication device to do all the work. I also argued that it might even reduce your backup footprint. In Part II, we'll look at a real-life example.


Like most of you, I eventually found myself faced with the backup compression versus dedupe question. Since I had access to a lot of development resources, we went to work testing the various scenarios using LiteSpeed for SQL Server 6.5 with a deduplication appliance we have in the lab. We used a copy of a real production database that experienced daily changes from the various transactions that run throughout the day (new data from inserts, updates to existing data, and removal of old data via deletes).


Every night, we copied this database into the lab, ran two full backups, and measured the duration of each: one compressed using LiteSpeed and one native with no compression at all. At the end of 21 days, we had two sets of backups that we pulled off the deduplication device. We could then send them back to the device, one backup at a time, one set at a time, and measure how well the device was able to further reduce storage as each backup was added. Data compression was enabled on the deduplication device to allow it to further compress the saved chunks.


What we found was that with these database backups, the sweet spot was 12 days. In other words, the best deduplication effectiveness was achieved after the 12th full backup was complete. This is what you'd expect in most environments. Many deduplication vendors recommend you place 14 days or more of backups on the devices. Why? Because the more backups of the same database you have on the device, the more common information you'll likely have to deduplicate and the better the deduplication ratios. Unless there is a pressing reason, though, you may not need to keep that many backups on disk. If you only need to keep 7 days of backups, your deduplication effectiveness may be reduced (ours was reduced by about 30%).
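To make the retention argument concrete, here is a rough back-of-the-envelope sketch. It is not the lab measurement. It assumes an idealized device that stores the first full backup whole and, for every later backup, stores only the fraction of data that changed that day. The function name and the 5% daily change rate are assumptions for illustration, and the model ignores index overhead and device-side compression, so it will not reproduce a sweet spot like the one we measured.

```python
def ideal_dedupe_ratio(num_backups, daily_change):
    """Logical size divided by stored size under an idealized model.

    Sizes are expressed in units of one full backup. The first backup is
    stored whole; each later backup adds only its changed fraction.
    """
    if num_backups < 1:
        raise ValueError("num_backups must be at least 1")
    logical = float(num_backups)
    stored = 1.0 + (num_backups - 1) * daily_change
    return logical / stored


# Assumed 5% daily change; compare a few retention periods.
for days in (7, 12, 14, 21):
    print(days, "backups:", round(ideal_dedupe_ratio(days, 0.05), 2))
```

Even in this best case, the ratio climbs slowly as backups accumulate, which is the intuition behind keeping more than a week of backups on the device.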


Imagine two different databases with a single full backup of each placed on a dedupe device. Without dedupe compression, what you’d probably see is an increase in storage when compared to the sum of the individual database backups. Why? Because index entries are created and saved by deduplication devices for each chunk of data. The index is storage overhead. If you have two different databases, then you won't have many matches (if any) and you end up with the overhead of the index without any offset by deduplication.
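The index overhead is easy to see with a little arithmetic. The sketch below is a toy model, not how any particular appliance works: the 8 KiB chunk size, the 64-byte index entry, and the `footprint` function are all assumptions picked for illustration, and device-side compression is left out to match the scenario above.

```python
def footprint(backups, chunk_size=8192, index_entry_size=64):
    """Return (raw_bytes, stored_bytes) for a list of backups.

    Each backup is a list of chunk identifiers. Identical identifiers in
    different backups stand for chunks with identical contents.
    """
    raw = sum(len(backup) for backup in backups) * chunk_size
    unique = set()
    for backup in backups:
        unique.update(backup)
    # Every unique chunk costs its data plus one index entry.
    stored = len(unique) * (chunk_size + index_entry_size)
    return raw, stored


db_a = [("A", i) for i in range(1000)]  # 1000 chunks from database A
db_b = [("B", i) for i in range(1000)]  # 1000 unrelated chunks from database B

print("two different databases:", footprint([db_a, db_b]))
print("same database twice:    ", footprint([db_a, db_a]))
```

With no matching chunks, the stored size comes out larger than the raw size because the index is pure overhead. With two copies of the same database, the index cost is easily paid back.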


Contrast that with 12 daily backups of the same database. You are going to find a lot of matches, especially when performing full backups. I could argue that using LiteSpeed's Fast Compression is a better way to go since it further reduces the backup footprint at SQL Server by using differential backups, but we'll leave that discussion for another blog post. Even if your database experiences only 5% change per day, you're not likely to achieve 20:1 deduplication. Inserts can cause page splits. Updates change a small amount of data on a page. A single byte of change in a column in a row may be enough of a difference for a dedupe device to treat the data around it as a new chunk. Most dedupe devices use things like variable chunk sizes to reduce the effect small data changes have on overall deduplication effectiveness. This helps, but it doesn't eliminate the issue.
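Here is a small sketch of why variable chunk sizes help when data shifts. It is a toy, not a description of any vendor's algorithm: the function names, the 64-byte fixed chunk size, the 16-byte window, and the use of `zlib.crc32` as a stand-in for a proper rolling hash are all assumptions. It inserts one byte into the middle of a random buffer and counts how many chunks the "device" has never seen before under each scheme.

```python
import hashlib
import random
import zlib


def fixed_chunks(data, size=64):
    """Split data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def variable_chunks(data, window=16, mask=0x3F):
    """Cut after any byte where a checksum of the trailing window has its
    low bits clear, so boundaries follow content rather than offsets."""
    chunks, start = [], 0
    for end in range(1, len(data) + 1):
        if end - start < window:
            continue  # enforce a minimum chunk size of one window
        if zlib.crc32(data[end - window:end]) & mask == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def new_chunk_count(old_chunks, new_chunks):
    """Count chunks in new_chunks whose contents were not in old_chunks."""
    seen = {hashlib.sha256(c).digest() for c in old_chunks}
    return sum(1 for c in new_chunks if hashlib.sha256(c).digest() not in seen)


rng = random.Random(42)
monday = bytes(rng.getrandbits(8) for _ in range(20000))
tuesday = monday[:10000] + b"\x00" + monday[10000:]  # one byte inserted mid-file

new_fixed = new_chunk_count(fixed_chunks(monday), fixed_chunks(tuesday))
new_variable = new_chunk_count(variable_chunks(monday), variable_chunks(tuesday))
print("new chunks with fixed-size chunking:   ", new_fixed)
print("new chunks with variable-size chunking:", new_variable)
```

With fixed-size chunks, everything after the inserted byte shifts and looks new. With content-defined boundaries, the chunking falls back into step shortly after the change, so only the chunks near the insert are new. Note that a one-byte change still costs at least one whole new chunk either way, which is why the issue never fully goes away.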


Our figures include index size. As I mentioned earlier, deduplication devices use indexes that track all the chunks of data, and we count the size of that index in the overall deduplication figures. That only seems fair. If a car dealer told you a car could do 0-60 mph in 4.5 seconds when no one was in it, you'd go somewhere else, since I'm pretty sure you're planning to sit behind the wheel and I'm also pretty sure you weigh more than 0. The index is the cost of doing business, so we account for it in the numbers.


What we found was that compressed LiteSpeed backups were faster to run, faster to restore, used less storage, and required less network bandwidth. Your results may vary.

