Deduplication – Understanding the Cleaning Process

The demands and expectations placed on modern backup and recovery solutions are challenging. It is not unreasonable for small and medium-sized companies to accumulate hundreds of terabytes of backup data. If backup administrators are not paying attention, they can easily wander into expensive territory.

Besides the initial investment in repository storage, additional expenses include power and cooling, data center footprint, maintenance, and management.

Because of this, deduplication technology was developed specifically to reduce backup repository capacity, helping data centers meet today’s backup challenges. Its impact is significant: it reduces spending on repository storage, management, maintenance, and power and cooling. Deduplication allows IT administrators to do more with less.

When evaluating deduplication solutions, a clear understanding of an environment’s needs, such as scalability, ingest rates (performance), deduplication abilities (source side, target side, in-line, post-process), and replication abilities, is extremely important to finding the right solution. But when working with customers, we have discovered that one critical piece of information is typically not accounted for: a proper understanding of the deduplication solution’s cleaner process. Let me explain why this understanding is important.

A deduplication cleaner’s job is to reclaim unused space. It does so by processing data block reference counts and removing data as it ages out. A data block’s reference count is updated when new data is ingested or when data expires. Over time, when a data block’s reference count reaches zero (no backup references this data anymore), the cleaner removes the unreferenced data, resulting in additional free space. Because the cleaner can be an IO-intensive process, it can run for long periods when large amounts of data have aged out (or have been manually deleted). Running the cleaner process during an active backup or recovery is not recommended, as it introduces contention for the deduplication solution’s disk resources, which results in slower backup performance and longer backup windows. Because of this, cleaner processes are scheduled to run outside of backup windows, when the deduplication solution is idle.
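The reference-counting behavior described above can be sketched in a few lines of Python. This is a toy model only, not any vendor's implementation: the `BlockStore` class, its method names, and the use of string fingerprints are all assumptions made for illustration.

```python
from collections import Counter


class BlockStore:
    """Toy deduplicated repository (hypothetical, for illustration only)."""

    def __init__(self):
        self.blocks = {}           # fingerprint -> stored bytes
        self.refcount = Counter()  # fingerprint -> number of backup references

    def ingest(self, fingerprint, data):
        # New data is stored once; repeated data only raises the count.
        if fingerprint not in self.blocks:
            self.blocks[fingerprint] = data
        self.refcount[fingerprint] += 1

    def expire(self, fingerprint):
        # A backup aging out (or being deleted) lowers the count.
        # No space is freed at this point.
        self.refcount[fingerprint] -= 1

    def clean(self):
        # Cleaner pass: drop blocks that no backup references anymore
        # and return the number of bytes reclaimed.
        dead = [fp for fp in self.blocks if self.refcount[fp] <= 0]
        reclaimed = 0
        for fp in dead:
            reclaimed += len(self.blocks.pop(fp))
            del self.refcount[fp]
        return reclaimed


store = BlockStore()
store.ingest("a", b"x" * 4096)
store.ingest("a", b"x" * 4096)  # duplicate: stored once, referenced twice
store.ingest("b", b"y" * 4096)
store.expire("b")               # block "b" is now unreferenced
print(store.clean())            # 4096
```

Note that expiring data only changes a counter. The space comes back when the cleaner pass runs, which is why a cleaner that never gets time to run lets the repository fill up.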

Because some backup environments run 24x7 (or close to it), administrators may lack an idle window long enough for the cleaner to complete. If this occurs, the deduplication solution’s storage begins to fill up rapidly, creating a backup logjam. To assist, some vendors offer cleaner strategies that help ensure the cleaner process adequately reclaims space.

Thus, during the evaluation process for a deduplication solution, gaining an understanding of the solution’s cleaner process is critical to achieving backup SLAs. An efficient cleaner is important and desirable for any backup solution that utilizes deduplication.

Quest DR Series appliances implement several efficiencies so that the cleaner finishes as quickly as possible. For example:

  1. When large amounts of data have aged out or been deleted, cleaning all of it could take some time before any reclaimed space becomes visible. Instead, when large amounts of data need to be cleaned, DR Series appliances reclaim smaller chunks of capacity at more frequent intervals. This way, free space becomes available steadily and earlier, when you need it, rather than all at once at a later time.
  2. If there is work to do, the DR Series appliance should not be idle. This means that if the DR Series appliance becomes idle, it will automatically start its cleaner process (if there is cleaning work to do).  When the ingesting of backup data begins, the cleaning process is automatically paused so it does not interfere with backup ingest performance.  For busy environments, the DR cleaning process will ensure cleaning work is executed during optimal times.
  3. In the event that capacity reclamation becomes a high priority, the DR Series appliance cleaner can run concurrently with other appliance activities until it has completed.
  4. The cleaner process intelligently sorts ingests and deletions so that reference counts can be updated as quickly as possible. For example, if a block of data has been ingested three times and has been deleted (or expired) one time, the DR cleaning process sorts through the block events so that only one disk IO is issued to update the block’s reference count, rather than one per event. In this example, the cleaner will increase the block’s reference count by two.
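The batching idea in item 4 can be illustrated with a minimal Python sketch. It is an assumption-laden model, not the DR Series implementation: the `coalesce` and `apply_updates` functions, the `(fingerprint, delta)` event format, and the dictionary standing in for on-disk reference counts are all hypothetical.

```python
from collections import defaultdict


def coalesce(events):
    """Collapse (fingerprint, delta) events into one net delta per block."""
    net = defaultdict(int)
    for fingerprint, delta in events:
        net[fingerprint] += delta
    # Blocks whose events cancel out need no update at all.
    return {fp: d for fp, d in net.items() if d != 0}


def apply_updates(refcounts, events):
    """Apply coalesced updates and return how many count writes were made."""
    writes = 0
    for fp, delta in coalesce(events).items():
        refcounts[fp] = refcounts.get(fp, 0) + delta  # one write per block
        writes += 1
    return writes


# Three ingests (+1 each) and one expiry (-1) of the same block.
events = [("blk", +1), ("blk", +1), ("blk", +1), ("blk", -1)]
refcounts = {}
print(apply_updates(refcounts, events))  # 1
print(refcounts)                         # {'blk': 2}
```

In this model the four events touch the stored count once, with a net change of two, matching the example in item 4.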

Understanding and defining a deduplication cleaning schedule is a critical part of choosing the right deduplication solution for a backup environment. Quest DR Series appliances offer a cleaner process that is flexible, configurable, and intelligent. Additional information about the DR Series appliances’ cleaner process can be found here.

About the Author
Ward.Wolfram
Ward is a member of Quest Technical Marketing and has been focused on Backup and Recovery solutions since 2013. Prior to Quest, Ward has invested 18 years surrounding various enterprise storage solution...