Data Deduplication

What is data deduplication?

Data deduplication is the process of reducing the amount of backup data stored by eliminating redundant copies of data. In its simplest terms, data deduplication maximizes storage utilization, allowing organizations to retain more backup data on disk for longer periods of time. This improves the efficiency of disk-based backup, lowers storage costs, and changes the way data is protected.

How does deduplication work?

Data deduplication works by comparing blocks of data or whole objects (files) to detect duplicates: new data is compared with existing data from previous backup or archiving jobs, and the redundancies are eliminated. Because only unique blocks are stored and transferred, replication bandwidth requirements are also reduced.
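In practice, the block comparison is usually done by fingerprinting each block with a cryptographic hash and storing only blocks whose fingerprints have not been seen before. A minimal sketch in Python (the `dedupe_blocks`/`restore` helpers and the in-memory `store` dict are illustrative, not any particular product's API):

```python
import hashlib

def dedupe_blocks(data: bytes, store: dict, block_size: int = 4096) -> list:
    """Split a byte stream into fixed-size blocks and keep each unique
    block once, keyed by its SHA-256 digest. Returns the list of digests
    (the "recipe") needed to reconstruct the stream."""
    recipe = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:        # only unique blocks are stored/transferred
            store[digest] = block
        recipe.append(digest)
    return recipe

def restore(recipe: list, store: dict) -> bytes:
    """Rebuild the original stream from its recipe of block digests."""
    return b"".join(store[digest] for digest in recipe)
```

A second backup that shares blocks with the first adds only its genuinely new blocks to the store, which is where the storage and bandwidth savings come from.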

Deduplication can take place at two levels: file and sub-file. Some systems compare only complete files, an approach known as Single Instance Storage (SIS). This is less efficient than sub-file deduplication, because any minor modification to a file forces the entire file to be stored again. Sub-file (block-based) deduplication uses either fixed-length blocks or sliding blocks for comparison.
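The distinction between fixed-length and sliding blocks matters when data shifts: inserting a few bytes near the start of a file changes every subsequent fixed-length block, while content-defined ("sliding") boundaries realign after the insertion, so most later chunks still match earlier backups. A hypothetical sketch using a simple rolling checksum (real systems use stronger rolling hashes, such as Rabin fingerprints):

```python
def content_defined_chunks(data: bytes, window: int = 16, mask: int = 0xFF):
    """Split data at content-defined boundaries: a chunk ends wherever a
    rolling checksum over the last `window` bytes matches a chosen bit
    pattern. Boundaries depend on local content, not absolute offsets,
    so they resynchronize after an insertion or deletion."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i - start >= window:            # slide the window forward
            rolling -= data[i - window]
        # boundary when the low bits of the rolling sum match the mask
        if i - start + 1 >= window and (rolling & mask) == mask:
            chunks.append(data[start:i + 1])
            start, rolling = i + 1, 0
    if start < len(data):                  # trailing partial chunk
        chunks.append(data[start:])
    return chunks
```

Joining the chunks always reproduces the original data; the payoff is that chunks downstream of an edit keep the same content, and therefore the same fingerprints, across backups.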

What are the different deduplication methods?

There are three methods of deduplication: inline, concurrent, and post-process. These options give storage administrators the ability to select the best method based on the characteristics and/or business value of the data being protected.

Inline Deduplication

Inline deduplication processes the backup stream as it arrives. Because no staging area is required, it reduces the disk space needed, but it also tends to be slower than a post-process approach, since every block must be hashed and looked up on the ingest path.
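A sketch of the inline path, assuming the same hash-and-store scheme as above: each block is fingerprinted and deduplicated the moment it is read from the stream, so raw data is never written to a staging area (the `inline_backup` name and in-memory `store` are illustrative):

```python
import hashlib

def inline_backup(stream, store: dict, block_size: int = 4096) -> list:
    """Inline deduplication: hash and deduplicate each block as it
    arrives from the backup stream, writing only unique blocks to the
    repository. No staging space is needed, but every block pays the
    hashing/lookup cost on the ingest path."""
    recipe = []
    while True:
        block = stream.read(block_size)
        if not block:
            break
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)    # store only if not seen before
        recipe.append(digest)
    return recipe
```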

Concurrent Deduplication

Concurrent deduplication is more accurately described as a “concurrent overlap.” Deduplication does not wait for all backup jobs to complete; rather, it begins as soon as the first virtual tape or file is completed. Meanwhile, other backup jobs continue to run concurrently with the deduplication process. One of the primary benefits of concurrent deduplication is more immediate replication. As soon as the deduplication process is complete, replication to the data center or to a disaster recovery site can be initiated, ensuring that the most critical data is available at all times.
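The overlap described above can be sketched as a producer/consumer pattern: backup jobs hand each completed virtual tape or file to a deduplication worker through a queue, so deduplication (and any follow-on replication) starts immediately rather than waiting for the whole backup window to finish. A minimal Python illustration (the names and in-memory store are hypothetical):

```python
import hashlib
import queue
import threading

def concurrent_backup(files, store: dict) -> dict:
    """Concurrent deduplication: enqueue each file as soon as it is
    fully written; a dedup worker processes it immediately while the
    remaining backup jobs keep running."""
    done = queue.Queue()
    recipes = {}

    def dedup_worker():
        while True:
            item = done.get()
            if item is None:               # sentinel: all jobs finished
                break
            name, data = item
            digest = hashlib.sha256(data).hexdigest()
            store.setdefault(digest, data)
            recipes[name] = digest         # this file is now ready to replicate

    worker = threading.Thread(target=dedup_worker)
    worker.start()
    for name, data in files:               # backup jobs completing one by one
        done.put((name, data))             # dedup starts without waiting for all jobs
    done.put(None)
    worker.join()
    return recipes
```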

Post-Process Deduplication

In post-process deduplication, also called offline deduplication, the deduplication process is performed independently of the backup process. The backup data is written to temporary disk space first, and the deduplication process then starts on a user-defined schedule. Deduplicated data is copied to the repository disk for long-term retention. In this fashion, backup speed is unaffected by deduplication workloads, and vice versa. An administrator can apply deduplication policies, export data to physical tape, and schedule deduplication to take place as a concurrent process or at a later point in time. This flexibility allows IT departments to maximize the efficiency of their operations while delivering reliable, predictable performance.
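The post-process split can be sketched as two separate phases: the backup path appends raw data to staging storage at full speed, and a scheduled job later deduplicates the staged data into the repository and frees the staging space. (A hedged sketch; function names and the in-memory structures are illustrative.)

```python
import hashlib

def backup_to_staging(stream, staging: list, block_size: int = 4096) -> None:
    """Backup path: write raw blocks straight to temporary staging
    storage. No hashing or lookups here, so backup speed is unaffected
    by deduplication workloads."""
    while True:
        block = stream.read(block_size)
        if not block:
            break
        staging.append(block)

def scheduled_dedup(staging: list, store: dict) -> list:
    """Later, on a user-defined schedule: deduplicate the staged blocks
    into the long-term repository, then release the staging space."""
    recipe = []
    for block in staging:
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)
        recipe.append(digest)
    staging.clear()                        # staging disk space is reclaimed
    return recipe
```

The trade-off relative to inline deduplication is visible here: staging temporarily holds full, undeduplicated copies, in exchange for keeping the ingest path free of hashing work.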
