Demystifying Data Deduplication

Abstract

FalconStor’s global data deduplication capability remains a core differentiator that drives out cost and accelerates the backup and archival process. StorSafe receives backups from many sources, then chunks, hashes, and reduces them so that they can be stored cheaply on-premises or in cloud object storage. While every data set is different, our customers experience global deduplication rates of up to 95%, which can drive costs down by as much as 90%.

INTRODUCTION

Since our founding, FalconStor has led the way in deduplication technology, as evidenced by our portfolio of 9 deduplication patents issued from 2006 through 2019. Not all dedupe technologies are alike. Our sliding window technique and our knowledge of the backup set formats emitted by each enterprise backup and restore vendor have kept us in the lead for over 15 years running in backup throughput, data reduction, and restore times.

KEY CRITERIA FOR A ROBUST DEDUPLICATION SOLUTION

There are several important criteria to consider when evaluating a deduplication solution:

1. COMPATIBILITY WITH CURRENT ENVIRONMENT

An effective deduplication solution should be as non-disruptive as possible. The ideal solution integrates with an organization’s existing backup environment, has flexible deployment options to provide global coverage across the data center as well as branch and remote offices, and enables administrators to tune the technology to the exact needs of their datasets.

Most companies have turned to backup-to-disk targets, including virtual tape libraries (VTLs) like StorSafe, to hold on-premises backups instead of relying on tape libraries. With disk comes the potential to dramatically reduce the size of backup data through deduplication techniques without having to make significant changes to policies, procedures, or software.

Because StorSafe is software rather than a pre-built appliance, it scales from a single VM up to 9 industry-standard servers. StorSafe can draw its storage from any disk array, object storage, or cloud object storage. There is never any hardware vendor lock-in, so customers can select the server and storage types and providers that best fit their needs.

2. IMPACT OF DEDUPLICATION ON BACKUP AND RESTORE PERFORMANCE

Inline deduplication processes the backup stream as it comes in. This reduces the storage required because inline does not require a staging area, but it also tends to be slower than a post-processing approach. FalconStor offers several dedupe options (inline, post-process, concurrent, or none) so that storage administrators can select the best method based on the characteristics and/or business value of the data being protected.
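The trade-off between the two approaches can be made concrete with a minimal sketch. It is an illustration only, not StorSafe's implementation: the function names, the in-memory dictionary standing in for the chunk store, and the use of SHA-256 as the chunk fingerprint are all assumptions.

```python
import hashlib

def chunk_id(chunk: bytes) -> str:
    # Hypothetical fingerprint; any strong hash would serve the same role.
    return hashlib.sha256(chunk).hexdigest()

def inline_ingest(stream, store: dict) -> int:
    """Deduplicate each chunk as it arrives. Returns bytes held in staging."""
    for chunk in stream:
        # Hashing sits on the ingest path, which is why inline can be slower.
        store.setdefault(chunk_id(chunk), chunk)
    return 0  # nothing is staged

def post_process_ingest(stream, store: dict) -> int:
    """Land the raw backup first, then deduplicate in a later pass."""
    staging = list(stream)  # fast landing, but a full-size temporary copy
    staged_bytes = sum(len(c) for c in staging)
    for chunk in staging:   # dedupe pass runs after the backup window
        store.setdefault(chunk_id(chunk), chunk)
    staging.clear()         # staging space is released afterwards
    return staged_bytes

backup = [b"x" * 1024, b"y" * 1024, b"x" * 1024]
inline_store, post_store = {}, {}
print(inline_ingest(backup, inline_store))      # 0 bytes staged
print(post_process_ingest(backup, post_store))  # 3072 bytes staged
print(len(inline_store), len(post_store))       # 2 2
```

Both paths end with the same two unique chunks stored; they differ in where the hashing cost lands and in how much temporary capacity the backup needs.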

Some technologies are good at deduplicating data but perform much slower when it comes to rebuilding data (often referred to as “re-inflating” data). If you are testing competing solutions, you need to know how long it will take to restore a large database or full system. Ask the solution provider to explain how they can ensure reasonable restore speeds. Compare backup and restore performance metrics for the vendors you are considering.

3. SCALABILITY

Scalability is an essential consideration, particularly in terms of capacity and performance, since data growth continues unabated in many organizations and data archival continues to grow.

Depending on your growth expectations, the volume of data you need to protect today may well grow many times over in five or ten years. Also consider how much data you will want to keep on disk for fast access versus how much you will export to the public cloud, on-premises storage, or tape during this timeframe. All of these considerations make the scalability of your deduplication capabilities essential.

The ideal deduplication solution should have an architecture that allows economic “right-sizing” for both the initial implementation and the long-term growth of the system. In addition to clustering, the solution should make it easy to add nodes or storage as data volumes grow, without requiring forklift upgrades to handle new capacities. While purpose-built backup appliances with deduplication capabilities have been a great option for organizations to realize the immediate benefits of data reduction, they have been notorious for requiring disruptive retirement and implementation cycles to handle increases in capacity.

4. SUPPORTING THE DISTRIBUTED ENTERPRISE

As organizations grow, their data operations naturally evolve from a centralized to a distributed operational model, bringing with it a proliferation of additional sites. Deduplication solutions that address this changing environment by combining global deduplication for all incoming backup streams with replication for data security drive down cost.

For example, a company with a corporate headquarters, regional offices, and a secure disaster recovery (DR) facility requires deduplication in its regional offices to facilitate efficient local storage, but also replication to a centralized site and, increasingly, the cloud. Replicating or exporting duplicate data chokes networks, wastes resources, and squanders time. So again, the right deduplication solution must fit the operations of the evolving enterprise.
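To show why deduplication and replication work well together, the sketch below replicates only the chunks a central site does not already hold. It is a simplified illustration under stated assumptions: each site's repository is modeled as a dictionary keyed by chunk fingerprint, and `replicate` and `put` are hypothetical helper names, not part of any FalconStor interface.

```python
import hashlib

def replicate(source: dict, target: dict) -> int:
    """Send only the chunks the target lacks; return the bytes transferred."""
    missing = source.keys() - target.keys()  # compare fingerprints, not data
    for digest in missing:
        target[digest] = source[digest]
    return sum(len(source[d]) for d in missing)

def put(store: dict, chunk: bytes) -> None:
    # Hypothetical helper: file a chunk under its SHA-256 fingerprint.
    store[hashlib.sha256(chunk).hexdigest()] = chunk

regional, central = {}, {}
for chunk in (b"os-image" * 512, b"payroll" * 512):
    put(regional, chunk)
put(central, b"os-image" * 512)  # already replicated from another office

print(replicate(regional, central))  # 3584: only the payroll chunk is sent
print(replicate(regional, central))  # 0: a second pass has nothing to send
```

Only fingerprints need to be compared to decide what crosses the network, so data that the central site has already received from any office is never sent again.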

5. HIGH AVAILABILITY

Since a large amount of data is consolidated in one location, risk tolerance for data loss is exceptionally low. Therefore, access to the deduplicated data repository is critical and should not be vulnerable to a single point of failure.

A robust deduplication solution will include mirroring to protect against local storage failure as well as replication to protect against disaster. The solution should have failover capabilities in the event of a node failure. Even if multiple nodes in a cluster fail, the company must be able to continue to recover its data and maintain ongoing business operations.

6. EFFICIENCY AND EFFICACY

File-based deduplication approaches do not reduce storage capacity requirements as much as those that analyze data at a sub-file or block level. Most sub-file deduplication processes use some sort of “chunking” method to break up a large amount of data into smaller sized pieces to search for duplicate data.
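A minimal sketch of sub-file deduplication follows, assuming fixed-size chunks and SHA-256 fingerprints. The 4 KiB chunk size, the function names, and the in-memory store are illustrative choices, not a description of any particular product.

```python
import hashlib

def dedupe_fixed(data: bytes, chunk_size: int = 4096):
    """Split data into fixed-size chunks and keep one copy of each unique chunk."""
    store = {}   # fingerprint -> chunk bytes (unique chunks only)
    recipe = []  # ordered fingerprints needed to rebuild the original stream
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)  # store the chunk only if it is new
        recipe.append(digest)
    return store, recipe

def restore(store: dict, recipe: list) -> bytes:
    """Rebuild ("re-inflate") the original data from its recipe."""
    return b"".join(store[d] for d in recipe)

# Four 4 KiB chunks, three of them identical.
backup = b"A" * 8192 + b"B" * 4096 + b"A" * 4096
store, recipe = dedupe_fixed(backup)
print(len(recipe), len(store))  # 4 2
assert restore(store, recipe) == backup
```

A file-level approach would have stored this backup whole, because the file as a unit is unique; at the chunk level, only two of its four chunks need to be kept.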

Larger chunks of data can be processed at a faster rate, but they yield lower rates of duplicate detection and data reduction. Smaller chunks expose more duplication, but the overhead to scan the data is higher. Some solutions even adjust chunk size based on information gleaned from the data formats. The right combination of these techniques can lead to a 30 to 40% increase in the amount of duplicate data detected and deliver significant storage savings.
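Chunk boundaries can also be derived from the data itself rather than from fixed offsets. The sketch below shows generic content-defined chunking with a rolling hash over a sliding window. It is not FalconStor's patented technique, and the window size, hash base, modulus, and boundary mask are illustrative assumptions. Production systems typically also enforce minimum and maximum chunk sizes, which are omitted here for clarity.

```python
import random

def cdc_chunks(data: bytes, window: int = 48, mask: int = 0x3FF):
    """Cut a chunk wherever the hash of the last `window` bytes matches the mask."""
    BASE, MOD = 257, (1 << 61) - 1
    top = pow(BASE, window - 1, MOD)  # weight of the byte leaving the window
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= window:
            h = (h - data[i - window] * top) % MOD  # drop the oldest byte
        h = (h * BASE + byte) % MOD                 # add the newest byte
        if i >= window - 1 and (h & mask) == 0:     # boundary set by content
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])                 # trailing partial chunk
    return chunks

def fixed_chunks(data: bytes, size: int = 1024):
    return [data[i:i + size] for i in range(0, len(data), size)]

rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(64 * 1024))
shifted = b"\x00" + original  # one byte inserted at the front

for name, split in (("fixed", fixed_chunks), ("content-defined", cdc_chunks)):
    before, after = set(split(original)), set(split(shifted))
    print(name, "chunks still shared after the insert:", len(before & after))
```

Because each boundary depends only on the bytes inside the window, inserting a byte shifts the boundaries along with the data, and every chunk after the first is found again. With fixed offsets, the same one-byte insert changes the contents of every chunk that follows it.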

SUMMARY: FOCUS ON THE TOTAL SOLUTION

Data deduplication is the best way to dramatically reduce data volumes, slash storage requirements, and minimize data protection costs and risks. No matter the approach, the amount of data deduplication that can occur is driven by the nature of the data and the policies used to protect it.

FalconStor technology is unique in that it allows users to select the method best suited to their data recovery and protection criteria. The user has the choice of inline, post-process, or concurrent deduplication, or no deduplication at all.

FalconStor® is a registered trademark of FalconStor Software, Inc. in the United States and other countries.

All other brand and product names are trademarks or registered trademarks of their respective owners.

FalconStor Software reserves the right to make changes in the information contained in this publication without prior notice. The reader should in all cases consult FalconStor to determine whether any such changes have been made. FDMPWP052019
