A Thorough Investigation of Content-Defined Chunking Algorithms for Data Deduplication

Marcel Gregoriadis, Leonhard Balduf, Björn Scheuermann, Johan Pouwelse
2024-09-29
Abstract: Data deduplication emerged as a powerful solution for reducing storage and bandwidth costs in cloud settings by eliminating redundancies at the level of chunks. This has spurred the development of numerous Content-Defined Chunking (CDC) algorithms over the past two decades. Despite advancements, the current state-of-the-art remains obscure, as a thorough and impartial analysis and comparison is lacking. We conduct a rigorous theoretical analysis and impartial experimental comparison of several leading CDC algorithms. Using four realistic datasets, we evaluate these algorithms against four key metrics: throughput, deduplication ratio, average chunk size, and chunk-size variance. Our analyses, in many instances, extend the findings of their original publications by reporting new results and putting existing ones into context. Moreover, we highlight limitations that have previously gone unnoticed. Our findings provide valuable insights that inform the selection and optimization of CDC algorithms for practical applications in data deduplication.
Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
The paper addresses the lack of a comprehensive and fair evaluation of existing Content-Defined Chunking (CDC) algorithms. Although many CDC algorithms have been proposed over the past two decades, a systematic theoretical analysis and experimental comparison of them is still missing. Specifically, the paper aims to:

1. **Conduct a comprehensive theoretical analysis**: Rigorously analyze several leading CDC algorithms.
2. **Compare the algorithms fairly in experiments**: Run an unbiased experimental comparison on four real-world datasets.
3. **Evaluate key metrics**: Measure the algorithms against four key metrics: throughput, deduplication ratio, average chunk size, and chunk-size variance.
4. **Reveal the limitations of existing research**: Point out limitations that previous research did not notice, and compare the new results with those in the existing literature.

### Specific problem description

In the era of big data, cloud storage systems have become indispensable for managing the explosive growth of digital information. Because storage is costly, these systems require efficient data reduction techniques. At the same time, the growth of the Internet of Things has highlighted the importance of reducing data transmission between edge devices and cloud servers. In large-scale systems, content often repeats as data accumulates, which wastes bandwidth and storage.

**Data deduplication** reduces storage and bandwidth costs by eliminating redundant content at the chunk level. Files are divided into chunks, and each chunk is indexed and identified by its cryptographic fingerprint. A file can then be represented as a sequence of such fingerprints, so that duplicate chunks only need to be stored or transmitted once.
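The fingerprint-based scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names (`fingerprint`, `store_file`, `restore_file`), the use of SHA-256, the in-memory `dict` as chunk store, and the fixed-size split are all assumptions chosen to keep the example short.

```python
import hashlib


def fingerprint(chunk: bytes) -> str:
    # Assumption: SHA-256 serves as the cryptographic fingerprint.
    return hashlib.sha256(chunk).hexdigest()


def store_file(data: bytes, chunk_size: int, store: dict) -> list:
    """Split data into chunks, keep each unique chunk once, and
    return the file's recipe (its ordered list of fingerprints)."""
    recipe = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        fp = fingerprint(chunk)
        store.setdefault(fp, chunk)  # a duplicate chunk is not stored again
        recipe.append(fp)
    return recipe


def restore_file(recipe: list, store: dict) -> bytes:
    # Rebuild the file by looking up every fingerprint in order.
    return b"".join(store[fp] for fp in recipe)


chunk_store = {}
payload = b"ABCD" * 3 + b"WXYZ"
file_recipe = store_file(payload, 4, chunk_store)
# The recipe holds 4 fingerprints, but only 2 distinct chunks are stored.
print(len(file_recipe), len(chunk_store))
```

The recipe is what would be kept or transmitted per file, while the chunk store is shared across all files. A real system would use a persistent index instead of a dictionary.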
Since a fingerprint is much smaller than the content it represents, this method is very effective in systems with a large amount of redundant content. However, the chunking algorithm strongly affects how well deduplication works. The simplest method is Fixed-Size Chunking (FSC), which divides files into equal-sized chunks. FSC suffers from the boundary-shift problem: when two files share similar content but bytes have been inserted or deleted in one of them, the chunk boundaries after that point are misaligned and the redundancy goes undetected. Content-Defined Chunking (CDC) algorithms solve this problem by generating variable-sized chunks based on content rather than position.

Many CDC algorithms have been proposed over the years, each claiming higher efficiency, lower chunk-size variance, or better deduplication, yet a comprehensive and unbiased evaluation of these methods is still lacking. Each study usually shows only the superiority of its own algorithm, and the datasets and assumptions used often favor that algorithm, which makes it difficult to identify the actual state of the art in CDC. The paper fills this gap with a rigorous evaluation of multiple CDC algorithms, providing insight into their capabilities and limitations and guidance for selecting and optimizing them in practical applications.

### Main contributions of the paper

1. **Comprehensive evaluation**: Evaluates multiple CDC algorithms, including Rabin, Buzhash, Gear, AE, RAM, MII, PCI, and BFBC.
2. **Derivation of new formulas**: Derives new formulas that relate algorithm parameters to the expected average chunk size, especially for AE, RAM, MII, and BFBC.
3. **Improvement of existing formulas**: Improves the existing formula for the AE algorithm.
4. **Experimental verification**: Runs experiments on four real-world datasets, covering throughput, average chunk size and variance, and deduplication ratio.
5. **New findings**: Reports new experimental results and compares them with results in the existing literature, revealing previously unnoticed limitations.

Through this work, the paper offers a new perspective on the capabilities and limitations of CDC algorithms and a useful reference for selecting and optimizing them in practice.
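To make the boundary-shift problem concrete, the following sketch compares FSC with a simplified Gear-style rolling-hash chunker after a single byte is prepended to a file. It is an illustration under stated assumptions, not any of the evaluated algorithms as published: the random lookup table `GEAR`, the 32-bit hash, the cut condition on the top 6 bits (targeting an average chunk size of roughly 64 bytes), the synthetic random input, and all function names are hypothetical choices. Minimum and maximum chunk sizes, which practical implementations usually enforce, are omitted.

```python
import hashlib
import random

# Assumption: a fixed pseudo-random table maps each byte value to 32 bits.
_table_rng = random.Random(42)
GEAR = [_table_rng.getrandbits(32) for _ in range(256)]


def fixed_size_chunks(data: bytes, size: int = 64) -> list:
    # FSC: boundaries depend only on position.
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data: bytes, bits: int = 6) -> list:
    # Gear-style rolling hash: each step shifts out old bytes, so the
    # 32-bit hash depends only on the most recent 32 bytes of content.
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        if h >> (32 - bits) == 0:  # cut when the top `bits` bits are zero
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing chunk without a cut point
    return chunks


def shared_fraction(chunker, old: bytes, new: bytes) -> float:
    # Fraction of the new file's chunks already present in the old file.
    old_fps = {hashlib.sha256(c).digest() for c in chunker(old)}
    new_chunks = chunker(new)
    hits = sum(hashlib.sha256(c).digest() in old_fps for c in new_chunks)
    return hits / len(new_chunks)


data_rng = random.Random(7)
original = bytes(data_rng.getrandbits(8) for _ in range(64 * 1024))
modified = b"!" + original  # one inserted byte shifts all later content

fsc_shared = shared_fraction(fixed_size_chunks, original, modified)
cdc_shared = shared_fraction(content_defined_chunks, original, modified)
print(f"FSC shared: {fsc_shared:.2%}, CDC shared: {cdc_shared:.2%}")
```

With FSC, every chunk after the insertion contains shifted content, so almost nothing matches. With the content-defined chunker, cut points are decided by the last 32 bytes of content, so boundaries realign shortly after the inserted byte and nearly all chunks are recognized as duplicates.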