Abstract: Deduplication-based techniques are popular in backup storage systems for reducing data volume. To maximize data reduction, existing fine-grained deduplication approaches not only eliminate duplicate chunks but also delta-compress non-duplicate chunks, storing each as a delta relative to a similar (base) chunk. However, a chunk may have multiple similar chunks, and delta compression usually selects only one of them as the base chunk, i.e., a one-to-one scheme. This scheme benefits restore performance because decompressing a delta chunk requires reading only one base chunk instead of several, but it wastes the potential compressibility among the other similar chunks. In this paper, we propose SuperDelta to further exploit compressibility across multiple similar chunks while preserving the restore performance advantage of the one-to-one scheme as much as possible. It is based on three techniques. (1) To further eliminate redundancy among similar chunks, SuperDelta applies a "Multiple Referenced Base Chunks" (MRBC) scheme instead of the one-to-one scheme. MRBC combines several similar pairs of chunks in delta encoding to recover compressibility that may otherwise be lost to the "boundary shift" problem. (2) To avoid the negative side effects of MRBC on restore performance, SuperDelta introduces a rebase scheme that rebuilds simple reference paths among duplicate and similar chunks. This significantly simplifies the restore workflow, but costs slightly more storage space because it affects the redundancy detection workflow. (3) To compensate for the additional storage cost, SuperDelta applies a space-recycle scheme that removes derived data once they become old, while preserving the optimized restore performance of the latest backups. Experiments on four real-world backup datasets show that SuperDelta increases the overall compression ratio by 1.05 to 2.40 times over traditional one-to-one fine-grained deduplication without significantly affecting backup and restore throughput.
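The trade-off between a single base chunk and multiple base chunks can be illustrated with a toy delta encoder. The sketch below is not SuperDelta's algorithm: it is a minimal greedy copy/insert encoder written only for illustration, and every name in it (`delta_encode`, `delta_decode`, `literal_bytes`, the window size `k`, and the sample chunks) is a hypothetical choice rather than something taken from the paper. It shows that a target chunk encoded against two bases may need fewer literal bytes than the same chunk encoded against one, at the price of reading more base chunks on restore.

```python
from typing import Dict, List, Tuple

# A delta is a list of instructions:
#   ("copy", base_id, offset, length)  -> copy bytes from one of the base chunks
#   ("insert", literal_bytes)          -> store bytes that no base chunk provides


def delta_encode(target: bytes, bases: List[bytes], k: int = 4) -> List[tuple]:
    """Greedy delta encoding of `target` against one or more base chunks."""
    # Index the first occurrence of every k-byte window across all bases.
    index: Dict[bytes, Tuple[int, int]] = {}
    for base_id, base in enumerate(bases):
        for off in range(len(base) - k + 1):
            index.setdefault(base[off:off + k], (base_id, off))

    delta: List[tuple] = []
    literal = bytearray()
    i = 0
    while i < len(target):
        hit = index.get(target[i:i + k]) if i + k <= len(target) else None
        if hit is None:
            literal.append(target[i])  # no match: byte becomes a literal
            i += 1
            continue
        base_id, off = hit
        base = bases[base_id]
        n = k
        # Extend the match forward for as long as both chunks agree.
        while (i + n < len(target) and off + n < len(base)
               and target[i + n] == base[off + n]):
            n += 1
        if literal:
            delta.append(("insert", bytes(literal)))
            literal.clear()
        delta.append(("copy", base_id, off, n))
        i += n
    if literal:
        delta.append(("insert", bytes(literal)))
    return delta


def delta_decode(delta: List[tuple], bases: List[bytes]) -> bytes:
    """Rebuild the target chunk; every referenced base must be read."""
    out = bytearray()
    for instr in delta:
        if instr[0] == "copy":
            _, base_id, off, n = instr
            out += bases[base_id][off:off + n]
        else:
            out += instr[1]
    return bytes(out)


def literal_bytes(delta: List[tuple]) -> int:
    """Bytes that had to be stored verbatim (a rough proxy for delta size)."""
    return sum(len(instr[1]) for instr in delta if instr[0] == "insert")


# Hypothetical sample chunks: the target overlaps with both bases.
base_a = b"the quick brown fox "
base_b = b"jumps over the lazy dog"
target = b"the quick brown fox jumps over the lazy dog"

one_to_one = delta_encode(target, [base_a])
multi_base = delta_encode(target, [base_a, base_b])

print("one base, literal bytes:", literal_bytes(one_to_one))
print("two bases, literal bytes:", literal_bytes(multi_base))
```

In this toy setting the two-base delta stores fewer literal bytes, but `delta_decode` then has to fetch both base chunks. That extra read cost is the restore-side penalty the abstract attributes to referencing multiple base chunks.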

Block Size Optimization in Deduplication Systems

Reducing Data Fragmentation in Data Deduplication Systems via Partial Repetition and Coding

Optimization for Data De-duplication Algorithm Based on Storage Environment Aware

A Novel Optimization Method to Improve De-duplication Storage System Performance

A Novel Chunk Coalescing Algorithm for Data Deduplication in Cloud Storage

Optimization for Data De-Duplication Algorithm Based on File Content

A Thorough Investigation of Content-Defined Chunking Algorithms for Data Deduplication

Towards Cluster-wide Deduplication Based on Ceph

A Fast Duplicate Chunk Identifying Method Based on Hierarchical Indexing Structure

Improved Deduplication Method based on Variable-Size Sliding Window

Improving Data Availability for Deduplication in Cloud Storage

Using Multi-Threads to Hide Deduplication I/O Latency with Low Synchronization Overhead

Accelerating Content-Defined-chunking Based Data Deduplication by Exploiting Parallelism

Leveraging Data Deduplication to Improve the Performance of Primary Storage Systems in the Cloud

Droplet: A Distributed Solution of Data Deduplication

Semantic data de-duplication for archival storage systems

The Design of Fast Content-Defined Chunking for Data Deduplication Based Storage Systems

Data De-Duplication with Adaptive Chunking and Accelerated Modification Identifying

Speed-Dedup: A New Deduplication Framework for Enhanced Performance and Reduced Overhead in Scale-Out Storage

SuperDelta: Multiple Referenced Base Chunks Scheme for Fine-grained Deduplication Backup Storage System

A Comprehensive Study of the Past, Present, and Future of Data Deduplication