Abstract: Data deduplication is an effective technique for reducing storage cost in cloud storage systems, especially now that massive data volumes have become the norm in the era of Big Data. Primary storage, as the layer that interacts directly with service users, has benefited from deduplication technologies because of its high manufacturing cost. However, because primary storage is constantly accessed by users, its workloads are mostly latency-sensitive. This characteristic makes it challenging to develop deduplication schemes for primary storage that are both performance-efficient and space-efficient. Existing deduplication schemes for primary storage pay little attention to achieving desirable space savings while keeping the inherent performance penalty small.

In this paper, we propose SAUD, a Semantics-Aware and Utility-Driven deduplication framework that provides high space savings with a minor performance penalty for primary storage. SAUD delivers performance-oriented deduplication by quantitatively leveraging the file-level semantics of primary storage: it calculates a deduplication priority for files with diverse semantics and uses these priorities to guide deduplication. Moreover, SAUD operates in a selective-on mode, dynamically regulating the deduplication process based on the real-time workload and system status, which further reduces the side effects on system performance. Comprehensive evaluations show that SAUD outperforms all compared schemes in system read performance by an average of 54.6%. SAUD achieves 82.1% of the space efficiency of the most space-oriented scheme, whose read performance falls behind that of SAUD by as much as 80.1%.

GLE-Dedup: A Globally–Locally Even Deduplication by Request-Aware Placement for Better Read Performance

A Delayed Container Organization Approach to Improve Restore Speed for Deduplication Systems.

PLC-cache: Endurable SSD Cache for Deduplication-Based Primary Storage

Speed-Dedup: A New Deduplication Framework for Enhanced Performance and Reduced Overhead in Scale-Out Storage

Endurable SSD-Based Read Cache for Improving the Performance of Selective Restore from Deduplication Systems

An Optimized Learning-Based Directory Placement Policy with Two-Rounds Selection in Distributed File Systems

Try Managing Your Deduplication Fine-Grained-ly: A Multi-tiered and Dynamic SLA-Driven Deduplication Framework for Primary Storage.

LDPP: A Learned Directory Placement Policy in Distributed File Systems.

A Novel Optimization Method to Improve De-duplication Storage System Performance

D3: A Dynamic Dual-Phase Deduplication Framework for Distributed Primary Storage.

Sliding Look-Back Window Assisted Data Chunk Rewriting For Improving Deduplication Restore Performance

Towards Cluster-wide Deduplication Based on Ceph

DIODE: Dynamic Inline-Offline DE Duplication Providing Efficient Space-Saving and Read/Write Performance for Primary Storage Systems.

SAUD: Semantics-Aware and Utility-Driven Deduplication Framework for Primary Storage.

A Focused Garbage Collection Approach for Primary Deduplicated Storage with Low Memory Overhead

AdaptMD: Balancing Space and Performance in NUMA Architectures with Adaptive Memory Deduplication

Ss-Dedup: A High Throughput Stateful Data Routing Algorithm For Cluster Deduplication System

PeerDedupe: Insights into the Peer-Assisted Sampling Deduplication.

MUSE: A Multi-Tierd and SLA-Driven Deduplication Framework for Cloud Storage Systems

Low‐overhead inline deduplication for persistent memory

SORD: a new strategy of online replica deduplication in Cloud-P2P