A Focused Garbage Collection Approach for Primary Deduplicated Storage with Low Memory Overhead

Jingsong Yuan, Xiangyu Zou, Han Xu, Zhichao Cao, Shiyi Li, Wen Xia, Peng Wang, Li Chen
DOI: https://doi.org/10.1109/ICCD56317.2022.00053
2022-01-01
Abstract: Because a single chunk can be shared by many files after data deduplication, Garbage Collection (GC) is an essential but complex task for reclaiming stale chunks in large-scale primary deduplication systems. Traditional Mark&Sweep is widely used, but its Mark phase suffers from ever-increasing traversal time and from the large memory overhead of the Liveness Array (i.e., a data structure that records the liveness of chunks). This paper proposes a new method, Focused Garbage Collection (FGC), that significantly accelerates the Mark phase for primary deduplication storage. Specifically, we design a global Austere Reference Graph with low memory cost that efficiently represents files’ reference relationships (i.e., the chunks they share after deduplication) by exploiting the deduplication characteristics of workloads in primary systems. The Austere Reference Graph lets FGC focus on the deleted files and their correlated files to mark stale chunks quickly, whereas traditional approaches must traverse all files. Consequently, FGC greatly reduces both the traversal time and the Liveness Array size in the Mark phase. Evaluation results show that, compared with traditional Mark&Sweep, FGC reduces Mark-phase time by 1.3×-7.34× in a stand-alone primary deduplication system and reduces Mark-phase network traffic by 128×-256×, while introducing less than 0.05% extra memory overhead for the reference graph.
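The contrast between traversing all files and focusing on deleted files can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not describe the internal layout of the Austere Reference Graph, so the sketch assumes a plain file-level adjacency graph in which two files are linked when they share at least one chunk. All names (`full_mark`, `build_reference_graph`, `focused_mark`) and the in-memory dictionaries are hypothetical.

```python
from collections import defaultdict


def full_mark(files):
    """Baseline Mark phase: traverse every remaining file to collect live chunks."""
    live = set()
    for chunks in files.values():
        live.update(chunks)
    return live


def build_reference_graph(files):
    """Assumed graph form: link two files if they share at least one chunk."""
    owners = defaultdict(set)  # chunk id -> files that reference it
    for name, chunks in files.items():
        for chunk in chunks:
            owners[chunk].add(name)
    graph = defaultdict(set)  # file -> files sharing a chunk with it
    for sharing in owners.values():
        for name in sharing:
            graph[name] |= sharing - {name}
    return graph


def focused_mark(files, graph, deleted):
    """Find stale chunks by visiting only deleted files and their graph neighbours."""
    deleted = set(deleted)
    # Only chunks of deleted files can become stale.
    candidates = set()
    for name in deleted:
        candidates.update(files[name])
    # Any surviving file that still references a candidate must be a neighbour.
    neighbours = set()
    for name in deleted:
        neighbours |= graph[name]
    neighbours -= deleted
    still_live = set()
    for name in neighbours:
        still_live.update(candidates.intersection(files[name]))
    return candidates - still_live


# Toy workload: file "a" shares chunk c2 with "b"; "c" is unrelated.
files = {"a": ["c1", "c2"], "b": ["c2", "c3"], "c": ["c4"]}
graph = build_reference_graph(files)
stale = focused_mark(files, graph, deleted=["a"])
```

In this toy workload, deleting `a` makes the focused pass read only `a` and its neighbour `b`, while the baseline would traverse every remaining file. Both arrive at the same stale set under the sharing assumption above, because a surviving file that references a candidate chunk is by construction a neighbour of the deleted file.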