Zero-Chunk: An Efficient Cache Algorithm to Accelerate the I/O Processing of Data Deduplication
Hongyuan Gao, Chentao Wu, Jie Li, Minyi Guo
DOI: https://doi.org/10.1109/ICPADS.2016.0089
2016-01-01
Abstract: Data deduplication is a technique for eliminating duplicate copies of data. It saves storage space and reduces the number of disk I/Os, which in turn improves system performance. Several popular deduplication algorithms exist, such as SISL [30], Extreme Binning [1], and Sparse Indexing [14]. These schemes use containers to aggregate data chunks for better performance. However, they suffer from either low cache hit ratios or inefficient cache utilization. To address this problem, we design Zero-Chunk, a new cache algorithm that balances cache hit ratio against memory usage. In our method, we select chunks whose fingerprints have all-zero remainders as pointers (called zero chunks) and aggregate the chunks that follow each zero chunk into its corresponding container. When access patterns change, our method evicts cold data chunks and containers to keep overhead low. To demonstrate the effectiveness of Zero-Chunk, we conduct several simulations. The results show that, compared to Sparse Indexing (the most popular implementation method in data deduplication), Zero-Chunk improves the cache hit ratio by up to 5.2%, reduces memory consumption by more than 50.7%, and decreases the total number of I/Os by up to 17.3%.
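The zero-chunk selection and container grouping described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the fingerprint function, the divisor used to compute remainders, or the eviction policy, so SHA-1, `DIVISOR`, and the recency-based `evict_cold` are all assumptions, and every function name here is hypothetical.

```python
import hashlib
from collections import OrderedDict

# Assumption: the abstract does not give the divisor; 4 is only illustrative.
DIVISOR = 4


def fingerprint(chunk: bytes) -> int:
    """Assumed fingerprint: the SHA-1 digest of the chunk, read as an integer."""
    return int.from_bytes(hashlib.sha1(chunk).digest(), "big")


def is_zero_chunk(fp: int, divisor: int = DIVISOR) -> bool:
    """A chunk is a zero chunk when its fingerprint leaves a zero remainder."""
    return fp % divisor == 0


def group_into_containers(chunks, divisor=DIVISOR, fp_fn=fingerprint):
    """Map each zero-chunk fingerprint to the fingerprints that follow it.

    Chunks seen before the first zero chunk are kept under the key None
    (a choice made for this sketch, not taken from the paper).
    """
    containers = OrderedDict()
    current = None  # fingerprint of the most recent zero chunk
    for chunk in chunks:
        fp = fp_fn(chunk)
        if is_zero_chunk(fp, divisor):
            current = fp
            containers.setdefault(current, [])
            containers.move_to_end(current)  # mark as most recently used
        else:
            containers.setdefault(current, []).append(fp)
    return containers


def evict_cold(containers, capacity):
    """Drop least recently used containers until at most `capacity` remain.

    Recency-based eviction is an assumption standing in for the paper's
    unspecified cold-data policy. Returns the evicted keys.
    """
    evicted = []
    while len(containers) > capacity:
        key, _ = containers.popitem(last=False)
        evicted.append(key)
    return evicted
```

For example, passing integers as stand-in fingerprints with `fp_fn=lambda x: x`, the stream `[8, 3, 5, 12, 7]` with divisor 4 yields two containers: one anchored at 8 holding 3 and 5, and one anchored at 12 holding 7. A larger divisor would produce fewer zero chunks and therefore fewer, larger containers.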