DS-Dedupe: A scalable, low network overhead data routing algorithm for inline cluster deduplication system

Zhen Sun, Nong Xiao, Fang Liu, Yinjin Fu
DOI: https://doi.org/10.1109/ICCNC.2014.6785456
2014-01-01
Abstract: Inline cluster deduplication is widely used in data centers to improve storage efficiency. The data routing algorithm has a crucial impact on the deduplication factor, throughput, and scalability of a cluster deduplication system. In this paper, we propose a stateful data routing algorithm called DS-Dedupe. To make full use of the similarity in data streams, DS-Dedupe builds a similarity index at super-chunk granularity in each client to track the super-chunks that have already been routed. We then calculate a similarity coefficient from the index to determine whether a new super-chunk should be assigned directly or routed by a consistent hash, thereby striking a sensible tradeoff between deduplication factor and network overhead. Our experiments on two datasets demonstrate that DS-Dedupe achieves a high elimination ratio at a low communication overhead. In addition, because data routing is performed by the client nodes, a metadata server bottleneck is avoided.
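The routing decision described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the similarity coefficient is computed, so the sketch assumes it is the fraction of a super-chunk's representative fingerprints that the client-side index already maps to a single node. The names `ClientRouter`, `ConsistentHashRing`, `threshold`, and `k` are hypothetical, as is the choice of the `k` smallest chunk fingerprints as representatives.

```python
import hashlib
from bisect import bisect_right
from collections import Counter


def fingerprint(data: bytes) -> int:
    """Hash bytes to a 64-bit integer (stand-in for a chunk fingerprint)."""
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


class ConsistentHashRing:
    """Plain consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=64):
        self._ring = sorted(
            (fingerprint(f"{node}#{i}".encode()), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._keys = [key for key, _ in self._ring]

    def lookup(self, key: int):
        # First ring position clockwise from the key, wrapping around.
        idx = bisect_right(self._keys, key) % len(self._ring)
        return self._ring[idx][1]


class ClientRouter:
    """Client-side stateful router (assumed design, see lead-in)."""

    def __init__(self, nodes, threshold=0.5, k=4):
        self.ring = ConsistentHashRing(nodes)
        self.threshold = threshold  # assumed cut-off for the coefficient
        self.k = k                  # assumed number of representatives
        self.index = {}             # representative fingerprint -> node

    def route(self, chunk_fps):
        if not chunk_fps:
            raise ValueError("super-chunk has no chunk fingerprints")
        # Assumption: the k smallest fingerprints represent the super-chunk.
        reps = sorted(set(chunk_fps))[: self.k]
        votes = Counter(self.index[fp] for fp in reps if fp in self.index)
        node = None
        if votes:
            best, hits = votes.most_common(1)[0]
            # Assumed similarity coefficient: share of known representatives.
            if hits / len(reps) >= self.threshold:
                node = best  # direct assignment to the similar node
        if node is None:
            node = self.ring.lookup(reps[0])  # fall back to consistent hash
        for fp in reps:
            self.index[fp] = node  # remember where this super-chunk went
        return node


nodes = ["node0", "node1", "node2", "node3"]
router = ClientRouter(nodes)
super_chunk = [fingerprint(bytes([i])) for i in range(16)]
first = router.route(super_chunk)
again = router.route(super_chunk)
near_duplicate = router.route(super_chunk + [fingerprint(b"new chunk")])
```

In this sketch an unseen super-chunk falls through to the consistent hash, while a repeated or near-duplicate super-chunk is sent straight to the node that already holds similar data. Because the index lives in the client, the decision itself needs no message to a metadata server.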