Distributed Backup Data Deduplication System Based on Data Routing

Min YAO, Jianwei YIN, Yan TANG, Zhiling LUO
DOI: https://doi.org/10.3969/j.issn.1000-3428.2017.02.015
2017-01-01
Abstract: In big data scenarios, traditional deduplicating backup systems suffer from shortcomings such as large backup storage requirements and insufficient data throughput. To address these shortcomings, this paper designs a distributed backup data deduplication system based on data routing. The system uses the data chunk as its deduplication granularity, and its functions include data routing and data prefetching. Data routing uses a Bloom filter to look up the data chunks to be processed, and data chunks are prefetched using average sampling and neighbor sampling based on Jaccard distance. The system uses data routing to assign data chunks to the corresponding processing nodes. The hash codes of data chunks obtained through average sampling provide the routing information for data routing, and the hash codes of data chunks obtained through neighbor sampling are used for the system's first deduplication pass. Experimental results show that, compared with querying all processing nodes and with fixed data routing, the system significantly increases data throughput while maintaining the deduplication ratio.
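
The routing idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the sampling rule, the Bloom filter parameters, or the routing policy, so everything below is an assumption made for illustration. In particular, `average_sample` is assumed to pick fingerprints at a fixed interval, `route` is assumed to pick the node whose Bloom filter matches the most sampled fingerprints, and the names `BloomFilter`, `fingerprint`, `average_sample`, `jaccard_distance`, and `route` are hypothetical.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter over byte strings (parameters are illustrative)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def fingerprint(chunk):
    """Hash code of a data chunk (SHA-1 is an assumed choice)."""
    return hashlib.sha1(chunk).digest()


def average_sample(fingerprints, step):
    """Assumed form of average sampling: every step-th fingerprint."""
    return fingerprints[::step]


def jaccard_distance(a, b):
    """Jaccard distance between two collections of fingerprints."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def route(fingerprints, node_filters, step=2):
    """Assumed policy: send the chunks to the node with the most Bloom hits."""
    sample = average_sample(fingerprints, step)
    hits = [sum(fp in bf for fp in sample) for bf in node_filters]
    # Ties resolve to the lowest node index.
    return max(range(len(node_filters)), key=lambda n: hits[n])


# Usage: two processing nodes, each summarizing its stored chunks in a filter.
node_filters = [BloomFilter(), BloomFilter()]
for chunk in (b"x", b"y", b"z"):
    node_filters[0].add(fingerprint(chunk))
for chunk in (b"a", b"b", b"c", b"d"):
    node_filters[1].add(fingerprint(chunk))

incoming = [fingerprint(c) for c in (b"a", b"b", b"c", b"d")]
target_node = route(incoming, node_filters)
```

Under these assumptions, only the sampled fingerprints are checked against each node's filter, so a routing decision costs a few in-memory lookups per node rather than a full index query on every processing node.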