Privacy-preserving Data Compression Scheme for K-Anonymity Model Based on Huffman Coding

YU Yue, LIN Xianzheng, LI Weihai, YU Nenghai
DOI: https://doi.org/10.11959/j.issn.2096-109x.2023054
2023-01-01
Abstract: The k-anonymity model is widely used as a data anonymization technique for privacy protection during the data release phase. With the advent of the big data era, however, the generation of vast amounts of data poses challenges to data storage. Expanding storage space indefinitely through hardware upgrades is not feasible, since memory is costly and storage space is limited. Data compression techniques can therefore be used to reduce storage costs and communication overhead. To reduce the storage space of the data generated by anonymization techniques in the data publishing phase, a compression scheme was proposed for both the original data and the anonymized data of the k-anonymity model. For the original data, the difference between the original data and the anonymized data was calculated according to the set rules and the pre-defined generalization level, and Huffman coding compression was applied to the difference data according to its frequency characteristics. By storing the difference data, the original data can be obtained indirectly, which reduces the storage space of the original data. For the anonymized data, the data usually have high repeatability as a result of the generalization rules of the model or the pre-defined generalization hierarchy relations: the larger the value of k, the more generalized and repetitive the anonymized data become. Huffman coding compression was therefore designed and implemented for the anonymized data to reduce storage space. Experimental results show that the proposed scheme significantly reduces the compression rate of both the original data and the anonymized data of the k-anonymity model. Across five models and various k-value settings, the proposed scheme reduces the compression rate of raw and anonymized data by 72.2% and 64.2% on average, respectively, compared to the Windows 11 zip tool.
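The two ideas in the abstract, Huffman-coding the repetitive anonymized values and storing only a Huffman-coded difference from which the original can be recovered, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The toy age column, the 10-year generalization intervals, and the difference rule (offset of each value from the lower bound of its interval) are all assumptions made for illustration, since the abstract does not specify the paper's "set rules". The function names are hypothetical.

```python
import heapq
from collections import Counter


def huffman_code(symbols):
    """Build a prefix code (symbol -> bit string) from symbol frequencies."""
    freq = Counter(symbols)
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: partial code}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)  # two least frequent subtrees
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def encode(symbols, code):
    """Concatenate the code words of all symbols into one bit string."""
    return "".join(code[s] for s in symbols)


def decode(bits, code):
    """Decode greedily; this works because Huffman codes are prefix-free."""
    inverse = {v: k for k, v in code.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return out


# Hypothetical toy column: ages generalized to 10-year intervals.
original_ages = [23, 27, 21, 35, 31, 23]
anonymized = ["20-29", "20-29", "20-29", "30-39", "30-39", "20-29"]

# Assumed difference rule: offset of each age from its interval's lower bound.
offsets = [age - int(rng.split("-")[0])
           for age, rng in zip(original_ages, anonymized)]

# Compress the anonymized column and the difference data separately.
anon_code = huffman_code(anonymized)
anon_bits = encode(anonymized, anon_code)
diff_code = huffman_code(offsets)
diff_bits = encode(offsets, diff_code)

# Recover the original data indirectly: anonymized value + stored difference.
restored_anon = decode(anon_bits, anon_code)
restored_offsets = decode(diff_bits, diff_code)
restored_ages = [int(rng.split("-")[0]) + d
                 for rng, d in zip(restored_anon, restored_offsets)]
```

In this toy case the anonymized column has only two distinct values, so each record costs a single bit. A practical scheme would also have to store the code tables alongside the bit streams, which this sketch leaves out.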