A general framework for privacy-preserving of data publication based on randomized response techniques

Chaobin Liu, Shixi Chen, Shuigeng Zhou, Jihong Guan, Yao Ma
DOI: https://doi.org/10.1016/j.is.2020.101648
IF: 3.18
2021-02-01
Information Systems
Abstract: Privacy preserving is a paramount concern in publishing datasets that contain sensitive information. Preventing privacy disclosure and providing useful information to legitimate users for data analyzing/mining are conflicting goals. Randomized response is a class of techniques that perturbs each sensitive value in a certain way, so that personal privacy is protected while the large-trend of the entire dataset is still recoverable. However, existing randomized response techniques do not allow to flexibly configure the level of privacy protection, support only a few types of aggregate queries, and can not achieve the best answer accuracy from perturbed data. These drawbacks impair the effectiveness of those techniques. This paper proposes a general framework based on randomized response techniques, which has good flexibility and extensibility, and can improve the effectiveness of randomized response methods. Our approach is validated by extensive experiments and comparison with existing randomized response and generalization methods.
computer science, information systems
What problem does this paper attempt to address?
The paper addresses the problem of protecting individual privacy when publishing datasets that contain sensitive information. Specifically, the authors focus on how to give legitimate users useful information for data analysis or mining while preventing privacy disclosure. Existing randomized response techniques do not allow the level of privacy protection to be configured flexibly, support only a few types of aggregate queries, and cannot achieve the best answer accuracy from perturbed data, all of which limit their effectiveness. The paper therefore proposes a general framework based on randomized response techniques that aims to improve the flexibility, extensibility, and effectiveness of these methods. The main contributions of the paper include:
1. Proposing a general framework for data publication based on randomized response techniques, which reduces the computational complexity of reconstructing unbiased estimated answers from exponential to linear by using matrix decomposition and the properties of the Kronecker product.
2. Proposing a general method for constructing recovery matrices from arbitrary perturbation matrices, which minimizes the variance of the unbiased estimated answers.
3. Developing perturbation and reconstruction algorithms for Boolean and categorical attributes, together with a theoretical analysis. These algorithms can be extended to numerical attributes.
4. Validating the effectiveness of the proposed framework through extensive experiments and comparisons with existing randomized response and generalization methods.
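To make the role of the Kronecker product concrete, here is a minimal sketch of randomized response on Boolean attributes. It is an illustration under stated assumptions, not the paper's algorithm: it assumes a Warner-style design in which every attribute is kept with the same probability `p` and flipped otherwise, and it recovers counts with the plain matrix inverse rather than the variance-minimizing recovery matrix the paper describes. All function names are hypothetical. Because independent per-attribute perturbation makes the joint perturbation matrix a Kronecker product of 2×2 matrices, its inverse can be applied one attribute (axis) at a time, without ever building the full 2^d × 2^d matrix.

```python
import numpy as np

def perturbation_matrix(p):
    """2x2 matrix for one Boolean attribute: keep the value with probability p, flip it otherwise."""
    return np.array([[p, 1.0 - p], [1.0 - p, p]])

def perturb(data, p, rng):
    """Flip each cell of a 0/1 integer array independently with probability 1 - p."""
    flips = rng.random(data.shape) >= p
    return np.where(flips, 1 - data, data)

def joint_histogram(data):
    """Count rows per value combination; the result has shape (2,) * d for d attributes."""
    d = data.shape[1]
    hist = np.zeros((2,) * d)
    np.add.at(hist, tuple(data.T), 1)
    return hist

def reconstruct(hist, p):
    """Unbiased estimate of the original histogram (assumes p != 0.5).

    Applies the 2x2 inverse along each axis in turn, which is equivalent to
    multiplying by the inverse of the Kronecker-product perturbation matrix.
    """
    inv = np.linalg.inv(perturbation_matrix(p))
    est = hist.astype(float)
    for axis in range(est.ndim):
        # Contract the inverse with one axis, then move the new axis back into place.
        est = np.moveaxis(np.tensordot(inv, est, axes=([1], [axis])), 0, axis)
    return est

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    original = rng.integers(0, 2, size=(10_000, 3))   # synthetic data, 3 Boolean attributes
    published = perturb(original, p=0.8, rng=rng)      # what the data owner would release
    estimate = reconstruct(joint_histogram(published), p=0.8)
    print(np.round(estimate))            # estimated counts per value combination
    print(joint_histogram(original))     # true counts, for comparison
```

The estimates are unbiased but noisy, and individual cells can come out negative when the sample is small or `p` is close to 0.5; reducing that variance is what the paper's recovery-matrix construction targets.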