Abstract: In a crowdsourcing database, human operators are embedded into the database engine and collaborate with conventional database operators to process queries. Each human operator publishes small HITs (Human Intelligence Tasks) to the crowdsourcing platform; each HIT consists of a set of database records and corresponding questions for human workers. The human workers complete the HITs and return the results to the crowdsourcing database for further processing. In practice, records published in HITs may contain sensitive attributes, which can leak private information: malicious workers could link the records with other public databases to reveal individuals' private information. Conventional privacy protection techniques, such as K-Anonymity, can be applied to partially solve the problem. However, because standard K-Anonymity algorithms generalize the data, their output may suffer uncontrollable information loss, which reduces the accuracy of crowdsourcing. In this paper, we first study the tradeoff between privacy and accuracy for the human operator within the data anonymization process. A probability model is proposed to estimate the lower and upper bounds of the accuracy for general K-Anonymity approaches. We show that finding the optimal anonymization approach is NP-hard, so only heuristic approaches are available. The second contribution of the paper is a general feedback-based K-Anonymity scheme. In our scheme, synthetic samples are published to the human workers, and their results are used to guide the selection of anonymization strategies. We apply the scheme to the Mondrian algorithm by adaptively cutting the dimensions based on the feedback results on the synthetic samples. We evaluate the performance of the feedback-based approach on a U.S. census dataset and show that, given a predefined $K$, our proposal outperforms standard K-Anonymity approaches in retaining the effectiveness of crowdsourcing.
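To make the Mondrian step mentioned in the abstract concrete, the following is a minimal sketch of median-cut Mondrian partitioning over numeric quasi-identifiers. It is an illustration under stated assumptions, not the paper's implementation: the names `mondrian`, `widest_dimension`, and `generalize` are hypothetical, records are assumed to be tuples of numbers, and the default dimension ordering (widest value range first) stands in for the feedback-based ordering, which the abstract does not specify. A feedback-driven variant could pass a different `rank_dims` function that orders dimensions using worker results on synthetic samples.

```python
from typing import Callable, List, Sequence, Tuple

Record = Tuple[float, ...]
RankFn = Callable[[Sequence[Record], List[int]], List[int]]


def widest_dimension(records: Sequence[Record], dims: List[int]) -> List[int]:
    """Assumed default heuristic: try dimensions with the widest value range first."""
    def span(d: int) -> float:
        values = [r[d] for r in records]
        return max(values) - min(values)
    return sorted(dims, key=span, reverse=True)


def mondrian(records: Sequence[Record], k: int,
             rank_dims: RankFn = widest_dimension) -> List[List[Record]]:
    """Recursively cut at the median until no cut leaves both sides with >= k records."""
    dims = list(range(len(records[0])))
    for d in rank_dims(records, dims):
        ordered = sorted(records, key=lambda r: r[d])
        median = ordered[len(ordered) // 2][d]
        left = [r for r in ordered if r[d] < median]
        right = [r for r in ordered if r[d] >= median]
        if len(left) >= k and len(right) >= k:
            return mondrian(left, k, rank_dims) + mondrian(right, k, rank_dims)
    # No allowable cut on any dimension: this partition is one equivalence class.
    return [list(records)]


def generalize(group: Sequence[Record]) -> Tuple[Tuple[float, float], ...]:
    """Replace each attribute with the (min, max) range observed in the group."""
    width = len(group[0])
    return tuple((min(r[d] for r in group), max(r[d] for r in group))
                 for d in range(width))


if __name__ == "__main__":
    # Hypothetical (age, income) records, used only for illustration.
    data = [(25, 50000), (27, 52000), (35, 61000),
            (36, 60000), (52, 90000), (55, 95000)]
    for group in mondrian(data, k=2):
        print(len(group), generalize(group))
```

Every returned group holds at least `k` records, so publishing the generalized ranges instead of the raw values satisfies K-Anonymity over these attributes. The order in which dimensions are tried determines how much detail each attribute retains, which is the choice the feedback-based scheme is described as adapting.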

Exploring Anonymous User Reviews: Linkability Analysis Based on Machine Learning

A Brief Survey on De-anonymization Attacks in Online Social Networks

Beyond Random Noise: Insights on Anonymization Strategies from a Latent Bandit Study

User Identity Linkage by Latent User Space Modelling

AI-Driven Anonymization: Protecting Personal Data Privacy While Leveraging Machine Learning

K-Anonymity for Crowdsourcing Database

LinkMirage: How to Anonymize Links in Dynamic Social Systems

Evaluating the disclosure risk of anonymized documents via a machine learning-based re-identification attack

Anonymizing Machine Learning Models

PsyLink: User Identity Linkage via Psychological Characteristic Modeling

Unmasking Falsehoods in Reviews: An Exploration of NLP Techniques

RLINK: Deep Reinforcement Learning for User Identity Linkage

User Identity Linkage on Social Networks: A Review of Modern Techniques and Applications

A Review of Privacy-Preserving Machine Learning Classification

User Identity Linkage in Social Media Using Linguistic and Social Interaction Features

Profile Matching Across Unstructured Online Social Networks: Threats and Countermeasures

User Identity De-Anonymization Based On Attributes

Linky: Visualizing User Identity Linkage Results For Multiple Online Social Networks

Data De-anonymization: From Mobility Traces to On-line Social Networks

Detecting Anomalous Online Reviewers: An Unsupervised Approach Using Mixture Models

Online Social Network Profile Linkage