Abstract: Data labeling, which assigns each data item to one of multiple classes, is indispensable for many applications, such as machine learning and data integration. However, existing labeling solutions either incur high cost on large datasets or produce noisy results. This paper introduces a cost-effective labeling approach and focuses on the labeling rule generation problem, which aims to generate high-quality rules that greatly reduce labeling cost while preserving quality. To address the problem, we first generate candidate rules and then devise a game-based crowdsourcing approach, CrowdGame, to select high-quality rules by considering coverage and accuracy. CrowdGame employs two groups of crowd workers: one group answers rule validation tasks (whether a rule is valid) and plays the role of rule generator, while the other group answers tuple checking tasks (whether the label of a data tuple is correct) and plays the role of rule refuter. We let the two groups play a two-player game: the rule generator identifies high-quality rules with large coverage, while the rule refuter tries to refute its opponent by checking tuples that provide enough evidence to reject rules with low accuracy. This paper studies three challenges in CrowdGame. The first is to balance the trade-off between coverage and accuracy; we define the loss of a rule by considering both factors. The second is rule accuracy estimation; we use Bayesian estimation to combine the answers from both rule validation and tuple checking tasks. The third is to select crowdsourcing tasks that fulfill the game-based framework while minimizing the loss; we introduce a minimax strategy and develop efficient task selection algorithms. We also develop a hybrid crowd-machine method for effective label assignment under budget-constrained crowdsourcing settings. We conduct experiments on entity matching and relation extraction, and the results show that our method outperforms state-of-the-art solutions.
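The abstract does not give the loss or the estimator in closed form, so the following is only a minimal sketch of how the two ideas could fit together. It assumes a Beta-Bernoulli model in which the rule validation answer sets the prior and tuple checking answers update it, and a loss that adds the uncovered fraction of tuples to the expected labeling errors. The names `RuleEvidence`, `estimate_accuracy`, `rule_set_loss`, and all constants are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class RuleEvidence:
    """Crowd evidence for one candidate labeling rule (hypothetical structure)."""
    covered: int          # number of tuples the rule would label
    validated: bool       # crowd answer to the rule validation task
    checked_correct: int  # tuple checks that confirmed the rule's label
    checked_wrong: int    # tuple checks that contradicted the rule's label


def estimate_accuracy(ev, prior_strength=2.0,
                      prior_if_valid=0.8, prior_if_invalid=0.3):
    """Posterior mean of rule accuracy under an assumed Beta-Bernoulli model.

    The rule validation answer picks the prior mean; each tuple check is
    treated as one Bernoulli observation. All constants are illustrative.
    """
    prior_mean = prior_if_valid if ev.validated else prior_if_invalid
    alpha = prior_mean * prior_strength + ev.checked_correct
    beta = (1.0 - prior_mean) * prior_strength + ev.checked_wrong
    return alpha / (alpha + beta)


def rule_set_loss(selected, total_tuples, error_weight=1.0):
    """Assumed loss: uncovered tuples plus weighted expected errors, normalized.

    Simplification: the selected rules are assumed to cover disjoint tuples.
    """
    covered = sum(ev.covered for ev in selected)
    expected_errors = sum(
        ev.covered * (1.0 - estimate_accuracy(ev)) for ev in selected
    )
    uncovered = max(total_tuples - covered, 0)
    return (uncovered + error_weight * expected_errors) / total_tuples


# A validated rule before any tuple checks, and the same rule after a
# refuter finds eight contradicting tuples.
unchecked = RuleEvidence(covered=100, validated=True,
                         checked_correct=0, checked_wrong=0)
refuted = RuleEvidence(covered=100, validated=True,
                       checked_correct=0, checked_wrong=8)
```

Under these assumptions, selecting a rule lowers the loss through coverage, and each contradicting tuple check lowers the estimated accuracy and pushes the loss back up. That is the tension the two-player game exploits: the generator proposes rules with large coverage, and the refuter looks for the tuple checks that most increase the loss of the generator's choice.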

Cost-Effective Data Annotation Using Game-Based Crowdsourcing
