Crowdsourcing with Difficulty: A Bayesian Rating Model for Heterogeneous Items

Seong Woo Han, Ozan Adıgüzel, Bob Carpenter
2024-10-22
Abstract: In applied statistics and machine learning, the "gold standards" used for training are often biased and almost always noisy. Dawid and Skene's justifiably popular crowdsourcing model adjusts for rater (coder, annotator) sensitivity and specificity, but fails to capture distributional properties of rating data gathered for training, which in turn biases training. In this study, we introduce a general purpose measurement-error model with which we can infer consensus categories by adding item-level effects for difficulty, discriminativeness, and guessability. We further show how to constrain the bimodal posterior of these models to avoid (or if necessary, allow) adversarial raters. We validate our model's goodness of fit with posterior predictive checks, the Bayesian analogue of $\chi^2$ tests. Dawid and Skene's model is rejected by goodness of fit tests, whereas our new model, which adjusts for item heterogeneity, is not rejected. We illustrate our new model with two well-studied data sets, binary rating data for caries in dental X-rays and implication in natural language.
Machine Learning
What problem does this paper attempt to address?
This paper addresses several problems in crowdsourced annotation data, chiefly rater bias and noise:

1. **Quality of annotated data**: In applied statistics and machine learning, the "gold standards" used for training are often biased and almost always noisy. Dawid and Skene's classic crowdsourcing model adjusts for rater sensitivity and specificity, but it fails to capture distributional properties of the rating data, which in turn biases training.
2. **Item-level difficulty, discrimination, and guessability**: The paper proposes a general-purpose measurement-error model that infers consensus categories by adding item-level effects for difficulty, discrimination, and guessability. These effects let the model account for heterogeneity across items.
3. **Handling adversarial raters**: The paper shows how to constrain the bimodal posterior of these models to avoid (or, when necessary, allow) adversarial raters. Adversarial raters are those whose ratings systematically run against the true category, which degrades the accuracy and reliability of the inferred consensus.
4. **Model verification and evaluation**: The authors use posterior predictive checks and leave-one-out cross-validation to assess goodness of fit and predictive accuracy. Dawid and Skene's model is rejected by the goodness-of-fit tests, whereas the new model, which adjusts for item heterogeneity, is not.
5. **Practical application cases**: The paper illustrates the model on two well-studied binary rating data sets: one on whether dental X-rays show caries, and one on sentence entailment in natural language.
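The contrast between points 1 to 3 can be made concrete with a minimal Python sketch. It is not the authors' implementation: the summary does not give the paper's parameterization, so the item-level function below assumes the standard three-parameter logistic (3PL) form from item response theory, and the names `ability`, `difficulty`, `discrimination`, and `guess` are hypothetical. The adversarial check uses the usual binary criterion that a rater does worse than chance when sensitivity plus specificity falls below 1.

```python
import math


def logistic(x: float) -> float:
    """Inverse logit, mapping the real line to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dawid_skene_prob_positive(z: int, sensitivity: float, specificity: float) -> float:
    """P(rating = 1 | true category z) under a binary Dawid and Skene model.

    The probability depends only on the rater, never on the item.
    """
    return sensitivity if z == 1 else 1.0 - specificity


def item_level_prob_correct(
    ability: float, difficulty: float, discrimination: float, guess: float
) -> float:
    """P(rating is correct) under an assumed 3PL-style item-level model.

    `guess` is the floor reached on very hard items; `discrimination`
    controls how sharply accuracy rises as ability exceeds difficulty.
    """
    return guess + (1.0 - guess) * logistic(discrimination * (ability - difficulty))


def is_adversarial(sensitivity: float, specificity: float) -> bool:
    """A binary rater is worse than chance when sensitivity + specificity < 1."""
    return sensitivity + specificity < 1.0


if __name__ == "__main__":
    # Same hypothetical rater, items of increasing difficulty.
    for difficulty in (-2.0, 0.0, 2.0):
        p = item_level_prob_correct(
            ability=1.0, difficulty=difficulty, discrimination=1.5, guess=0.5
        )
        print(f"difficulty={difficulty:+.1f}  P(correct)={p:.3f}")
```

In this sketch, the Dawid and Skene function returns the same probability for every item with the same true category, while the item-level function lets accuracy fall toward the guessing floor as items get harder. Constraining sensitivity plus specificity to stay above 1 is one way to remove the label-switched mode of a bimodal posterior.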
In summary, the paper aims to improve existing crowdsourcing annotation methods by modeling item heterogeneity alongside rater accuracy, thereby improving the quality of the inferred labels and of the models trained on them.
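The posterior predictive checks mentioned above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's procedure: the test statistic (the share of items on which all raters agree) is chosen here for simplicity, all raters share one sensitivity and specificity, and `draws` is a hypothetical stand-in for posterior draws from a fitted model.

```python
import random


def simulate_ratings(n_items, n_raters, prevalence, sensitivity, specificity, rng):
    """Simulate an items x raters binary matrix under a simplified
    Dawid and Skene model in which all raters share one accuracy."""
    ratings = []
    for _ in range(n_items):
        z = 1 if rng.random() < prevalence else 0  # latent true category
        p_pos = sensitivity if z == 1 else 1.0 - specificity
        ratings.append([1 if rng.random() < p_pos else 0 for _ in range(n_raters)])
    return ratings


def unanimous_fraction(ratings):
    """Test statistic: share of items on which every rater gives the same rating."""
    return sum(1 for row in ratings if len(set(row)) == 1) / len(ratings)


def ppc_p_value(observed, draws, rng):
    """Posterior predictive p-value: the fraction of replicated data sets
    whose statistic is at least as large as the observed one.

    `draws` is a list of (prevalence, sensitivity, specificity) tuples,
    standing in for posterior draws (hypothetical).
    """
    n_items, n_raters = len(observed), len(observed[0])
    t_obs = unanimous_fraction(observed)
    exceed = 0
    for prevalence, sensitivity, specificity in draws:
        replicated = simulate_ratings(
            n_items, n_raters, prevalence, sensitivity, specificity, rng
        )
        if unanimous_fraction(replicated) >= t_obs:
            exceed += 1
    return exceed / len(draws)


if __name__ == "__main__":
    rng = random.Random(1234)
    observed = simulate_ratings(200, 5, 0.3, 0.8, 0.9, rng)
    draws = [(0.3, 0.8, 0.9)] * 500  # placeholder for real posterior draws
    print("posterior predictive p-value:", ppc_p_value(observed, draws, rng))
```

A p-value near 0 or 1 indicates that the model rarely reproduces the observed statistic, which is the sense in which a model is "rejected" by such a check. If item difficulty varies, a model without item effects may misstate how often raters agree unanimously.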