Abstract: Human annotations are vital to supervised learning, yet annotators often disagree on the correct label, especially as annotation tasks increase in complexity. One strategy to improve label quality is to ask multiple annotators to label the same item and then aggregate their labels. Many aggregation models have been proposed for categorical or numerical annotation tasks, but far less work has considered more complex annotation tasks involving open-ended, multivariate, or structured responses. While a variety of bespoke models have been proposed for specific tasks, our work is the first to introduce aggregation methods that generalize across many diverse complex tasks, including sequence labeling, translation, syntactic parsing, ranking, bounding boxes, and keypoints. This generality is achieved by devising a task-agnostic method that models distances between labels rather than the labels themselves. This article extends our prior work with an investigation of three new research questions. First, how do complex annotation properties impact aggregation accuracy? Second, how should a task owner navigate the many modeling choices to maximize aggregation accuracy? Finally, what diagnostics can verify that aggregation models are specified correctly for the given data? To understand how various factors impact accuracy and to inform model selection, we conduct simulation studies and experiments on real, complex datasets. Regarding testing, we introduce unit tests for aggregation models and present a suite of such tests to ensure that a given model is not mis-specified and exhibits expected behavior. Beyond investigating these research questions, we discuss the foundational concept of annotation complexity, present a new aggregation model as a bridge between traditional models and our own, and contribute a new semi-supervised learning method for complex label aggregation that outperforms prior work.
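To make the idea of modeling distances between labels concrete, here is a minimal sketch in Python. It is an illustrative baseline under stated assumptions, not the aggregation models the article proposes: for one item, it selects the annotator label with the smallest mean distance to the other annotators' labels. The function names `select_most_central` and `iou_distance`, and the example boxes, are hypothetical and chosen only for illustration.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), an assumed box format


def select_most_central(labels: Sequence[T], distance: Callable[[T, T], float]) -> T:
    """Return the label with the smallest mean distance to the other labels.

    Only the pairwise distance function knows about the task; the selection
    logic itself never inspects the labels.
    """
    if not labels:
        raise ValueError("need at least one label")
    if len(labels) == 1:
        return labels[0]

    def mean_distance(i: int) -> float:
        total = sum(distance(labels[i], other)
                    for j, other in enumerate(labels) if j != i)
        return total / (len(labels) - 1)

    # On ties, min() keeps the earliest label in the input order.
    best = min(range(len(labels)), key=mean_distance)
    return labels[best]


def iou_distance(a: Box, b: Box) -> float:
    """Distance between two boxes: 1 minus intersection-over-union."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return 1.0 - inter / union if union > 0 else 0.0


# Four hypothetical annotators draw a box for the same object; one is far off.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (2, 2, 12, 12), (50, 50, 60, 60)]
chosen = select_most_central(boxes, iou_distance)
print(chosen)
```

Because only `distance` depends on the task, the same selection function could in principle be reused for rankings, sequences, or keypoints by swapping in a suitable distance function for that label type.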

To Aggregate or Not to Aggregate. That is the Question: A Case Study on Annotation Subjectivity in Span Prediction

Corpus Considerations for Annotator Modeling and Scaling

A Structured Span Selector

Modeling Legal Reasoning: LM Annotation at the Edge of Human Agreement

Span Identification of Epistemic Stance-Taking in Academic Written English

Aggregation Artifacts in Subjective Tasks Collapse Large Language Models' Posteriors

A General Model for Aggregating Annotations Across Simple, Complex, and Multi-Object Annotation Tasks

MaP: A Matrix-based Prediction Approach to Improve Span Extraction in Machine Reading Comprehension

Gaps or Hallucinations? Gazing into Machine-Generated Legal Analysis for Fine-grained Text Evaluations

Emotion-Aware, Emotion-Agnostic, or Automatic: Corpus Creation Strategies to Obtain Cognitive Event Appraisal Annotations

Multi-Fact Correction in Abstractive Text Summarization

Empirical legal analysis simplified: reducing complexity through automatic identification and evaluation of legally relevant factors

Dissecting Span Identification Tasks with Performance Prediction

When Reasoning Meets Information Aggregation: A Case Study with Sports Narratives

Human-LLM Hybrid Text Answer Aggregation for Crowd Annotations

CoAnnotating: Uncertainty-Guided Work Allocation between Human and Large Language Models for Data Annotation

GRASP: A Disagreement Analysis Framework to Assess Group Associations in Perspectives

A Corpus for Sentence-level Subjectivity Detection on English News Articles

Earlier Isn't Always Better: Sub-aspect Analysis on Corpus and System Biases in Summarization

Cost-Efficient Subjective Task Annotation and Modeling through Few-Shot Annotator Adaptation

Learning Span-Level Interactions for Aspect Sentiment Triplet Extraction