Abstract: The topic of annotating legal data has received surprisingly little attention. A key challenge of the annotation process is reaching sufficient agreement between annotators and separating mistakes from genuine disagreement. This study presents an approach that provides insights into and resolves potential disagreement amongst annotators. It (1) introduces different strategies to calculate agreement levels and compares (2) agreement levels between annotators (inter-annotator agreement) before and after a revision round and (3) agreement levels for annotators who annotate the same texts twice (intra-annotator agreement). The inter-annotator agreement levels are compared to a revision round in which an arbiter corrected the annotators' labels. The analysis is based on the annotation of EU legislative provisions at two stages (initial annotations, after annotator revisions) and for various tasks (Definitions, References, Quantities, IF-THEN statements, Exceptions, Scope, Hierarchy, Deontic Clauses, Active and Passive Role) by multiple annotators. The results reveal that agreement levels vary based on the stage of measurement (before/after revisions), the nature of the task, the method of assessment, and the annotator combination. The agreement scores, along with some initial measurements, align with those reported in previous research but increase after each revision round. This suggests that annotator revisions can substantially reduce disagreement. Additionally, disagreements were found not only between annotators but also within the annotations of individual annotators. This inconsistency does not appear to stem from a lack of understanding of the guidelines or a lack of seriousness in task execution, as evidenced by moderate to substantial inter-annotator agreement scores. These findings suggest that annotators identified multiple valid interpretations, which highlights the complexity of annotating legislative provisions.
The results underscore the significance of embracing, addressing, and reporting on (dis)agreement in different ways and at the various stages of an annotation task.
