Measuring agreement among several raters classifying subjects into one-or-more (hierarchical) nominal categories. A generalisation of Fleiss' kappa

Filip Moons, Ellen Vandervieren

2023-03-22

Abstract: Cohen's and Fleiss' kappa are well-known measures of inter-rater reliability. However, they only allow a rater to select exactly one category for each subject. This is a severe limitation in some research contexts: for example, measuring the inter-rater reliability of a group of psychiatrists diagnosing patients with multiple disorders is impossible with these measures. This paper proposes a generalisation of Fleiss' kappa that lifts this limitation. Specifically, the proposed $\kappa$ statistic measures inter-rater reliability between multiple raters classifying subjects into one or more nominal categories. These categories can be weighted according to their importance, and the measure can take the category hierarchy into account (e.g., subcategories that are only available once the main category has been chosen, such as a primary psychiatric disorder and its sub-disorders, although much more complex dependencies between categories are possible as well). The proposed $\kappa$ statistic can handle missing data and a varying number of raters for subjects or categories. The paper briefly overviews existing methods that allow raters to classify subjects into multiple categories. Next, we derive our proposed measure step by step and prove that it equals Fleiss' kappa when a fixed number of raters choose exactly one category for each subject. The measure was developed to investigate the reliability of a new mathematics assessment method, of which an example is elaborated. The paper concludes with the worked-out example of psychiatrists diagnosing patients with multiple disorders.

Methodology, Statistics Theory

What problem does this paper attempt to address?

The paper addresses how to measure agreement among multiple raters when the categories are not mutually exclusive, so that a rater may assign a subject to more than one category. Cohen's kappa and Fleiss' kappa only cover the case where each rater selects exactly one category per subject, which does not fit some research contexts. For example, in psychiatric diagnosis a patient may have several disorders at once, so each psychiatrist may need to assign several categories to the same patient. The paper proposes a new κ statistic, a generalisation of Fleiss' kappa, for multiple raters classifying subjects into one or more nominal categories. The categories can be weighted according to their importance, and the statistic can take a category hierarchy into account (e.g., main categories with subcategories that are only available once the main category is chosen). It can also handle missing data and a varying number of raters per subject or per category. When a fixed number of raters each choose exactly one category per subject, the proposed statistic equals Fleiss' kappa. The aim is to make inter-rater reliability assessment possible in these more complex classification settings.
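For reference, the baseline that the paper generalises can be sketched in a few lines. The block below computes the standard Fleiss' kappa under its usual restriction (a fixed number of raters, each choosing exactly one category per subject). It is not the proposed statistic: weights, hierarchies, multiple selections, and missing data are not covered. The function name `fleiss_kappa` and the count-matrix input layout are choices made for this illustration.

```python
def fleiss_kappa(counts):
    """Standard Fleiss' kappa (illustrative sketch, not the paper's generalisation).

    counts[i][j] is the number of raters who assigned subject i to category j.
    Every row must sum to the same number of raters (one category per rater).
    """
    n_subjects = len(counts)
    n_categories = len(counts[0])
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("at least two raters per subject are required")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every subject must be rated by the same number of raters")

    # Observed agreement per subject: share of rater pairs that agree.
    per_subject = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_subject) / n_subjects

    # Chance agreement: sum of squared overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_subjects * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is ever used")

    return (p_observed - p_expected) / (1.0 - p_expected)


# Three raters, two subjects, two categories, full agreement on each subject.
perfect = fleiss_kappa([[3, 0], [0, 3]])
# Same setup, but the raters split 2-1 and 1-2.
split = fleiss_kappa([[2, 1], [1, 2]])
```

A row such as `[2, 1]` forces each of the three raters into exactly one category, so a subject with two simultaneous diagnoses cannot be represented in this layout. That is the limitation the proposed statistic is meant to lift.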

