LLM Confidence Evaluation Measures in Zero-Shot CSS Classification

David Farr, Iain Cruickshank, Nico Manzonelli, Nicholas Clark, Kate Starbird, Jevin West
2024-10-17
Abstract: Assessing classification confidence is critical for leveraging large language models (LLMs) in automated labeling tasks, especially in the sensitive domains presented by Computational Social Science (CSS) tasks. In this paper, we make three key contributions: (1) we propose an uncertainty quantification (UQ) performance measure tailored for data annotation tasks, (2) we compare, for the first time, five different UQ strategies across three distinct LLMs and CSS data annotation tasks, and (3) we introduce a novel UQ aggregation strategy that effectively identifies low-confidence LLM annotations and disproportionately uncovers data incorrectly labeled by the LLMs. Our results demonstrate that our proposed UQ aggregation strategy improves upon existing methods and can be used to significantly improve human-in-the-loop data annotation processes.
Human-Computer Interaction, Computation and Language, Information Retrieval
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses how to assess the confidence of large language models (LLMs) in zero-shot classification tasks within Computational Social Science (CSS). Specifically, it focuses on the following points:

1. **Importance of Evaluating Classification Confidence**: In automated annotation tasks, especially sensitive CSS tasks, evaluating the confidence of LLM classifications is crucial. Treating incorrect LLM-generated labels as correct could lead to serious consequences.
2. **Limitations of Existing Methods**: Although some LLMs can express uncertainty, developers often restrict the model's output to manage non-deterministic behavior or reduce generation costs. These restrictions may cause LLMs to give confident answers even when they lack the correct knowledge.
3. **Challenges of Zero-Shot Classification**: Quantifying the confidence of LLM-generated labels without prior training data is a significant challenge. Existing Uncertainty Quantification (UQ) methods often require labeled datasets, which are not available in many real-world scenarios.

### Main Contributions

To address these issues, the paper makes three main contributions:

1. **Proposing a New UQ Performance Metric**: The paper introduces a UQ performance metric specifically designed for data annotation tasks.
2. **Comparing Different UQ Strategies**: For the first time, the paper compares five different UQ strategies across three LLMs and three CSS data annotation tasks.
3. **Introducing a New UQ Aggregation Strategy**: The paper proposes a new UQ aggregation strategy that effectively identifies low-confidence LLM annotations and disproportionately uncovers data mislabelled by the LLMs. Experimental results show that this aggregation strategy improves upon existing methods and can significantly improve human-in-the-loop data annotation processes.
### Experimental Design

The paper evaluates five UQ techniques on three different LLMs (Llama-3.1 8B Instruct, Flan UL2, GPT-4o) and three different CSS tasks (stance detection, ideology detection, frame detection). For each LLM and task, the labeled data are sorted by confidence from low to high, so that low-confidence items can be sampled for human annotation and high-confidence items can be passed to downstream classifiers.

### Results

Experimental results show that the **Confidence Ensemble** is the most robust UQ strategy, performing best across all model types. For individual LLMs, the difference between the two highest log probabilities is also an effective UQ mechanism. Additionally, the paper introduces a new evaluation metric for comparing UQ strategies: the ability to recall mislabelled data at low confidence, measured by the Area Under the Curve (AUC).

### Conclusion

By evaluating several easy-to-implement UQ-based sampling strategies, the paper finds that a confidence ensemble is the most effective method for identifying mislabelled data. When only one LLM is available, the difference between the two highest log probabilities is also an effective UQ mechanism. Using LLMs to annotate CSS data is a rapidly developing trend, but human evaluation of the quality of the generated labels remains very important. The results indicate that having humans review a small, uncertainty-selected portion of the data identifies a disproportionate share of the mislabelled data.
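
As a rough illustration of two ideas summarized here, the sketch below computes a top-two log-probability margin for each item and then scores how well a low-to-high confidence ordering recalls mislabelled items. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the toy data, and the trapezoidal form of the AUC are hypothetical, and the paper's exact metric definition may differ.

```python
from typing import Dict, List


def top_two_margin(label_logprobs: Dict[str, float]) -> float:
    """Confidence as the gap between the two highest label log probabilities.

    Assumes the LLM exposes one log probability per candidate label.
    """
    ranked = sorted(label_logprobs.values(), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two candidate labels")
    return ranked[0] - ranked[1]


def mislabel_recall_auc(confidences: List[float], is_wrong: List[bool]) -> float:
    """Area under the curve of mislabel recall vs. fraction of data reviewed.

    Items are reviewed from lowest to highest confidence. A random ordering
    scores about 0.5; higher values mean errors surface earlier.
    (Assumed reading of the metric, using the trapezoidal rule.)
    """
    total_wrong = sum(is_wrong)
    if total_wrong == 0:
        raise ValueError("no mislabelled items to recall")
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    n = len(order)
    found, prev_recall, auc = 0, 0.0, 0.0
    for i in order:
        found += is_wrong[i]
        recall = found / total_wrong
        auc += (prev_recall + recall) / 2 / n  # one trapezoid of width 1/n
        prev_recall = recall
    return auc


# Toy stance-detection outputs (made-up numbers, for illustration only).
items = [
    {"favor": -0.1, "against": -2.5, "neutral": -3.0},
    {"favor": -0.9, "against": -1.0, "neutral": -1.6},
    {"favor": -0.2, "against": -1.9, "neutral": -3.2},
    {"favor": -1.4, "against": -1.1, "neutral": -1.5},
]
llm_was_wrong = [False, True, False, True]  # hypothetical ground truth check

margins = [top_two_margin(lp) for lp in items]
review_order = sorted(range(len(items)), key=lambda i: margins[i])
auc = mislabel_recall_auc(margins, llm_was_wrong)
print(review_order, round(auc, 4))
```

In this toy example the two mislabelled items have the smallest margins, so they are reviewed first and the score is well above the 0.5 expected from a random ordering. A confidence ensemble could be evaluated the same way by passing its combined scores as `confidences`.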