Abstract: Assessing classification confidence is critical for leveraging large language models (LLMs) in automated labeling tasks, especially in the sensitive domains presented by Computational Social Science (CSS) tasks. In this paper, we make three key contributions: (1) we propose an uncertainty quantification (UQ) performance measure tailored for data annotation tasks, (2) we compare, for the first time, five different UQ strategies across three distinct LLMs and CSS data annotation tasks, (3) we introduce a novel UQ aggregation strategy that effectively identifies low-confidence LLM annotations and disproportionately uncovers data incorrectly labeled by the LLMs. Our results demonstrate that our proposed UQ aggregation strategy improves upon existing methods and can be used to significantly improve human-in-the-loop data annotation processes.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses how to assess the confidence of large language models (LLMs) in zero-shot classification tasks within Computational Social Science (CSS). Specifically, it focuses on the following points:

1. **Importance of Evaluating Classification Confidence**: In automated annotation tasks, especially in sensitive CSS tasks, evaluating the confidence of LLM classifications is crucial. If labels generated by LLMs are mistakenly treated as correct, the consequences can be serious.
2. **Limitations of Existing Methods**: Although some LLMs can express uncertainty, developers often restrict the model's output to manage non-deterministic behavior or reduce generation costs. These restrictions may cause LLMs to give confident answers even when they lack the relevant knowledge.
3. **Challenges of Zero-Shot Classification**: Quantifying the confidence of LLM-generated labels without prior training data is difficult. Existing Uncertainty Quantification (UQ) methods often require labeled datasets, which are not available in many real-world scenarios.

### Main Contributions

To address these issues, the paper makes three main contributions:

1. **Proposing a New UQ Performance Metric**: The paper introduces a UQ performance metric specifically designed for data annotation tasks.
2. **Comparing Different UQ Strategies**: For the first time, the paper compares five different UQ strategies on the same set of LLMs and CSS data annotation tasks.
3. **Introducing a New UQ Aggregation Strategy**: The paper proposes a new UQ aggregation strategy that effectively identifies low-confidence LLM annotations and uncovers a disproportionate share of the data mislabelled by the LLMs. Experimental results show that this aggregation strategy outperforms existing methods and can significantly improve human-in-the-loop data annotation processes.
### Experimental Design

The paper evaluates five UQ techniques on three different LLMs (Llama-3.1 8B Instruct, Flan UL2, GPT-4o) and three different CSS tasks (stance detection, ideology detection, frame detection). For each LLM and task, the annotated data are sorted by confidence from low to high, so that low-confidence items can be sampled for human annotation and high-confidence items can be passed to downstream classifiers.

### Results

Experimental results show that the **Confidence Ensemble** is the most robust UQ strategy, performing best across all model types. For individual LLMs, the difference between the two highest log probabilities is also an effective UQ mechanism. Additionally, the paper introduces a new evaluation metric for comparing UQ strategies: the ability to recall mislabelled data at low confidence, measured by the Area Under Curve (AUC).

### Conclusion

By evaluating several easy-to-implement UQ-based sampling strategies, the paper finds that a confidence ensemble is the most effective method for identifying mislabelled data. When only one LLM is available, the difference between the two highest log probabilities is also an effective UQ mechanism. Using LLMs to annotate CSS data is a rapidly developing trend, but human evaluation of the quality of the generated labels remains very important. The results indicate that having humans review a small, uncertainty-selected portion of the data can surface a disproportionate share of the mislabelled items.
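
The top-two log-probability margin and the low-confidence recall AUC mentioned above can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the paper's implementation: the function names (`top_two_margin`, `recall_curve`, `area_under_curve`), the toy log probabilities, and the choice of a trapezoidal area over the curve of "fraction reviewed" versus "fraction of mislabelled items found" are all hypothetical, and the paper's exact metric definition may differ.

```python
from typing import List, Sequence, Tuple


def top_two_margin(logprobs: Sequence[float]) -> float:
    """Confidence score: gap between the two highest label log probabilities."""
    if len(logprobs) < 2:
        raise ValueError("need log probabilities for at least two labels")
    best, second = sorted(logprobs, reverse=True)[:2]
    return best - second


def recall_curve(
    confidences: Sequence[float], is_wrong: Sequence[bool]
) -> List[Tuple[float, float]]:
    """Review items from lowest to highest confidence and record
    (fraction reviewed, fraction of mislabelled items found) after each item."""
    total_wrong = sum(is_wrong)
    if total_wrong == 0:
        raise ValueError("no mislabelled items to recall")
    # Sort item indices so the least confident annotation comes first.
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    points = [(0.0, 0.0)]
    found = 0
    for rank, idx in enumerate(order, start=1):
        found += int(is_wrong[idx])
        points.append((rank / len(order), found / total_wrong))
    return points


def area_under_curve(points: Sequence[Tuple[float, float]]) -> float:
    """Trapezoidal area under a piecewise-linear curve."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Toy data (hypothetical): per-item log probabilities over three labels,
# plus a flag saying whether the LLM's top label was actually wrong.
toy_logprobs = [
    [-0.1, -2.5, -4.0],  # large margin, correct
    [-0.9, -1.0, -2.0],  # small margin, wrong
    [-0.2, -2.0, -3.0],  # large margin, correct
    [-0.7, -1.1, -2.5],  # small margin, wrong
]
toy_is_wrong = [False, True, False, True]

toy_confidences = [top_two_margin(lp) for lp in toy_logprobs]
toy_auc = area_under_curve(recall_curve(toy_confidences, toy_is_wrong))
print(f"recall AUC: {toy_auc:.2f}")
```

Under this definition, a confidence score that carries no information about errors would give an AUC of about 0.5 in expectation, while a score that places every mislabelled item at the low-confidence end approaches 1. In the toy data both wrong labels have the smallest margins, so reviewing the first half of the sorted items already finds all of them.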

LLM Confidence Evaluation Measures in Zero-Shot CSS Classification

CSS: Contrastive Semantic Similarity for Uncertainty Quantification of LLMs

Can Unconfident LLM Annotations Be Used for Confident Conclusions?

Generating with Confidence: Uncertainty Quantification for Black-box Large Language Models

Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs

Black-box Uncertainty Quantification Method for LLM-as-a-Judge

LUQ: Long-text Uncertainty Quantification for LLMs

Large Language Model Confidence Estimation via Black-Box Access

Semantic Density: Uncertainty Quantification for Large Language Models through Confidence Measurement in Semantic Space

Benchmarking LLMs via Uncertainty Quantification

Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph

Can Large Language Models Transform Computational Social Science?

Label-Confidence-Aware Uncertainty Estimation in Natural Language Generation

SPUQ: Perturbation-Based Uncertainty Quantification for Large Language Models

Factual Confidence of LLMs: on Reliability and Robustness of Current Estimators

Cycles of Thought: Measuring LLM Confidence through Stable Explanations

Reconfidencing LLMs from the Grouping Loss Perspective

A Comprehensive Study of Multilingual Confidence Estimation on Large Language Models

Legitimate ground-truth-free metrics for deep uncertainty classification scoring

Multicalibration for Confidence Scoring in LLMs