Abstract: Human label variation (HLV) is a valuable source of information that arises when multiple human annotators provide different labels for valid reasons. In Natural Language Inference (NLI), earlier approaches to capturing HLV involve either collecting annotations from many crowd workers to represent the human judgment distribution (HJD) or using expert linguists to provide detailed explanations for their chosen labels. While the former method provides denser HJD information, obtaining it is resource-intensive. In contrast, the latter offers richer textual information but is challenging to scale up to many human judges. In addition, large language models (LLMs) are increasingly used as evaluators ("LLM judges") but with mixed results, and few works aim to study HJDs. This study proposes to exploit LLMs to approximate HJDs using a small number of expert labels and explanations. Our experiments show that a few explanations significantly improve LLMs' ability to approximate HJDs with and without explicit labels, thereby providing a solution to scale up annotations for HJD. However, fine-tuning smaller soft-label aware models with the LLM-generated model judgment distributions (MJDs) presents partially inconsistent results: while similar in distance, their resulting fine-tuned models and visualized distributions differ substantially. We show the importance of complementing instance-level distance measures with a global-level shape metric and visualization to more effectively evaluate MJDs against human judgment distributions.

What problem does this paper attempt to address?

The core problem this paper addresses is: **Can large language models (LLMs) approximate human judgment distributions (HJDs) in natural language inference (NLI) tasks from a small number of expert explanations?** Specifically, the researchers ask whether labels and explanations provided by a few experts allow LLMs to better simulate the judgment distributions collected from a large number of annotators. They also explore whether the resulting model judgment distributions (MJDs) can serve as soft labels for fine-tuning smaller models.

### Main Research Questions

1. **Can LLMs better approximate the human judgment distributions collected from a large number of annotators when given a small number of detailed explanations?**
2. **Are the model judgment distributions (MJDs) generated by LLMs suitable as soft labels for fine-tuning smaller models to predict distributions?**

### Research Background

In natural language processing (NLP), human label variation (HLV) refers to the situation where multiple annotators provide different but reasonable labels for the same item. This variation can stem from differences among annotators, subjectivity, or multiple plausible interpretations. For NLI tasks, previous methods either collect a large number of annotations from crowd workers to represent human judgment distributions, or obtain detailed explanations from experts. The former provides denser HJD information but is resource-intensive; the latter provides richer textual information but is difficult to scale to a large number of annotators.

### Solution

This research proposes using LLMs to approximate HJDs from the labels and explanations provided by a small number of experts. The experimental results show that a small number of explanations significantly improve the ability of LLMs to approximate HJDs, with or without explicit labels. However, when the LLM-generated MJDs are used to fine-tune smaller models, the results are partially inconsistent: although the distances are similar, the fine-tuned models and the visualized distributions differ substantially.

### Experimental Design

- **Dataset**: Two NLI datasets containing HLV were used: ChaosNLI and VariErr NLI.
- **Model**: Two open-source large language models were used: Mixtral and Llama3.
- **Experimental Type**:
  - **Distribution Comparison**: Compare the MJDs generated by the LLMs against the human judgment distributions (HJDs).
  - **Fine-Tuning Comparison**: Evaluate the generated MJDs as soft labels for fine-tuning smaller models.

### Experimental Results

- **Distribution Comparison**: After explanations are added, the MJDs generated by the LLMs move closer to the HJDs, with the best results obtained when explicit explanations are used in parallel mode.
- **Fine-Tuning Comparison**: Although the fine-tuned models perform well in terms of KL divergence and cross-entropy loss, their F1 scores show a different pattern. Overall, adding explicit explanations helps to obtain better model performance, especially for Llama3.

### Conclusion

This research shows that, given a small number of expert explanations, LLMs can approximate human judgment distributions to a certain extent, and the generated MJDs can be used as soft labels to fine-tune smaller models. However, the fine-tuning effect depends on the specific explanation method and the choice of model.
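The instance-level measures mentioned above, KL divergence and cross-entropy against a soft label, can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the function names, the three-way label order, the direction KL(HJD || MJD), and the example counts are all assumptions made for this sketch.

```python
import math

# Assumed label order for a three-way NLI task (illustrative only).
LABELS = ("entailment", "neutral", "contradiction")


def normalize(counts):
    """Turn raw annotator vote counts into a probability distribution."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return [c / total for c in counts]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; terms where p_i == 0 contribute nothing.

    q is floored at eps so a zero model probability does not divide by zero.
    """
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def soft_cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy of a predicted distribution against a soft target."""
    return -sum(
        t * math.log(max(pr, eps)) for t, pr in zip(target, predicted) if t > 0
    )


# Hypothetical example: 100 crowd votes for one premise-hypothesis pair,
# and a made-up model judgment distribution for the same pair.
hjd = normalize([60, 30, 10])
mjd = [0.5, 0.4, 0.1]

print("KL(HJD || MJD):", kl_divergence(hjd, mjd))
print("soft cross-entropy:", soft_cross_entropy(hjd, mjd))
```

Both quantities are computed per instance and then typically averaged over a dataset, which is why two sets of MJDs can score similarly on distance while differing in overall shape; the abstract's call for a global-level shape metric and visualization addresses exactly that gap.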

"Seeing the Big through the Small": Can LLMs Approximate Human Judgment Distributions on NLI from a Few Explanations?

Using Natural Language Explanations to Rescale Human Judgments

Human-LLM Collaborative Annotation Through Effective Verification of LLM Labels

Can LLM be a Personalized Judge?

DHP Benchmark: Are LLMs Good NLG Evaluators?

Lost in Inference: Rediscovering the Role of Natural Language Inference for Large Language Models

Evaluating Explanations Through LLMs: Beyond Traditional User Studies

MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark

Reasons to Reject? Aligning Language Models with Judgments

Dissecting Human and LLM Preferences

Human-Centered Design Recommendations for LLM-as-a-Judge

Are LLMs Better than Reported? Detecting Label Errors and Mitigating Their Effect on Model Performance

LLM Processes: Numerical Predictive Distributions Conditioned on Natural Language

Judging LLM-as-a-judge with MT-Bench and Chatbot Arena

Leveraging LLMs for Dialogue Quality Measurement

Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges

Bayesian Statistical Modeling with Predictors from LLMs

HLB: Benchmarking LLMs' Humanlikeness in Language Use

Alignment Between the Decision-Making Logic of LLMs and Human Cognition: A Case Study on Legal LLMs

Understanding the Role of LLMs in Multimodal Evaluation Benchmarks