Confidence Estimation for LLM-Based Dialogue State Tracking

Yi-Jyun Sun, Suvodip Dey, Dilek Hakkani-Tur, Gokhan Tur

2024-09-21

Abstract: Estimation of a model's confidence on its outputs is critical for Conversational AI systems based on large language models (LLMs), especially for reducing hallucination and preventing over-reliance. In this work, we provide an exhaustive exploration of methods, including approaches proposed for open- and closed-weight LLMs, aimed at quantifying and leveraging model uncertainty to improve the reliability of LLM-generated responses, specifically focusing on dialogue state tracking (DST) in task-oriented dialogue systems (TODS). Regardless of the model type, well-calibrated confidence scores are essential to handle uncertainties, thereby improving model performance. We evaluate four methods for estimating confidence scores based on softmax, raw token scores, verbalized confidences, and a combination of these methods, using the area under the curve (AUC) metric to assess calibration, with higher AUC indicating better calibration. We also enhance these with a self-probing mechanism, proposed for closed models. Furthermore, we assess these methods using an open-weight model fine-tuned for the task of DST, achieving superior joint goal accuracy (JGA). Our findings also suggest that fine-tuning open-weight LLMs can result in enhanced AUC performance, indicating better confidence score calibration.

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

This paper aims to make dialogue state tracking (DST) in task-oriented dialogue systems (TODS) built on large language models (LLMs) more reliable by estimating how confident the model is in its outputs, which helps reduce hallucination and prevent over-reliance on the model. It evaluates four ways of estimating confidence scores: softmax-based scores, raw token scores, verbalized confidence, and a combination of these. Calibration is measured with the area under the curve (AUC) metric, where a higher AUC indicates better calibration. The paper also applies a self-probing mechanism, originally proposed for closed models, to enhance these estimates. Finally, it assesses the methods on an open-weight model fine-tuned for DST, reporting superior joint goal accuracy (JGA) and finding that fine-tuning can improve AUC, which suggests better-calibrated confidence scores.
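To make the evaluation idea concrete, here is a minimal Python sketch of how a softmax-based confidence score could be computed for a predicted slot value and how a set of such scores could be checked with AUC. It is an illustration under stated assumptions, not the authors' implementation: the function names, the choice to multiply token probabilities, the weighted-average combination, the reading of AUC as the area under the ROC curve, and the toy numbers are all hypothetical.

```python
import math
from typing import Sequence


def softmax_confidence(token_logprobs: Sequence[float]) -> float:
    """Confidence of one generated slot value.

    Assumption: the score is the product of the softmax probabilities of the
    value's tokens, computed here as exp(sum of token log-probabilities).
    """
    return math.exp(sum(token_logprobs))


def combined_confidence(softmax_conf: float, verbalized_conf: float,
                        weight: float = 0.5) -> float:
    """Hypothetical combination: a weighted average of two scores in [0, 1].

    The source only says the methods are combined; this rule is a placeholder.
    """
    return weight * softmax_conf + (1.0 - weight) * verbalized_conf


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a correct prediction (label 1) receives a
    higher score than an incorrect one (label 0); ties count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect prediction")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (made up): token log-probabilities for four predicted slot values,
# and whether each prediction matched the gold dialogue state.
predictions = [
    ([-0.05, -0.10], 1),
    ([-0.70, -0.90], 0),
    ([-0.20, -0.30], 1),
    ([-0.10, -0.20], 0),
]
scores = [softmax_confidence(lp) for lp, _ in predictions]
labels = [y for _, y in predictions]
auc = roc_auc(scores, labels)
print(f"AUC = {auc:.2f}")
```

On this toy data, three of the four correct/incorrect pairs are ranked in the right order, so the sketch prints an AUC of 0.75. A verbalized confidence (the number the model states when asked how sure it is) could be passed through `combined_confidence` and scored the same way.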

Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs

A Survey of Confidence Estimation and Calibration in Large Language Models

Mismatch between Multi-turn Dialogue and its Evaluation Metric in Dialogue State Tracking

Large Language Model Confidence Estimation via Black-Box Access

Confidence Under the Hood: An Investigation into the Confidence-Probability Alignment in Large Language Models

Enhancing Dialogue State Tracking Models through LLM-backed User-Agents Simulation

A Comprehensive Study of Multilingual Confidence Estimation on Large Language Models

Confidence Calibration and Rationalization for LLMs via Multi-Agent Deliberation

SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales

Large Language Models as Zero-shot Dialogue State Tracker through Function Calling

The Calibration Gap between Model and Human Confidence in Large Language Models

MlingConf: A Comprehensive Study of Multilingual Confidence Estimation on Large Language Models

Llamas Know What GPTs Don't Show: Surrogate Models for Confidence Estimation

Leveraging LLMs for Dialogue Quality Measurement

Graph-based Confidence Calibration for Large Language Models

STN4DST: A Scalable Dialogue State Tracking based on Slot Tagging Navigation

A Two-dimensional Zero-shot Dialogue State Tracking Evaluation Method using GPT-4

CAUSE: Counterfactual Assessment of User Satisfaction Estimation in Task-Oriented Dialogue Systems

Overconfidence is Key: Verbalized Uncertainty Evaluation in Large Language and Vision-Language Models

Granular Change Accuracy: A More Accurate Performance Metric for Dialogue State Tracking