Abstract:

**Introduction:** The inability for Large Language Models (LLMs) to communicate uncertainty is a significant barrier to their use in medicine. Before LLMs can be integrated into patient care, the field must assess methods to measure uncertainty in ways that are useful to physician-users.

**Objective:** Evaluate the ability for uncertainty metrics to quantify LLM confidence when performing diagnosis and treatment selection tasks by assessing the properties of discrimination and calibration.

**Methods:** We examined the discrimination and calibration of Confidence Elicitation, Token-Level Probabilities, and Sample Consistency metrics across GPT3.5, GPT4, Llama2-70B and Llama3-70B. Uncertainty metrics were evaluated against three datasets of open-ended patient scenarios.

**Results:** Sample Consistency methods outperformed Token Level Probability and Confidence Elicitation methods. Sample Consistency by sentence embedding cosine similarity achieved the highest discrimination performance with poor calibration, while Sample Consistency by GPT annotation achieved the second-best discrimination with more accurate calibration. Nearly all uncertainty metrics had better discriminative performance with diagnosis questions rather than treatment selection questions and verbalized confidence (Confidence Elicitation) was found to consistently over-estimate model confidence.

**Conclusions:** Sample Consistency methods are the optimal metrics for assessing LLM uncertainty for the tasks of medical diagnosis and treatment selection. We suggest Sample Consistency by sentence embedding cosine similarity if the user has a set of reference cases with which to re-calibrate their results, and Sample Consistency by GPT annotation if the user does not have reference cases and requires accurate raw calibration. Our results also confirm LLMs are consistently over-confident when verbalizing their confidence through Confidence Elicitation.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper examines how poorly large language models (LLMs) express uncertainty in medical diagnosis and treatment selection tasks. Specifically, it evaluates several uncertainty metrics for quantifying LLM confidence in these tasks and analyzes the discrimination and calibration of each metric.

### Background and Motivation

In medicine, assessing the uncertainty of a diagnostic or treatment recommendation is crucial for proper patient care. Doctors make the final management decisions, such as prescribing medication or recommending surgery, based on the information the model provides, and those decisions carry risk. Although current LLMs have shown near-doctor accuracy in some tasks, they struggle to express uncertainty effectively. Before these models are integrated into clinical practice, it is therefore essential to find effective methods to measure and calibrate their uncertainty.

### Research Objectives

The main objective of the paper is to evaluate the performance of the following three uncertainty measurement methods in medical diagnosis and treatment selection tasks:

1. **Confidence Elicitation**: Directly prompting the model to state its confidence.
2. **Token-Level Probabilities**: Using the probabilities of the tokens in the generated text to calculate uncertainty.
3. **Sample Consistency**: Estimating uncertainty by running the same question multiple times and comparing the consistency of the responses.

### Methods

The researchers used four large language models (GPT3.5, GPT4, Llama2-70B, and Llama3-70B) and three datasets of open-ended patient scenarios (MedQA, NEJM Case Report Series, and Stanford custom dataset) to evaluate the performance of these uncertainty measurement methods.

The specific steps include:

- **Confidence Elicitation**: Using a two-step method, first generating the model response, then submitting the entire question-response pair for confidence elicitation.
- **Token-Level Probabilities**: Calculating the average probability and minimum probability of tokens in the generated text.
- **Sample Consistency**: Estimating uncertainty by running the same question multiple times and comparing the consistency of responses, using GPT 4 annotations and sentence embedding distance to evaluate response consistency.

### Results

- **Sample Consistency** methods were the best at distinguishing correct from incorrect answers, especially in diagnostic tasks. Sample consistency based on sentence embeddings showed the best discriminative performance (ROC AUC 0.68–0.79) but poor calibration, while sample consistency based on GPT annotations had slightly lower discriminative performance (ROC AUC 0.66–0.74) but better calibration.
- **Confidence Elicitation** performed poorly in earlier models (such as GPT 3.5 and Llama2) but showed more consistent performance in newer models (such as GPT 4 and Llama3).
- **Token-Level Probabilities** performed well in earlier models but poorly in newer ones.

### Conclusion

- **Sample Consistency** is the most effective approach for assessing LLM uncertainty in these tasks. Sample consistency based on sentence embeddings is suggested when users have a set of reference cases with which to recalibrate the results, while sample consistency based on GPT annotations is suggested when there are no reference cases and accurate raw calibration is needed.
- **Confidence Elicitation** (verbalized confidence) consistently overestimates the model's confidence.
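As a rough illustration of two of the metrics described under Methods, the sketch below computes the average and minimum token probability from a list of token log-probabilities, and a sample-consistency score from sentence embeddings of repeated responses. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names are hypothetical, the embeddings are assumed to be precomputed by some sentence-embedding model, and averaging the pairwise cosine similarities is one plausible aggregation that the summary above does not confirm.

```python
import numpy as np


def token_probability_scores(token_logprobs):
    """Return (mean probability, minimum probability) over generated tokens.

    `token_logprobs` is assumed to hold the natural-log probability of each
    generated token, as many LLM APIs can return.
    """
    probs = np.exp(np.asarray(token_logprobs, dtype=float))
    if probs.size == 0:
        raise ValueError("need at least one token")
    return float(probs.mean()), float(probs.min())


def embedding_consistency(embeddings):
    """Mean pairwise cosine similarity between sampled responses.

    `embeddings` has one row per sampled response. Averaging over all pairs
    is an assumption made for illustration.
    """
    x = np.asarray(embeddings, dtype=float)
    if x.ndim != 2 or x.shape[0] < 2:
        raise ValueError("need a 2-D array with at least two responses")
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    if np.any(norms == 0):
        raise ValueError("zero-length embedding")
    unit = x / norms
    sims = unit @ unit.T  # cosine similarity matrix
    # Keep each unordered pair once, excluding self-similarity on the diagonal.
    pairs = sims[np.triu_indices(x.shape[0], k=1)]
    return float(pairs.mean())


if __name__ == "__main__":
    # Toy values only; not data from the study.
    print(token_probability_scores(np.log([0.9, 0.5])))
    print(embedding_consistency([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]))
```

Higher values from either function are read as higher confidence. The raw similarity score is not itself a probability of correctness, which is consistent with the finding that the embedding-based metric discriminates well but needs recalibration.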
In summary, this study provides important guidance for the use of large language models in the medical field, particularly in how to effectively evaluate and calibrate model uncertainty.
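The two properties the study evaluates, discrimination and calibration, can be illustrated with a short sketch. Discrimination is measured here by ROC AUC, computed as the probability that a randomly chosen correct answer receives a higher confidence score than a randomly chosen incorrect one. Calibration is summarized by an expected calibration error with equal-width bins, which is an assumption chosen for illustration; the summary above does not state which calibration measure the paper used. The function names and toy inputs are hypothetical.

```python
import numpy as np


def roc_auc(confidences, correct):
    """ROC AUC via pairwise comparison; tied scores count as half."""
    s = np.asarray(confidences, dtype=float)
    y = np.asarray(correct, dtype=bool)
    pos, neg = s[y], s[~y]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need both correct and incorrect answers")
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean gap between accuracy and mean confidence per bin.

    Confidences are assumed to lie in [0, 1]; bins are equal-width.
    """
    s = np.asarray(confidences, dtype=float)
    y = np.asarray(correct, dtype=float)
    # Map each confidence to a bin index; 1.0 falls in the last bin.
    bins = np.minimum((s * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(y[mask].mean() - s[mask].mean())
    return float(ece)


if __name__ == "__main__":
    # Toy example of an over-confident metric: always 0.95, half the answers wrong.
    conf = [0.95, 0.95, 0.95, 0.95]
    right = [1, 0, 0, 1]
    print(roc_auc(conf, right))                     # no discrimination
    print(expected_calibration_error(conf, right))  # large calibration gap
```

A metric can score well on one property and poorly on the other: a score that ranks correct answers above incorrect ones has high ROC AUC even if its raw values are far from the observed accuracy, which is the situation in which a set of reference cases for recalibration becomes useful.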

Large Language Model Uncertainty Measurement and Calibration for Medical Diagnosis and Treatment

Large language model uncertainty proxies: discrimination and calibration for medical diagnosis and treatment.
