Benchmarking the Confidence of Large Language Models in Clinical Questions

Mahmud Omar Sr., Reem Agbareia, Benjamin S Glicksberg Sr., Girish Nadkarni, Eyal Klang
DOI: https://doi.org/10.1101/2024.08.11.24311810
2024-09-10
Abstract:

Background and Aim: Large language models (LLMs) show promise in healthcare, but their self-assessment capabilities remain unclear. This study evaluates the confidence levels and performance of 12 LLMs across five medical specialties to assess their ability to accurately judge their responses.

Methods: We used 1965 multiple-choice questions from internal medicine, obstetrics and gynecology, psychiatry, pediatrics, and general surgery. Models were prompted to provide answers and confidence scores. Performance and confidence were analyzed using chi-square tests and t-tests. Consistency across question versions was also evaluated.

Results: All models displayed high confidence regardless of answer correctness. Higher-tier models showed slightly better calibration, with a mean confidence of 72.5% for correct answers versus 69.4% for incorrect ones, compared to lower-tier models (79.6% vs 79.5%). The mean confidence difference between correct and incorrect responses ranged from 0.6% to 5.4% across all models. Four models showed significantly higher confidence when correct (p<0.01), but the difference remained small. Most models demonstrated consistency across question versions.

Conclusion: While newer LLMs show improved performance and consistency in medical knowledge tasks, their confidence levels remain poorly calibrated. The gap between performance and self-assessment poses risks in clinical applications. Until these models can reliably gauge their certainty, their use in healthcare should be limited and supervised by experts. Further research on human-AI collaboration and ensemble methods is needed for responsible implementation.

Keywords: Large Language Models (LLMs), Safe AI, AI Reliability, Clinical knowledge.
What problem does this paper attempt to address?
This paper evaluates the relationship between the self-reported confidence of large language models (LLMs) and their accuracy when answering clinical questions. Specifically, the researchers assess whether these models can accurately judge the correctness of their own answers by analyzing the performance of different LLMs across five medical specialties. The study found that although some better-performing models showed more consistent overall confidence levels, even for the most accurate models the difference in confidence between correct and incorrect answers was very small. This points to important limitations in the self-assessment of current LLMs and underscores the need for further research before they are integrated into clinical settings.

### Main research questions:

1. **Evaluating the self-confidence and accuracy of LLMs**: Understand the relationship between the confidence different LLMs report when answering clinical questions and their actual accuracy.
2. **Calibration of model self-confidence**: Explore whether these models can accurately judge whether their answers are correct or incorrect.
3. **Performance differences of different models**: Compare the performance of different LLMs across five medical specialties: internal medicine, obstetrics and gynecology, psychiatry, pediatrics, and general surgery.

### Research methods:

- **Data sources**: A dataset of 655 questions compiled by Katz et al. was used. Each question was rephrased twice through the GPT-4 API, yielding 1,965 questions in total.
- **Model settings**: Twelve LLMs were prompted to provide answers and corresponding confidence scores (0-100).
- **Statistical analysis**: Pearson's correlation coefficient was used to analyze the relationship between confidence and accuracy, the chi-square test was used to evaluate performance differences between models in each specialty, and the t-test was used to compare confidence between correct and incorrect answers.

### Main findings:

- **Inverse correlation**: The study found an inverse correlation between confidence and accuracy ($r = -0.40$, $p = 0.001$); that is, models with poorer performance tended to report higher confidence.
- **Small confidence difference**: Even for the most accurate models, such as GPT-4o, the confidence difference between correct and incorrect answers was only 5.4 percentage points.
- **Model performance differences**: GPT-4o and Claude 3.5 Sonnet performed best across multiple medical specialties, while models such as Qwen-2-7B performed poorly.

### Conclusions:

- **Insufficient self-confidence calibration**: Although better-performing models show more consistent overall confidence levels, the confidence difference between their correct and incorrect answers remains very small, which limits their use in clinical decision-making.
- **Necessity for further research**: Before these models are integrated into clinical settings, further research is required to improve the calibration of their self-reported confidence.

Through its experiments and data analysis, the paper highlights the potential risks and limitations of current LLMs in clinical medicine and provides a reference point for future research.
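The confidence comparisons summarized above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the choice of Welch's unequal-variance form of the t statistic, the decision to correlate per-model mean confidence with per-model accuracy, and all the data are assumptions made for illustration only.

```python
import math
from statistics import mean, variance


def confidence_gap(records):
    """Compare confidence on correct vs. incorrect answers for one model.

    records: list of (confidence_0_to_100, is_correct) pairs.
    Returns (mean_correct, mean_incorrect, gap, welch_t).
    Welch's t statistic is an assumption here; the summary only says "t-test".
    """
    correct = [c for c, ok in records if ok]
    incorrect = [c for c, ok in records if not ok]
    gap = mean(correct) - mean(incorrect)
    # Standard error of the difference in means, without assuming equal variances.
    se = math.sqrt(variance(correct) / len(correct)
                   + variance(incorrect) / len(incorrect))
    return mean(correct), mean(incorrect), gap, gap / se


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical toy data: these are NOT results from the paper.
hypothetical_models = {
    "model_a": [(80, True), (90, True), (85, True), (70, False), (80, False)],
    "model_b": [(90, True), (95, True), (90, False), (95, False), (85, False)],
    "model_c": [(75, True), (85, True), (80, True), (70, True),
                (65, False), (75, False)],
}

accuracies, mean_confidences = [], []
for name, records in hypothetical_models.items():
    m_ok, m_bad, gap, t = confidence_gap(records)
    accuracies.append(mean(1.0 if ok else 0.0 for _, ok in records))
    mean_confidences.append(mean(c for c, _ in records))
    print(f"{name}: correct={m_ok:.1f} incorrect={m_bad:.1f} "
          f"gap={gap:.1f} t={t:.2f}")

# Correlation between per-model accuracy and per-model mean confidence.
print(f"r = {pearson_r(accuracies, mean_confidences):.2f}")
```

A well-calibrated model would show a large gap between the two means. A gap of a few percentage points, as reported in the summary, means the confidence score carries little information about correctness even when the t-test is statistically significant.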