Abstract: Improving healthcare quality and access remains a critical concern for countries worldwide. Consequently, the rise of large language models (LLMs) has sparked a wealth of discussion around healthcare applications among researchers and consumers alike. While the ability of these models to pass medical exams has been used to argue in favour of their use in medical training and diagnosis, the impact of their inevitable use as a self-diagnostic tool and their role in spreading healthcare misinformation has not been evaluated. In this work, we critically evaluate LLMs' capabilities through the lens of a general user self-diagnosing, as well as the means through which LLMs may aid in the spread of medical misinformation. To accomplish this, we develop a testing methodology which can be used to evaluate responses to open-ended questions mimicking real-world use cases. In doing so, we reveal that a) these models perform worse than previously known, and b) they exhibit peculiar behaviours, including overconfidence when stating incorrect recommendations, which increases the risk of spreading medical misinformation.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to evaluate the performance of large language models (LLMs) in medical self-diagnosis and their potential risks in spreading medical misinformation. Specifically, the authors focus on the following aspects:

1. **Evaluating the actual performance of LLMs in medical self-diagnosis**:
   - The authors believe that most current studies assess LLMs' performance in medical exams by having them choose from multiple options, which does not reflect LLMs' ability to handle open-ended questions in real-world scenarios.
   - To more realistically simulate the scenario where ordinary users use LLMs for self-diagnosis, the authors designed a new testing method that does not provide options but requires LLMs to directly answer open-ended questions.
2. **Exploring the risks of LLMs in spreading medical misinformation**:
   - The authors found that LLMs might exhibit overconfidence when answering medical questions, even if their recommendations are incorrect. This behavior increases the risk of spreading medical misinformation.
   - The authors also evaluated LLMs' performance in self-assessing their answers and found that LLMs lack the ability to verify their own answers, further exacerbating the risk of misinformation spread.
3. **Developing a repeatable evaluation method**:
   - The authors developed a repeatable method for evaluating LLMs' performance in medical diagnosis. This method is not only applicable to the USMLE dataset but can also be extended to other datasets.
   - By involving non-medical experts in the evaluation, the authors simulated the self-diagnosis process of ordinary users, thereby revealing potential issues in the practical application of LLMs.

### Main Conclusions

- **LLMs perform poorly without options**: When no options are provided, LLMs' performance in answering medical questions significantly declines.
- **LLMs lack uncertainty expression**: LLMs often do not express uncertainty or provide disclaimers when answering medical questions, increasing the risk of users believing incorrect information.
- **LLMs lack self-verification ability**: When asked to evaluate their own answers, LLMs show a lack of confidence in their answers, indicating a deficiency in verifying information accuracy.
- **A repeatable evaluation method was developed**: The method proposed by the authors can be used to evaluate other LLMs' performance in medical diagnosis, providing a reference for future related research.

### Significance

- **Implications for clinicians**: This paper provides a new evaluation method that can help clinicians better understand the limitations of LLMs in medical applications.
- **Implications for machine learning researchers**: This paper emphasizes the need to improve the interpretability and explainability of LLMs, especially in medical applications.
- **Warnings for ordinary users**: This paper reminds ordinary users to be cautious when using LLMs for self-diagnosis and to be aware of the potential risks of misinformation.

In summary, this paper provides an in-depth evaluation of LLMs' performance in medical self-diagnosis, reveals their potential risks in spreading medical misinformation, and proposes a repeatable evaluation method, offering important references for future research and applications.
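
The contrast between multiple-choice and open-ended evaluation described above can be illustrated with a minimal Python sketch. This is not the authors' code: the `MCQItem` type, the prompt-building helpers, and the `accuracy` function are hypothetical names chosen for illustration, and the sketch assumes that correctness of each open-ended response is judged by a human evaluator and supplied as a boolean.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MCQItem:
    """A hypothetical exam-style item: question stem, lettered options, answer key."""
    question: str
    options: Dict[str, str]
    answer_key: str


def to_multiple_choice_prompt(item: MCQItem) -> str:
    """Build the conventional prompt: question stem followed by lettered options."""
    option_lines = [f"{key}. {text}" for key, text in sorted(item.options.items())]
    return "\n".join([item.question] + option_lines)


def to_open_ended_prompt(item: MCQItem) -> str:
    """Build the open-ended prompt: the question stem only, with no options shown."""
    return item.question


def accuracy(judgements: List[bool]) -> float:
    """Fraction of responses judged correct (judgements assumed to come from human raters)."""
    if not judgements:
        raise ValueError("at least one judgement is required")
    return sum(judgements) / len(judgements)


# Illustrative item only; not taken from any real dataset.
item = MCQItem(
    question="A patient reports a sore throat and fever. What is the most likely cause?",
    options={"A": "Viral pharyngitis", "B": "Fractured rib", "C": "Appendicitis"},
    answer_key="A",
)
mc_prompt = to_multiple_choice_prompt(item)
open_prompt = to_open_ended_prompt(item)
```

Each prompt variant would then be sent to the model under test, and the share of responses rated correct in each condition can be compared with `accuracy`.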

Self-Diagnosis and Large Language Models: A New Front for Medical Misinformation

Language models are susceptible to incorrect patient self-diagnosis in medical applications

Large language models propagate race-based medicine

Evaluating large language models in medical applications: a survey

Evaluating large language models on medical, lay language, and self-reported descriptions of genetic conditions

Large language models in medicine: the potentials and pitfalls

Large language models in medical and healthcare fields: applications, advances, and challenges

Demystifying Large Language Models for Medicine: A Primer

Benchmarking the Confidence of Large Language Models in Clinical Questions

Evaluating Anti-LGBTQIA+ Medical Bias in Large Language Models

A Comprehensive Evaluation of Large Language Models on Mental Illnesses

Evaluating large language models for use in healthcare: A framework for translational value assessment

Large Language Model Prompting Techniques for Advancement in Clinical Medicine

Large language models encode clinical knowledge

Digital Diagnostics: The Potential Of Large Language Models In Recognizing Symptoms Of Common Illnesses

Understanding the concerns and choices of public when using large language models for healthcare

The long but necessary road to responsible use of large language models in healthcare research

A Survey of Large Language Models for Healthcare: from Data, Technology, and Applications to Accountability and Ethics

Large language models in healthcare and medical domain: A review

The future landscape of large language models in medicine