Combining Insights From Multiple Large Language Models Improves Diagnostic Accuracy

Gioele Barabucci, Victor Shia, Eugene Chu, Benjamin Harack, Nathan Fu
2024-02-14
Abstract: Background: Large language models (LLMs) such as OpenAI's GPT-4 or Google's PaLM 2 are proposed as viable diagnostic support tools or even spoken of as replacements for "curbside consults". However, even LLMs specifically trained on medical topics may lack sufficient diagnostic accuracy for real-life applications.
Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the limited accuracy of large language models (LLMs) in medical diagnosis and explores whether collective intelligence methods can improve it. The study notes that even LLMs trained specifically on medical topics may lack sufficient diagnostic accuracy for practical use, which limits their adoption in the medical field. To address this, the researchers aggregate the diagnoses that multiple different LLMs produce for the same clinical case into a single combined diagnosis. According to the authors, this approach improves diagnostic accuracy and also reduces dependence on a single commercial provider, which may in turn increase medical practitioners' trust in LLMs. Aggregating the diagnoses of three LLMs yielded an average accuracy of 75.3%, compared with an average of 59.0% for a single LLM, a gain of 16.3 percentage points. Even when GPT-4, the best-performing model, is excluded, the method still maintains high diagnostic accuracy. In summary, the study demonstrates that combining the diagnostic opinions of multiple LLMs with collective intelligence techniques is a feasible and effective way to improve overall diagnostic quality.
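
The summary above does not spell out how the individual diagnoses are combined, so the following is only a minimal sketch of one plausible aggregation rule, not the authors' method. It assumes each model returns a ranked differential diagnosis and gives every diagnosis 1/rank points per model (a reciprocal-rank weighting chosen purely for illustration). The function name `aggregate_differentials`, the model labels, and the example diagnoses are all hypothetical.

```python
from collections import defaultdict


def aggregate_differentials(differentials, top_k=5):
    """Merge ranked diagnosis lists from several models into one ranked list.

    Assumption (not from the paper): a diagnosis at position `rank` in a
    model's list contributes 1/rank points, and points are summed over models.
    """
    scores = defaultdict(float)
    for ranked in differentials.values():
        seen = set()
        for rank, diagnosis in enumerate(ranked, start=1):
            key = diagnosis.strip().lower()  # naive normalization of labels
            if key in seen:
                continue  # count each diagnosis once per model
            seen.add(key)
            scores[key] += 1.0 / rank
    # Highest score first; ties are broken alphabetically for determinism.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ordered[:top_k]]


# Hypothetical outputs of three models for the same clinical case.
example = {
    "model_a": ["pulmonary embolism", "pneumonia", "pericarditis"],
    "model_b": ["pneumonia", "pulmonary embolism", "pleurisy"],
    "model_c": ["pulmonary embolism", "pericarditis", "pneumonia"],
}
combined = aggregate_differentials(example)
print(combined)
```

In this sketch a diagnosis that several models rank highly rises to the top of the combined list even if no single model is authoritative, which is the intuition behind the collective intelligence approach. A real pipeline would also need to map synonymous diagnoses to a common label before scoring; the lowercase normalization here is only a placeholder for that step.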