Human-AI collectives produce the most accurate differential diagnoses

N. Zöller, J. Berger, I. Lin, N. Fu, J. Komarneni, G. Barabucci, K. Laskowski, V. Shia, B. Harack, E. A. Chu, V. Trianni, R. H.J.M. Kurvers, S. M. Herzog
2024-06-21
Abstract: Artificial intelligence systems, particularly large language models (LLMs), are increasingly being employed in high-stakes decisions that impact both individuals and society at large, often without adequate safeguards to ensure safety, quality, and equity. Yet LLMs hallucinate, lack common sense, and are biased - shortcomings that may reflect LLMs' inherent limitations and thus may not be remedied by more sophisticated architectures, more data, or more human feedback. Relying solely on LLMs for complex, high-stakes decisions is therefore problematic. Here we present a hybrid collective intelligence system that mitigates these risks by leveraging the complementary strengths of human experience and the vast information processed by LLMs. We apply our method to open-ended medical diagnostics, combining 40,762 differential diagnoses made by physicians with the diagnoses of five state-of-the-art LLMs across 2,133 medical cases. We show that hybrid collectives of physicians and LLMs outperform both single physicians and physician collectives, as well as single LLMs and LLM ensembles. This result holds across a range of medical specialties and professional experience, and can be attributed to humans' and LLMs' complementary contributions that lead to different kinds of errors. Our approach highlights the potential for collective human and machine intelligence to improve accuracy in complex, open-ended domains like medical diagnostics.
Artificial Intelligence, Human-Computer Interaction
What problem does this paper attempt to address?
This paper examines the use of artificial intelligence (AI) systems, particularly large language models (LLMs), in medical diagnosis, and asks how diagnostic accuracy can be improved by combining them with human expert knowledge. The authors argue that relying solely on LLMs is risky because LLMs hallucinate, lack common sense, and are biased, all of which can compromise medical decision-making. To mitigate these risks, the paper proposes a hybrid collective intelligence system that combines physicians' experience with the vast amount of information processed by LLMs. The researchers combined 40,762 differential diagnoses made by physicians with the diagnoses of five state-of-the-art LLMs across 2,133 medical cases. Hybrid collectives of physicians and LLMs outperformed single physicians and physician collectives, as well as single LLMs and LLM ensembles. This result held across medical specialties and levels of professional experience, and is attributed to the complementary contributions of humans and LLMs, which make different kinds of errors. The study also notes that in cases where LLMs perform poorly, physicians are often able to provide the correct diagnosis, which underlines the importance of keeping experts involved even with strong AI support. More generally, the paper offers a method for integrating human expert and LLM responses on open-ended, complex problems such as medical diagnosis. In summary, the paper addresses how integrating human and AI judgments can improve the accuracy and safety of medical diagnosis, and demonstrates the potential of this approach for medical practice.
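To make the idea of pooling physician and LLM differentials concrete, here is a minimal sketch of one possible aggregation rule. It is an assumption for illustration, not the method reported in the paper: each contributor submits a ranked list of diagnoses, every diagnosis earns a score of weight / rank, and the collective differential is the list of highest-scoring diagnoses. The function name `aggregate_differentials`, its parameters, and the example diagnoses are all hypothetical, and lowercasing is only a stand-in for proper normalization of free-text diagnoses.

```python
from collections import defaultdict


def aggregate_differentials(differentials, weights=None, top_k=5):
    """Pool ranked diagnosis lists into one collective ranked list.

    Illustrative rule (an assumption, not the paper's method):
    a diagnosis at rank r in a contributor's list earns weight / r.

    differentials: list of ranked lists of diagnosis strings, best first.
    weights: optional per-contributor weights; defaults to 1.0 each.
    top_k: number of diagnoses to keep in the collective differential.
    """
    if weights is None:
        weights = [1.0] * len(differentials)
    if len(weights) != len(differentials):
        raise ValueError("weights must match the number of contributors")

    scores = defaultdict(float)
    for ranked, weight in zip(differentials, weights):
        seen = set()
        for rank, diagnosis in enumerate(ranked, start=1):
            # Crude normalization; real free-text diagnoses would need
            # mapping to a shared vocabulary before they can be pooled.
            key = diagnosis.strip().lower()
            if key in seen:
                continue  # count each diagnosis once per contributor
            seen.add(key)
            scores[key] += weight / rank

    # Highest score first; ties are broken alphabetically for determinism.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [diagnosis for diagnosis, _ in ordered[:top_k]]


if __name__ == "__main__":
    # Made-up example lists, not data from the study.
    physicians = [
        ["pneumonia", "bronchitis"],
        ["pulmonary embolism", "pneumonia"],
    ]
    llms = [
        ["Pneumonia", "lung cancer"],
        ["bronchitis", "pneumonia"],
    ]
    print(aggregate_differentials(physicians + llms, top_k=3))
    # → ['pneumonia', 'bronchitis', 'pulmonary embolism']
```

A rule of this kind rewards diagnoses that several contributors rank highly, so an error made by one contributor is outvoted when the others make different errors. The `weights` argument leaves room to weight physicians and LLMs differently, which is again an illustrative design choice rather than something stated in the abstract.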