Testing the Limits of Language Models: A Conversational Framework for Medical AI Assessment
Shreya Johri, Jaehwan Jeong, Benjamin A. Tran, Daniel I. Schlessinger, Shannon Wongvibulsin, Zhuo Ran Cai, Roxana Daneshjou, Pranav Rajpurkar
DOI: https://doi.org/10.1101/2023.09.12.23295399
2023-09-13
medRxiv
Abstract: Large Language Models (LLMs) show promise for medical diagnosis, but traditional evaluations using static exam questions overlook the complexity of real-world clinical dialogues. We introduce a multi-agent conversational framework in which doctor-AI and patient-AI agents interact to diagnose medical conditions, with the resulting diagnoses evaluated by a grader-AI agent and medical experts. We assessed the diagnostic accuracy of GPT-4 and GPT-3.5 in conversational versus static settings using 140 cases focusing on skin diseases. Diagnostic accuracy declined in the conversational setting relative to the static one, unmasking key limitations in LLMs' ability to integrate details from conversational interactions to improve diagnostic performance. We introduced Conversational Summarization, a technique that enhanced performance. Expert review identified deficiencies compared to human dermatologists in comprehensive history gathering, appropriate use of terminology, and reliability. Our findings advocate for nuanced, rigorous evaluation of LLMs before clinical integration, and our framework represents a significant advancement toward responsible testing methodologies in medicine.
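The doctor, patient, and grader roles named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: every function name, the scripted questions, the toy case, and the keyword-matching logic below are hypothetical stand-ins for LLM calls. The abstract does not define Conversational Summarization; the sketch assumes, from the name alone, that the dialogue is condensed into a short summary before a diagnosis is produced.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Turn = Tuple[str, str]  # (speaker, utterance)


@dataclass
class Case:
    facts: Dict[str, str]   # hypothetical: question keyword -> patient's answer
    true_diagnosis: str     # ground-truth label used by the grader


def patient_agent(case: Case, question: str) -> str:
    """Stand-in for a patient-AI: answers only from the case facts."""
    for keyword, answer in case.facts.items():
        if keyword in question.lower():
            return answer
    return "I'm not sure."


def doctor_agent(history: List[Turn]) -> Optional[str]:
    """Stand-in for a doctor-AI: asks scripted questions, then stops."""
    scripted = ["Where is the rash?", "How long have you had it?", "Does it itch?"]
    asked = sum(1 for speaker, _ in history if speaker == "doctor")
    return scripted[asked] if asked < len(scripted) else None


def summarize(history: List[Turn]) -> str:
    """Assumed summarization step: condense the dialogue to the patient's answers."""
    return " ".join(text for speaker, text in history if speaker == "patient")


def diagnose(summary: str) -> str:
    """Toy rule in place of a model; 'condition A' is a placeholder label."""
    text = summary.lower()
    return "condition A" if "itch" in text and "elbow" in text else "unknown"


def grader_agent(predicted: str, truth: str) -> bool:
    """Stand-in for a grader-AI: exact match after normalization."""
    return predicted.strip().lower() == truth.strip().lower()


def run_case(case: Case) -> Tuple[bool, List[Turn]]:
    """Run the doctor/patient dialogue, summarize it, diagnose, and grade."""
    history: List[Turn] = []
    while (question := doctor_agent(history)) is not None:
        history.append(("doctor", question))
        history.append(("patient", patient_agent(case, question)))
    prediction = diagnose(summarize(history))
    return grader_agent(prediction, case.true_diagnosis), history


case = Case(
    facts={
        "where": "On the inside of my elbows.",
        "how long": "About three weeks.",
        "itch": "Yes, it itches a lot.",
    },
    true_diagnosis="condition A",
)
correct, history = run_case(case)
```

In a real evaluation each stand-in would be replaced by a model call, and accuracy would be the fraction of cases for which the grader returns `True`. Separating the summarization step from the diagnosis step makes it possible to compare accuracy with and without the summary.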