Evaluating the Impact of a Specialized LLM on Physician Experience in Clinical Decision Support: A Comparison of Ask Avo and ChatGPT-4

Daniel Jung, Alex Butler, Joongheum Park, Yair Saperstein
2024-09-07
Abstract: The use of large language models (LLMs) to augment clinical decision support systems is a topic of rapidly growing interest, but current shortcomings such as hallucinations and a lack of clear source citations make them unreliable for use in the clinical environment. This study evaluates Ask Avo, an LLM-derived software by AvoMD that incorporates a proprietary Language Model Augmented Retrieval (LMAR) system, in-built visual citation cues, and prompt engineering designed for interactions with physicians, against ChatGPT-4 in end-user experience for physicians in a simulated clinical scenario environment. Eight clinical questions derived from medical guideline documents in various specialties were posed to both models by 62 study participants, with each response rated from 1 to 5 on trustworthiness, actionability, relevancy, comprehensiveness, and friendly format. Ask Avo significantly outperformed ChatGPT-4 on all criteria: trustworthiness (4.52 vs. 3.34, p<0.001), actionability (4.41 vs. 3.19, p<0.001), relevancy (4.55 vs. 3.49, p<0.001), comprehensiveness (4.50 vs. 3.37, p<0.001), and friendly format (4.52 vs. 3.60, p<0.001). Our findings suggest that specialized LLMs designed with the needs of clinicians in mind can offer substantial improvements in user experience over general-purpose LLMs. Ask Avo's evidence-based approach tailored to clinician needs shows promise for the adoption of LLM-augmented clinical decision support software.
Human-Computer Interaction, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the use of large language models (LLMs) in clinical decision support systems. Although LLMs can process vast amounts of information and quickly surface relevant insights, they have shortcomings, such as hallucinations (responses that seem plausible but are actually incorrect or irrelevant) and unclear citation of sources, that make them unreliable in clinical environments. Specifically, the paper evaluates Ask Avo, an LLM-derived software designed for clinical decision support, by comparing it with ChatGPT-4. Ask Avo combines a proprietary Language Model Augmented Retrieval (LMAR) system, in-built visual citation cues, and prompt engineering tailored for physician interaction. The main goal of the study is to compare the physician user experience of Ask Avo with that of a general-purpose LLM, ChatGPT-4, in a simulated clinical scenario environment, focusing on differences in trustworthiness, actionability, relevancy, comprehensiveness, and friendly format.