Large language models in solving clinical dilemmas - advantages and drawbacks

Obla Amar Bapu, D.; Juan, J.; Gahleitner, F.; Macleod, K.; Urquhart, D. S.; Mcdougall, C.; Unger, S. A.; Armstrong, D.; Narayanan, M.; Juan, J.
DOI: https://doi.org/10.1183/13993003.congress-2024.pa4379
IF: 24.3
2024-11-01
European Respiratory Journal
Abstract:

Background: Large language models (LLMs) have shown potential to assist clinical decisions, but concerns about their underlying mechanism (the 'black box phenomenon') have provoked unease.

Aims: To explore the LLM black box in the context of complex decision making in paediatric respiratory medicine (PRM) by comparison against trainee doctors (TDs), qualitative assessment of responses and a deep-dive into their sources.

Methods: Six complex PRM scenarios were posed to 10 TDs and 3 LLMs. Six PRM experts provided detailed comments on the responses, which were analysed qualitatively using DisplayR™. Screen recordings of the TDs' searches and the sources provided by the LLMs were analysed for further insight.

Results: Word-cloud analysis shows features of LLM and TD responses (Table). ChatGPT™ provided useful responses but did not quote sources. TDs sourced responses from patient-facing websites, abstracts and review articles (Figure), while Bard™ and Bing™ quoted journals or evidence-based reviews.

Table:

                 ChatGPT   Bard        Bing      TDs
Structure        Good      Good        Adequate  Poor
Omissions        No        Occasional  Yes       Variable
New Advances     Pre-2021  Yes         Yes       Variable
Incorrect        No        No          No        No
Sentiment score  0.85      -0.15       -1.0      -0.34

Conclusion: LLMs (particularly ChatGPT and Bard) already surpass TDs in quality of responses. We demonstrate the utility of LLMs for non-expert clinicians faced with complex medical scenarios. We explore the drawbacks of the individual LLMs while noting that this is a rapidly changing area.
respiratory system