Performance of ChatGPT-4o in Real-Time Medical Consultation for Retroperitoneal Fibrosis Patients Under Doctor Supervision: A Cross-Sectional Study in a Chinese Clinical Setting (Preprint)

Hui Gao, Wuji Zhang, Shibo Liu, Yuanning Li, Yingxi Zhu, Ting Long, Ruohan Yu, Qian Guo, Yadan Zou, Ji Li, Lina Zhang, Cui Yang, Yubing Tong, Xuewu Zhang
DOI: https://doi.org/10.2196/preprints.64700
2024-01-01
Abstract: Large language models (LLMs) such as GPT-4 show promise in medical consultations but face challenges in non-English or real-time contexts. The new GPT-4o, with improved text processing and faster responses, may better address rare diseases such as retroperitoneal fibrosis (RPF). The performance of GPT-4o in providing real-time medical consultations for patients with rare diseases, which are generally a challenge in clinical practice, remains underexplored. We evaluated the competency of GPT-4o in responding to questions about RPF, a rare autoimmune disease, in terms of accuracy, completeness, readability, and quality, each rated on a 7-point Likert scale. A total of 103 real-world queries from patients with RPF were collected from diverse sources. Responses were generated using the newly released version of GPT-4o (2024/5/17). The questions were also stratified and randomly divided into six groups. Each of six attending rheumatologists was assigned one set of questions to answer and then generated new responses with the assistance of GPT-4o. All responses were assessed blindly by three experts in RPF. GPT-4o scored significantly higher than the rheumatologists in accuracy (6.39 ± 0.50 vs. 4.99 ± 0.62), completeness (6.51 ± 0.44 vs. 4.55 ± 0.60), readability (6.45 ± 0.42 vs. 4.93 ± 0.59), and quality (6.42 ± 0.46 vs. 4.78 ± 0.55) (p < 0.001). Rheumatologists assisted by GPT-4o performed better than rheumatologists alone (accuracy: 6.13 ± 0.63; completeness: 5.99 ± 0.81; readability: 6.05 ± 0.67; quality: 6.01 ± 0.71; p < 0.001), and physician revisions generally reduced the competency of GPT-4o. Subgroup analysis showed no significant difference in accuracy between GPT-4o and rheumatologists assisted by GPT-4o in answering complex questions, but any type of revision lowered the competency of GPT-4o. GPT-4o has the potential to provide real-time medical consultations for RPF in the Chinese clinical environment.