Development and evaluation of a large language model of ophthalmology in Chinese

Ce Zheng,Hongfei Ye,Jinming Guo,Junrui Yang,Ping Fei,Yuanzhi Yuan,Danqing Huang,Yuqiang Huang,Jie Peng,Xiaoling Xie,Meng Xie,Peiquan Zhao,Li Chen,Mingzhi Zhang
DOI: https://doi.org/10.1136/bjo-2023-324526
2024-07-17
Abstract:Background: Large language models (LLMs), such as ChatGPT, have considerable implications for various medical applications. However, ChatGPT's training primarily draws from English-centric internet data and is not tailored explicitly to the medical domain. Thus, an ophthalmic LLM in Chinese is clinically essential for both healthcare providers and patients in mainland China. Methods: We developed an LLM of ophthalmology (MOPH) using Chinese corpora and evaluated its performance in three clinical scenarios: ophthalmic board exams in Chinese, answering evidence-based medicine-oriented ophthalmic questions and diagnostic accuracy for clinical vignettes. Additionally, we compared MOPH's performance to that of human doctors. Results: In the ophthalmic exam, MOPH's average score closely aligned with the mean score of trainees (64.7 (range 62-68) vs 66.2 (range 50-92), p=0.817), but achieving a score above 60 in all seven mock exams. In answering ophthalmic questions, MOPH demonstrated an adherence of 83.3% (25/30) of responses following Chinese guidelines (Likert scale 4-5). Only 6.7% (2/30, Likert scale 1-2) and 10% (3/30, Likert scale 3) of responses were rated as 'poor or very poor' or 'potentially misinterpretable inaccuracies' by reviewers. In diagnostic accuracy, although the rate of correct diagnosis by ophthalmologists was superior to that by MOPH (96.1% vs 81.1%, p>0.05), the difference was not statistically significant. Conclusion: This study demonstrated the promising performance of MOPH, a Chinese-specific ophthalmic LLM, in diverse clinical scenarios. MOPH has potential real-world applications in Chinese-language ophthalmology settings.
What problem does this paper attempt to address?