OphGLM: Training an Ophthalmology Large Language-and-Vision Assistant based on Instructions and Dialogue

Weihao Gao, Zhuo Deng, Zhiyuan Niu, Fuju Rong, Chucheng Chen, Zheng Gong, Wenze Zhang, Daimin Xiao, Fang Li, Zhenjie Cao, Zhaoyi Ma, Wenbin Wei, Lan Ma
2023-06-22
Abstract: Large multimodal language models (LMMs) have achieved significant success in general domains. However, due to the significant differences between medical images and text and general web content, the performance of LMMs in medical scenarios is limited. In ophthalmology, clinical diagnosis relies on multiple modalities of medical images, but unfortunately, multimodal ophthalmic large language models have not been explored to date. In this paper, we study and construct an ophthalmic large multimodal model. Firstly, we use fundus images as an entry point to build a disease assessment and diagnosis pipeline to achieve common ophthalmic disease diagnosis and lesion segmentation. Then, we establish a new ophthalmic multimodal instruction-following and dialogue fine-tuning dataset based on disease-related knowledge data and publicly available real-world medical dialogue. We introduce visual ability into the large language model to complete the ophthalmic large language and vision assistant (OphGLM). Our experimental results demonstrate that the OphGLM model performs exceptionally well, and it has the potential to revolutionize clinical applications in ophthalmology. The dataset, code, and models will be made publicly available at https://github.com/ML-AILab/OphGLM.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem this paper addresses is the limited performance of current large multimodal language models (LMMs) in ophthalmic medical scenarios. Because medical images and text differ significantly from general web content, existing LMMs may produce inaccurate or incorrect answers in professional medical conversations. In ophthalmology in particular, clinical diagnosis depends on multiple imaging modalities (such as fundus images and OCT), yet no multimodal large language model for ophthalmology has been explored to date. To address this, the authors set the following goals:

1. **Construct an ophthalmic multimodal large model**: Use fundus images as an entry point to establish a disease assessment and diagnosis pipeline that covers diagnosis of common ophthalmic diseases and lesion segmentation.
2. **Create a new instruction and dialogue fine-tuning dataset**: Based on disease-related knowledge data and publicly available real-world medical conversations, establish a dataset for ophthalmic multimodal instruction following and dialogue fine-tuning.
3. **Introduce visual capabilities**: Add visual capabilities to a large language model to build the ophthalmic large language-and-vision assistant (OphGLM).

Through these efforts, the authors hope that OphGLM will perform well in ophthalmic clinical applications and may revolutionize ophthalmic clinical practice.

### Specific Problem Summary

- **Limitations of Existing Models**: Existing large multimodal language models perform poorly on medical images and text, especially in ophthalmology, because this material differs significantly from general web content.
- **Lack of Specialized Ophthalmic Multimodal Models**: There is currently no multimodal large language model specialized for ophthalmology, so advanced natural language processing techniques cannot be fully used in ophthalmic clinical diagnosis.
- **Improve the Professionalism and Accuracy of the Model**: By combining real-world medical conversations with medical knowledge graphs, improve the model's professionalism and accuracy in ophthalmology, and thereby better support clinical diagnosis and patient consultation.

### Solutions

- **Construct a Fundus Image Diagnosis Pipeline**: Develop a computer vision model that processes fundus images and performs disease classification and lesion segmentation.
- **Create a High-Quality Dialogue Dataset**: Based on publicly available real doctor-patient conversation data, construct a high-quality dialogue dataset to improve the model's dialogue ability and professionalism.
- **Fuse Visual and Language Models**: Combine the visual model with the large language model to form a multimodal ophthalmic assistant (OphGLM) that can process image inputs and generate high-quality answers.

Through these methods, the authors hope to significantly improve the performance of automated diagnosis and consultation systems in ophthalmology, providing more accurate and efficient assistance to doctors and patients.
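One way to picture the step of combining a vision model with a language model is to serialize the vision model's outputs into text that the language model can condition on. The sketch below is only an illustration under that assumption; the summary does not say how OphGLM actually fuses the two models. The names `FundusFindings` and `findings_to_prompt`, the 0.5 threshold, and the prompt layout are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class FundusFindings:
    """Hypothetical container for the outputs of a fundus-image vision model."""
    disease_probs: Dict[str, float]         # disease name -> predicted probability
    lesion_area_fraction: Dict[str, float]  # lesion type -> fraction of image pixels


def findings_to_prompt(findings: FundusFindings, question: str,
                       threshold: float = 0.5) -> str:
    """Serialize vision outputs into text a language model could condition on.

    The threshold and the prompt layout are assumptions made for illustration.
    """
    # Keep diseases at or above the threshold, most probable first.
    likely = sorted(
        (name for name, p in findings.disease_probs.items() if p >= threshold),
        key=lambda name: -findings.disease_probs[name],
    )
    # Keep lesion types that cover a non-zero share of the image.
    lesions = [name for name, frac in findings.lesion_area_fraction.items()
               if frac > 0.0]
    lines = [
        "Fundus image findings:",
        "- Suspected diseases: " + (", ".join(likely) if likely else "none above threshold"),
        "- Lesions detected: " + (", ".join(lesions) if lesions else "none"),
        "Patient question: " + question,
    ]
    return "\n".join(lines)
```

A caller would pass the resulting string to whichever language model is in use. Keeping the vision stage and the prompt construction separate means either side can be replaced independently.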