Abstract: Large multimodal language models (LMMs) have achieved significant success in general domains. However, due to the significant differences between medical images and text and general web content, the performance of LMMs in medical scenarios is limited. In ophthalmology, clinical diagnosis relies on multiple modalities of medical images, but unfortunately, multimodal ophthalmic large language models have not been explored to date. In this paper, we study and construct an ophthalmic large multimodal model. Firstly, we use fundus images as an entry point to build a disease assessment and diagnosis pipeline to achieve common ophthalmic disease diagnosis and lesion segmentation. Then, we establish a new ophthalmic multimodal instruction-following and dialogue fine-tuning dataset based on disease-related knowledge data and publicly available real-world medical dialogue. We introduce visual ability into the large language model to complete the ophthalmic large language and vision assistant (OphGLM). Our experimental results demonstrate that the OphGLM model performs exceptionally well, and it has the potential to revolutionize clinical applications in ophthalmology. The dataset, code, and models will be made publicly available at https://github.com/ML-AILab/OphGLM.

What problem does this paper attempt to address?

The problem this paper attempts to solve is the limited performance of current large multimodal language models (LMMs) in ophthalmic medical scenarios. Because medical images and text differ significantly from general web content, existing LMMs may produce inaccurate or incorrect answers when handling professional medical conversations. In ophthalmology in particular, clinical diagnosis depends on multimodal medical images (such as fundus images and OCT), yet no large multimodal language model for ophthalmology has been explored so far. To address this, the authors set the following goals:

1. **Construct an ophthalmic multimodal large model**: Use fundus images as an entry point to establish a disease assessment and diagnosis pipeline that achieves diagnosis of common ophthalmic diseases and lesion segmentation.
2. **Create a new instruction and dialogue fine-tuning dataset**: Based on disease-related knowledge data and publicly available real-world medical conversations, establish a dataset for ophthalmic multimodal instruction following and dialogue fine-tuning.
3. **Introduce visual capabilities**: Introduce visual capabilities into a large language model to complete the ophthalmic language-and-vision assistant (OphGLM).

Through these efforts, the authors hope that the OphGLM model can perform well in ophthalmic clinical applications and may revolutionize ophthalmic clinical practice.

### Specific Problem Summary

- **Limitations of Existing Models**: Existing large multimodal language models perform poorly when handling medical images and text, especially in ophthalmology, because medical images and text differ significantly from general web content.
- **Lack of Specialized Ophthalmic Multimodal Models**: There are currently no specialized multimodal large language models for ophthalmology, so advanced natural language processing techniques cannot be fully utilized in ophthalmic clinical diagnosis.
- **Improve the Professionalism and Accuracy of the Model**: By combining real-world medical conversations and medical knowledge graphs, improve the professionalism and accuracy of the model in ophthalmology, thereby better supporting clinical diagnosis and patient consultation.

### Solutions

- **Construct a Fundus Image Diagnosis Pipeline**: Develop a computer vision model that can process fundus images and perform disease classification and lesion segmentation.
- **Create a High-Quality Dialogue Dataset**: Based on publicly available real doctor-patient conversation data, construct a high-quality dialogue dataset to improve the model's dialogue ability and professionalism.
- **Fuse Visual and Language Models**: Combine the visual model with the large language model to form a multimodal ophthalmic assistant (OphGLM) that can process image inputs and generate high-quality answers.

Through these methods, the authors hope to significantly improve the performance of automated diagnosis and consultation systems in ophthalmology, providing more accurate and efficient assistance to doctors and patients.
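The fusion step described above, where a fundus image pipeline feeds a language model, could look roughly like the following. This is only a minimal sketch under stated assumptions, not the authors' implementation: the `FundusFindings` container, the `build_prompt` helper, the prompt wording, and the example values are all hypothetical and chosen for illustration. It assumes the vision pipeline has already produced a disease label, a confidence score, and a list of lesion types.

```python
from dataclasses import dataclass, field


@dataclass
class FundusFindings:
    """Hypothetical container for the output of a fundus image pipeline."""
    disease: str        # predicted disease label from the classifier
    confidence: float   # classifier confidence in [0, 1]
    lesions: list[str] = field(default_factory=list)  # lesion types from segmentation


def build_prompt(findings: FundusFindings, question: str) -> str:
    """Fold the structured vision output into a text prompt for a language model."""
    lesion_text = ", ".join(findings.lesions) if findings.lesions else "none detected"
    return (
        "Fundus image report:\n"
        f"- Suspected disease: {findings.disease} (confidence {findings.confidence:.2f})\n"
        f"- Lesions: {lesion_text}\n"
        f"Patient question: {question}\n"
        "Answer as an ophthalmology assistant."
    )


# Example with made-up values (illustrative only, not results from the paper).
findings = FundusFindings("diabetic retinopathy", 0.87, ["hemorrhage", "hard exudate"])
prompt = build_prompt(findings, "Do I need to see a specialist?")
print(prompt)
```

Passing the findings as text keeps the vision and language components loosely coupled in this sketch; whether OphGLM fuses them this way or through learned visual features is not stated in the summary above.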

OphGLM: Training an Ophthalmology Large Language-and-Vision Assistant based on Instructions and Dialogue

OphGLM: An ophthalmology large language-and-vision assistant

Ophtha-LLaMA2: A Large Language Model for Ophthalmology

LMOD: A Large Multimodal Ophthalmology Dataset and Benchmark for Large Vision-Language Models

GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI

Development and evaluation of a large language model of ophthalmology in Chinese

EyeGPT: Ophthalmic Assistant with Large Language Models

Parameter-Efficient Fine-Tuning Medical Multimodal Large Language Models for Medical Visual Grounding

Qilin-Med-VL: Towards Chinese Large Vision-Language Model for General Healthcare

LLaVA-Ultra: Large Chinese Language and Vision Assistant for Ultrasound

A Role-specific Guided Large Language Model for Ophthalmic Consultation Based on Stylistic Differentiation

Language Enhanced Model for Eye (LEME): An Open-Source Ophthalmology-Specific Large Language Model

HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale

Large language models and their impact in ophthalmology

Exploring large language model for next generation of artificial intelligence in ophthalmology

Medical education with large language models in ophthalmology: custom instructions and enhanced retrieval capabilities

Utilizing Large Language Models in Ophthalmology: The Current Landscape and Challenges

LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day