EyeCLIP: A visual-language foundation model for multi-modal ophthalmic image analysis

Danli Shi, Weiyi Zhang, Jiancheng Yang, Siyu Huang, Xiaolan Chen, Mayinuer Yusufu, Kai Jin, Shan Lin, Shunming Liu, Qing Zhang, Mingguang He

2024-09-12

Abstract: Early detection of eye diseases like glaucoma, macular degeneration, and diabetic retinopathy is crucial for preventing vision loss. While artificial intelligence (AI) foundation models hold significant promise for addressing these challenges, existing ophthalmic foundation models primarily focus on a single modality, whereas diagnosing eye diseases requires multiple modalities. A critical yet often overlooked aspect is harnessing the multi-view information across various modalities for the same patient. Additionally, due to the long-tail nature of ophthalmic diseases, standard fully supervised or unsupervised learning approaches often struggle. Therefore, it is essential to integrate clinical text to capture a broader spectrum of diseases. We propose EyeCLIP, a visual-language foundation model developed using over 2.77 million multi-modal ophthalmology images with partial text data. To fully leverage the large multi-modal unlabeled and labeled data, we introduce a pretraining strategy that combines self-supervised reconstruction, multi-modal image contrastive learning, and image-text contrastive learning to learn a shared representation of multiple modalities. Evaluated on 14 benchmark datasets, EyeCLIP can be transferred to a wide range of downstream tasks involving ocular and systemic diseases, achieving state-of-the-art performance in disease classification, visual question answering, and cross-modal retrieval. EyeCLIP represents a significant advancement over previous methods, especially in its few-shot and even zero-shot capabilities in real-world long-tail scenarios.

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

The paper aims to address the following issues:

1. **Multimodal Ophthalmic Image Analysis**: Most current foundation models in ophthalmology focus on a single modality, whereas diagnosing eye diseases typically requires information from multiple modalities. The paper proposes EyeCLIP, a visual-language foundation model trained on multimodal image data to better capture the multi-view information available for the same patient.
2. **Long-Tail Distribution Problem**: Because ophthalmic diseases follow a long-tail distribution, standard fully supervised or unsupervised learning methods often struggle. EyeCLIP addresses this by incorporating clinical text to capture a broader spectrum of eye diseases.
3. **Multimodal Consistency and Image-Text Alignment**: Existing foundation models lack consistency across modalities and the ability to align images with text, both of which matter in practical applications. EyeCLIP learns a shared representation across modalities by combining self-supervised reconstruction, multimodal image contrastive learning, and image-text contrastive learning.
4. **Zero-Shot and Few-Shot Learning Capabilities**: The paper demonstrates EyeCLIP's zero-shot and few-shot capabilities in real-world long-tail scenarios, where labeled data are scarce.

In summary, the goal of the paper is to develop a visual-language foundation model that performs well on multimodal ophthalmic image analysis tasks and supports zero-shot and few-shot learning.
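To make the combination of the three pretraining objectives more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes a symmetric InfoNCE-style loss (as popularized by CLIP) for both contrastive terms, a mean-squared-error reconstruction term, and an equal-weight sum of the three. The function names, the temperature of 0.07, the loss weights, and the toy embeddings are all hypothetical choices made for illustration.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE loss over paired embeddings a[i] <-> b[i] (assumed form)."""
    a = [l2_normalize(v) for v in a]
    b = [l2_normalize(v) for v in b]
    # Pairwise cosine similarities, scaled by the temperature.
    logits = [[sum(x * y for x, y in zip(u, w)) / temperature for w in b] for u in a]
    logits_t = [list(col) for col in zip(*logits)]  # b -> a direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

def reconstruction_loss(pred, target):
    """Mean squared error between reconstructed and original values."""
    n = sum(len(p) for p in pred)
    return sum((x - y) ** 2 for p, t in zip(pred, target) for x, y in zip(p, t)) / n

def combined_loss(recon, img_img, img_txt, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three objectives; the weights are hypothetical."""
    return weights[0] * recon + weights[1] * img_img + weights[2] * img_txt

# Toy batch of two patients with made-up 2-D embeddings for two image
# modalities and the associated text.
fundus = [[1.0, 0.0], [0.0, 1.0]]
oct_scan = [[0.9, 0.1], [0.1, 0.9]]
text = [[1.0, 0.2], [0.2, 1.0]]

total = combined_loss(
    reconstruction_loss([[0.5, 0.5]], [[0.4, 0.6]]),
    contrastive_loss(fundus, oct_scan),  # image-image term across modalities
    contrastive_loss(fundus, text),      # image-text term
)
print(round(total, 4))
```

In this sketch, the image-image term pulls together embeddings of different modalities from the same patient, and the image-text term pulls each image toward its paired text. Since the abstract notes that only part of the data has text, a real training loop would presumably apply the image-text term only to samples that have it.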

VisionCLIP: An Med-AIGC based Ethical Language-Image Foundation Model for Generalizable Retina Image Analysis

RET-CLIP: A Retinal Image Foundation Model Pre-trained with Clinical Diagnostic Reports

EyeFound: A Multimodal Generalist Foundation Model for Ophthalmic Imaging

A foundation model for generalizable disease detection from retinal images

Diagnosing Systemic Disorders with AI Algorithms Based on Ocular Images

VisionFM: a Multi-Modal Multi-Task Vision Foundation Model for Generalist Ophthalmic Artificial Intelligence

Vision–language foundation model for echocardiogram interpretation

CLIP in Medical Imaging: A Comprehensive Survey

OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining

MultiEYE: Dataset and Benchmark for OCT-Enhanced Retinal Disease Recognition from Fundus Images

BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs

DiffCLIP: Few-shot Language-driven Multimodal Classifier

VisionUnite: A Vision-Language Foundation Model for Ophthalmology Enhanced with Clinical Knowledge

Ocular Disease Detection from Multiple Informatics Domains

Multi-Modal Multi-Instance Learning for Retinal Disease Recognition

OphGLM: Training an Ophthalmology Large Language-and-Vision Assistant based on Instructions and Dialogue

RemoteCLIP: A Vision Language Foundation Model for Remote Sensing

A Novel Hierarchical Deep Learning Framework for Diagnosing Multiple Visual Impairment Diseases in the Clinical Environment

RadCLIP: Enhancing Radiologic Image Analysis through Contrastive Language-Image Pre-training

A deep-learning pipeline for the diagnosis and grading of common blinding ophthalmic diseases based on lesion-focused classification model