RET-CLIP: A Retinal Image Foundation Model Pre-trained with Clinical Diagnostic Reports

Jiawei Du, Jia Guo, Weihang Zhang, Shengzhu Yang, Hanruo Liu, Huiqi Li, Ningli Wang
2024-08-19
Abstract: Vision-language foundation models are increasingly investigated in computer vision and natural language processing, yet their exploration in ophthalmology and broader medical applications remains limited. The main challenge is the lack of labeled data for training foundation models. To address this issue, this paper develops a CLIP-style retinal image foundation model. Our foundation model, RET-CLIP, is trained on a dataset of 193,865 patients to extract general features of color fundus photographs (CFPs), employing a tripartite optimization strategy that focuses on the left-eye, right-eye, and patient levels to reflect real-world clinical scenarios. Extensive experiments demonstrate that RET-CLIP outperforms existing benchmarks across eight diverse datasets spanning four critical diagnostic categories: diabetic retinopathy, glaucoma, multiple disease diagnosis, and multi-label classification of multiple diseases, demonstrating the performance and generality of our foundation model. The source code and pre-trained model are available at https://github.com/sStonemason/RET-CLIP.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
Research on vision-language foundation models is growing in computer vision and natural language processing, but their use in ophthalmology and other medical applications remains limited. The main obstacle is the lack of annotated data for training such models. To tackle this, the paper proposes RET-CLIP, a CLIP-style retinal image foundation model. It is pre-trained on color fundus photographs paired with clinical diagnostic reports from 193,865 patients, using a tripartite optimization strategy that operates at the left-eye, right-eye, and patient levels to reflect real clinical scenarios. By aligning retinal images with the textual information in the diagnostic reports, RET-CLIP learns general image features without relying on manual labels. The model is evaluated on eight datasets covering four diagnostic categories: diabetic retinopathy, glaucoma, multiple disease diagnosis, and multi-label classification of multiple diseases. Experimental results show that RET-CLIP outperforms existing benchmarks across these datasets, demonstrating its performance and generality.
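To make the idea of a CLIP-style objective applied at three levels more concrete, the following is a minimal sketch, not the authors' implementation. It assumes each level (left eye, right eye, patient) already has paired image and text feature vectors, applies the standard symmetric contrastive (InfoNCE) loss at each level, and sums the three terms with equal weight. The function names, the temperature value of 0.07, the equal weighting, and the way features are produced are all assumptions for illustration; the text above does not specify them.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def clip_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text features.

    The temperature is a hypothetical default, not a value from the paper.
    """
    imgs = [_normalize(v) for v in image_feats]
    txts = [_normalize(v) for v in text_feats]
    # Cosine similarity between every image and every text, scaled.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image view
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))


def tripartite_loss(left_img, right_img, patient_img,
                    left_txt, right_txt, patient_txt):
    """Sum of three contrastive terms (equal weighting is an assumption)."""
    return (
        clip_loss(left_img, left_txt)
        + clip_loss(right_img, right_txt)
        + clip_loss(patient_img, patient_txt)
    )


if __name__ == "__main__":
    # Toy batch of two patients with 2-dimensional features.
    feats = [[1.0, 0.0], [0.0, 1.0]]
    swapped = [[0.0, 1.0], [1.0, 0.0]]
    print("aligned pairs:", clip_loss(feats, feats))
    print("mismatched pairs:", clip_loss(feats, swapped))
```

In this toy batch, correctly paired features give a loss close to zero while mismatched pairs give a large loss, which is the signal a CLIP-style model uses to pull each image toward its own report and away from the reports of other patients.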