Abstract: Integrating deep learning into medical imaging is poised to greatly advance diagnostic methods, but it faces challenges with generalizability. Foundation models, based on self-supervised learning, address these issues and improve data efficiency. Natural domain foundation models show promise for medical imaging, but systematic research evaluating domain adaptation, especially using self-supervised learning and parameter-efficient fine-tuning, remains underexplored. Additionally, little research addresses the issue of catastrophic forgetting during fine-tuning of foundation models. We adapted the DINOv2 vision transformer for retinal imaging classification tasks using self-supervised learning and generated two novel foundation models termed DINORET and BE DINORET. Publicly available color fundus photographs were employed for model development and subsequent fine-tuning for diabetic retinopathy staging and glaucoma detection. We introduced block expansion as a novel domain adaptation strategy and assessed the models for catastrophic forgetting. Models were benchmarked against RETFound, a state-of-the-art foundation model in ophthalmology. DINORET and BE DINORET demonstrated competitive performance on retinal imaging tasks, with the block expanded model achieving the highest scores on most datasets. Block expansion successfully mitigated catastrophic forgetting. Our few-shot learning studies indicated that DINORET and BE DINORET outperform RETFound in terms of data efficiency. This study highlights the potential of adapting natural domain vision models to retinal imaging using self-supervised learning and block expansion. BE DINORET offers robust performance without sacrificing previously acquired capabilities. Our findings suggest that these methods could enable healthcare institutions to develop tailored vision models for their patient populations, enhancing global healthcare inclusivity.

What problem does this paper attempt to address?

The main problem this paper addresses is how to adapt natural-domain foundation models (such as the DINOv2 Vision Transformer) to retinal imaging tasks through self-supervised learning (SSL) and block expansion (BE), in order to improve model performance and avoid catastrophic forgetting. Specifically, the researchers pursue the following goals:

1. **Improve generalization in medical imaging**: Existing deep-learning models often perform poorly when the distribution of the training data differs from that of the clinical environment. This has led to distrust of artificial intelligence among doctors and the public and has exacerbated racial biases. Self-supervised learning can improve the data efficiency of the model and reduce such biases.
2. **Develop an efficient retinal disease classification model**: Using publicly available color fundus photograph (CFP) data sets, the researchers generate two new foundation models, DINORET and BE DINORET, through SSL and BE for diabetic retinopathy (DR) staging and glaucoma detection.
3. **Alleviate catastrophic forgetting**: When a pre-trained foundation model is fine-tuned, it may forget knowledge it has previously learned, a phenomenon called catastrophic forgetting. The researchers introduce block expansion to minimize the number of trainable parameters, thereby preserving the model's existing features and preventing catastrophic forgetting.
4. **Verify the effectiveness of the new method**: DINORET and BE DINORET are compared with RETFound, an existing state-of-the-art model, on retinal imaging tasks, with particular attention to data efficiency and performance.

### Research Background

As deep learning is increasingly used in medical image diagnosis, ensuring the generalization ability and fairness of models has become an important issue. Natural-domain foundation models (such as DINOv2) are pre-trained on large-scale image data sets and have strong feature representation capabilities, but they still require domain adaptation when applied to medical images. In addition, catastrophic forgetting is a common challenge, especially during fine-tuning. This study therefore explores how to adapt these foundation models to retinal imaging tasks while maintaining their original performance.

### Main Contributions

1. **Self-supervised learning strategy**: Shows how self-supervised learning can adapt a natural-domain Vision Transformer to the medical domain, and proposes a new method for generating medical foundation models.
2. **Practicality of the model architecture**: Keeps the model architecture relatively small to reduce computational requirements and ease deployment in clinical environments.
3. **Data efficiency**: Demonstrates the advantage of DINOv2, DINORET and BE DINORET over RETFound in terms of data efficiency.
4. **Performance improvement**: Shows that these models perform competitively with RETFound, with BE DINORET achieving the highest scores on most datasets; BE DINORET performs well without freezing the backbone network while avoiding catastrophic forgetting.
5. **Suggestions for future benchmark tests**: Recommends that future benchmarks focus on embedding quality rather than fine-tuning strategies, to ensure fair comparison and reduce overfitting.

Through these efforts, the researchers hope to enable medical institutions to build customized vision models and to enhance the inclusiveness and fairness of global medical services.
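
The block expansion idea mentioned above can be illustrated with a small, dependency-free sketch. It assumes the commonly used recipe in which copies of existing blocks are inserted into the network, the output projection of each copy is zero-initialized so the copy initially acts as an identity mapping, and only the copies are trained while the original blocks stay frozen. The summary does not spell out these details, so treat them as assumptions. `ResidualBlock`, `expand_blocks`, `forward` and the `every` parameter are hypothetical names, and the toy residual block only stands in for a real transformer block; this is not the authors' implementation.

```python
import copy
import random


class ResidualBlock:
    """Toy stand-in for a transformer block: y = x + W_out @ relu(W_in @ x)."""

    def __init__(self, dim, rng):
        self.w_in = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
        self.w_out = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
        self.trainable = True

    def __call__(self, x):
        hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in self.w_in]
        update = [sum(w * h for w, h in zip(row, hidden)) for row in self.w_out]
        # Residual connection: the block adds an update to its input.
        return [xi + ui for xi, ui in zip(x, update)]


def expand_blocks(blocks, every):
    """Freeze the original blocks and insert one trainable copy after every
    `every` of them (assumed recipe, see the lead-in)."""
    expanded = []
    for i, block in enumerate(blocks, start=1):
        block.trainable = False  # original weights are kept fixed
        expanded.append(block)
        if i % every == 0:
            new_block = copy.deepcopy(block)
            # Zero output projection: the new block starts as an identity map.
            new_block.w_out = [[0.0] * len(row) for row in new_block.w_out]
            new_block.trainable = True
            expanded.append(new_block)
    return expanded


def forward(blocks, x):
    for block in blocks:
        x = block(x)
    return x


rng = random.Random(0)
original = [ResidualBlock(4, rng) for _ in range(6)]
x = [0.5, -1.0, 2.0, 0.25]

before = forward(original, x)
expanded = expand_blocks(original, every=3)
after = forward(expanded, x)

print(len(expanded), "blocks,", sum(b.trainable for b in expanded), "trainable")
print("output unchanged at initialization:", after == before)
```

Because each inserted block contributes a zero update at initialization, the expanded network reproduces the original network's outputs before any training. Under the assumptions above, this is what lets domain-specific training start from the pre-trained behavior, and keeping the original blocks frozen is what limits forgetting.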

Block Expanded DINORET: Adapting Natural Domain Foundation Models for Retinal Imaging Without Catastrophic Forgetting

Evaluating General Purpose Vision Foundation Models for Medical Image Analysis: An Experimental Study of DINOv2 on Radiology Benchmarks

Training a high-performance retinal foundation model with half-the-data and 400 times less compute

RetiGen: A Framework for Generalized Retinal Diagnosis Using Multi-View Fundus Images

CADA: Multi-scale Collaborative Adversarial Domain Adaptation for unsupervised optic disc and cup segmentation

Exploring the Transferability of a Foundation Model for Fundus Images: Application to Hypertensive Retinopathy

Learning to Adapt Foundation Model DINOv2 for Capsule Endoscopy Diagnosis

Deep Learning Based Retinal Layer Segmentation in Optical Coherence Tomography Scans of Patients with Inherited Retinal Diseases

Are Natural Domain Foundation Models Useful for Medical Image Classification?

Do Vision Foundation Models Enhance Domain Generalization in Medical Image Segmentation?

RET-CLIP: A Retinal Image Foundation Model Pre-trained with Clinical Diagnostic Reports

A foundation model for generalizable disease detection from retinal images

Fine-Tuning SSL-Model to Enhance Detection of Cilioretinal Arteries on Colored Fundus Images

Foundation model-driven distributed learning for enhanced retinal age prediction

Surgical-DINO: Adapter Learning of Foundation Models for Depth Estimation in Endoscopic Surgery

A Novel Artificial-Intelligence-Based Approach for Automatic Assessment of Retinal Disease Images Using Multi-View Deep-Broad Learning Network

Foundation models in gastrointestinal endoscopic AI: Impact of architecture, pre-training approach and data efficiency

DRStageNet: Deep Learning for Diabetic Retinopathy Staging from Fundus Images

Accuracy of a New Foundation Model in Glaucoma Detection using Ocular Coherence Tomography Images

Adapting Pretrained Vision-Language Foundational Models to Medical Imaging Domains

EyeFound: A Multimodal Generalist Foundation Model for Ophthalmic Imaging