Block Expanded DINORET: Adapting Natural Domain Foundation Models for Retinal Imaging Without Catastrophic Forgetting

Jay Zoellin, Colin Merk, Mischa Buob, Amr Saad, Samuel Giesser, Tahm Spitznagel, Ferhat Turgut, Rui Santos, Yukun Zhou, Sigfried Wagner, Pearse A. Keane, Yih Chung Tham, Delia Cabrera DeBuc, Matthias D. Becker, Gabor M. Somfai
2024-09-26
Abstract: Integrating deep learning into medical imaging is poised to greatly advance diagnostic methods but it faces challenges with generalizability. Foundation models, based on self-supervised learning, address these issues and improve data efficiency. Natural domain foundation models show promise for medical imaging, but systematic research evaluating domain adaptation, especially using self-supervised learning and parameter-efficient fine-tuning, remains underexplored. Additionally, little research addresses the issue of catastrophic forgetting during fine-tuning of foundation models. We adapted the DINOv2 vision transformer for retinal imaging classification tasks using self-supervised learning and generated two novel foundation models termed DINORET and BE DINORET. Publicly available color fundus photographs were employed for model development and subsequent fine-tuning for diabetic retinopathy staging and glaucoma detection. We introduced block expansion as a novel domain adaptation strategy and assessed the models for catastrophic forgetting. Models were benchmarked to RETFound, a state-of-the-art foundation model in ophthalmology. DINORET and BE DINORET demonstrated competitive performance on retinal imaging tasks, with the block expanded model achieving the highest scores on most datasets. Block expansion successfully mitigated catastrophic forgetting. Our few-shot learning studies indicated that DINORET and BE DINORET outperform RETFound in terms of data-efficiency. This study highlights the potential of adapting natural domain vision models to retinal imaging using self-supervised learning and block expansion. BE DINORET offers robust performance without sacrificing previously acquired capabilities. Our findings suggest that these methods could enable healthcare institutions to develop tailored vision models for their patient populations, enhancing global healthcare inclusivity.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses how to adapt natural-domain foundation models (such as the DINOv2 vision transformer) to retinal imaging tasks through self-supervised learning (SSL) and block expansion (BE), improving model performance while avoiding catastrophic forgetting. Specifically, the researchers pursue the following goals:

1. **Improve generalization in medical imaging**: Existing deep-learning models often perform poorly when the training data distribution differs from the clinical environment. This has fed distrust of artificial intelligence among doctors and the public and has exacerbated racial biases. Self-supervised learning can improve the data efficiency of the model and reduce such biases.
2. **Develop an efficient retinal disease classification model**: Using publicly available color fundus photograph (CFP) datasets, the researchers generate two new foundation models, DINORET and BE DINORET, through SSL and BE, and fine-tune them for diabetic retinopathy (DR) staging and glaucoma detection.
3. **Mitigate catastrophic forgetting**: When a pre-trained foundation model is fine-tuned, it may lose previously learned knowledge, a phenomenon called catastrophic forgetting. The researchers introduce block expansion to minimize the number of trainable parameters, thereby preserving the model's existing features and preventing catastrophic forgetting.
4. **Verify the effectiveness of the new method**: DINORET and BE DINORET are compared with RETFound, the existing state-of-the-art model, on retinal imaging tasks, with particular attention to data efficiency and performance.

### Research Background

As deep learning is used ever more widely in medical image diagnosis, ensuring the generalization ability and fairness of models has become an important issue.
Natural-domain foundation models (such as DINOv2) are pre-trained on large-scale image datasets and have strong feature representation capabilities, but domain adaptation is still required before they can be applied to medical images. In addition, catastrophic forgetting is a common challenge, especially during fine-tuning. This study therefore explores how to adapt these foundation models to retinal imaging tasks while maintaining their original performance.

### Main Contributions

1. **Self-supervised learning strategy**: The paper shows how self-supervised learning can adapt a natural-domain vision transformer to the medical domain and proposes a new method for generating medical foundation models.
2. **Practicality of the model architecture**: The model architecture is kept relatively small, reducing computational requirements and easing deployment in clinical environments.
3. **Data efficiency**: The paper demonstrates the superiority of DINOv2, DINORET and BE DINORET in terms of data efficiency.
4. **Performance improvement**: The paper reports that these models outperform RETFound in all experiments; BE DINORET in particular performs strongly without freezing the backbone network while avoiding catastrophic forgetting.
5. **Suggestions for future benchmarks**: Future benchmarks should focus on embedding quality rather than fine-tuning strategies, to ensure fair comparison and reduce overfitting.

Through these efforts, the researchers hope to provide customized vision models for medical institutions and to enhance the inclusiveness and fairness of global medical services.
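
### Illustrative Sketch: Block Expansion

The block expansion idea summarized above (few trainable parameters, existing features preserved) can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the authors' implementation. It assumes a common formulation of block expansion: the original blocks are frozen, and new blocks are inserted whose residual branch is zero-initialized so that each starts as an identity mapping. The `Block` class, `expand_blocks`, the `every` parameter, and the element-wise toy "layer" are all hypothetical names chosen for illustration; a real vision transformer block contains attention and MLP sub-layers.

```python
import math
from dataclasses import dataclass, replace
from typing import List


@dataclass
class Block:
    """Toy residual block: y = x + w_out * tanh(w_in * x), element-wise."""
    w_in: List[float]
    w_out: List[float]
    trainable: bool = True

    def forward(self, x: List[float]) -> List[float]:
        return [xi + wo * math.tanh(wi * xi)
                for xi, wi, wo in zip(x, self.w_in, self.w_out)]


def expand_blocks(blocks: List[Block], every: int) -> List[Block]:
    """Freeze the original blocks and insert one trainable block after
    each group of `every` originals (assumed recipe, see lead-in)."""
    expanded: List[Block] = []
    for i, block in enumerate(blocks, start=1):
        # Original blocks keep their weights but are marked as frozen.
        expanded.append(replace(block, trainable=False))
        if i % every == 0:
            expanded.append(Block(
                w_in=list(block.w_in),           # copy of the preceding block
                w_out=[0.0] * len(block.w_out),  # zeroed output: branch adds nothing
                trainable=True,
            ))
    return expanded


def run(blocks: List[Block], x: List[float]) -> List[float]:
    """Apply the blocks in sequence."""
    for block in blocks:
        x = block.forward(x)
    return x


def count_trainable(blocks: List[Block]) -> int:
    """Number of parameters that would receive gradient updates."""
    return sum(len(b.w_in) + len(b.w_out) for b in blocks if b.trainable)


backbone = [Block([0.5, -0.3], [0.2, 0.1]) for _ in range(4)]
expanded = expand_blocks(backbone, every=2)
x = [1.0, -2.0]

# At initialization the expanded model reproduces the backbone exactly.
same_output = run(backbone, x) == run(expanded, x)
```

Because the inserted blocks start as identity mappings, the expanded model initially computes the same function as the original backbone, and only the new blocks are updated during domain adaptation. In this toy setup, that is 8 of the 24 parameters.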