Abstract: Pre-training deep learning models on large datasets of natural images, such as ImageNet, has become the standard for endoscopic image analysis. This approach is generally superior to training from scratch, owing to the scarcity of high-quality medical imagery and labels. However, it is still unknown whether features learned on natural imagery provide an optimal starting point for downstream medical endoscopic imaging tasks. Intuitively, pre-training with imagery closer to the target domain could lead to better-suited feature representations. This study evaluates whether in-domain pre-training for gastrointestinal endoscopic image analysis offers benefits over pre-training on natural images. To this end, we present a dataset comprising 5,014,174 gastrointestinal endoscopic images from eight different medical centers (GastroNet-5M), and exploit self-supervised learning with SimCLRv2, MoCov2 and DINO to learn relevant features for in-domain downstream tasks. The learned features are compared to features learned on natural images with multiple methods and varying amounts of data and/or labels (e.g., billion-scale semi-weakly supervised learning and supervised learning on ImageNet-21k). The evaluation is performed on five downstream datasets, each designed for a specific gastrointestinal task, for example, GIANA for angiodysplasia detection and Kvasir-SEG for polyp segmentation. The findings indicate that self-supervised domain-specific pre-training, specifically using the DINO framework, results in better-performing models than any supervised pre-training on natural images. On the ResNet50 and Vision-Transformer-small architectures, self-supervised in-domain pre-training with DINO leads to an average performance boost of 1.63% and 4.62%, respectively, on the downstream datasets. This improvement is measured against the best performance achieved through pre-training on natural images within any of the evaluated frameworks. Moreover, the in-domain pre-trained models also exhibit increased robustness against distortion perturbations (noise, contrast, blur, etc.): the in-domain pre-trained ResNet50 and Vision-Transformer-small with DINO scored on average 1.28% and 3.55% higher, respectively, on the performance metrics than the best-performing models pre-trained on natural images. Overall, this study highlights the importance of in-domain pre-training for improving the generic nature, scalability and performance of deep learning for medical image analysis. The GastroNet-5M pre-trained weights are made publicly available in our repository: huggingface.co/tgwboers/GastroNet-5M_Pretrained_Weights.
