Investigating the Robustness of Vision Transformers against Label Noise in Medical Image Classification

Bidur Khanal, Prashant Shrestha, Sanskar Amgain, Bishesh Khanal, Binod Bhattarai, Cristian A. Linte
2024-02-27
Abstract: Label noise in medical image classification datasets significantly hampers the training of supervised deep learning methods, undermining their generalizability. The test performance of a model tends to decrease as the label noise rate increases. Over recent years, several methods have been proposed to mitigate the impact of label noise in medical image classification and enhance the robustness of the model. Predominantly, these works have employed CNN-based architectures as the backbone of their classifiers for feature extraction. However, in recent years, Vision Transformer (ViT)-based backbones have replaced CNNs, demonstrating improved performance and a greater ability to learn more generalizable features, especially when the dataset is large. Nevertheless, no prior work has rigorously investigated how transformer-based backbones handle the impact of label noise in medical image classification. In this paper, we investigate the architectural robustness of ViT against label noise and compare it to that of CNNs. We use two medical image classification datasets -- COVID-DU-Ex and NCT-CRC-HE-100K -- both corrupted by injecting label noise at various rates. Additionally, we show that pretraining is crucial for ensuring ViT's improved robustness against label noise in supervised training.
Image and Video Processing, Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The paper examines how robust Vision Transformers (ViT) are to label noise in medical image classification, and compares them with traditional Convolutional Neural Networks (CNNs). Specifically, the researchers focus on the following aspects:

1. **Impact of Label Noise**: Label noise in medical image classification datasets can severely degrade the training of supervised learning methods and weaken the model's generalization ability. As the label noise rate increases, the model's test performance usually declines.
2. **Comparison between ViT and CNN**: Although ViT has performed strongly on many benchmarks in recent years, little research has examined how a ViT backbone copes with label noise. The paper therefore compares ViT and a CNN (represented by ResNet18) experimentally under different label noise rates.
3. **Role of Self-Supervised Pre-Training**: The study finds that self-supervised pre-training is crucial for improving the robustness of ViT to label noise. With either of two self-supervised pre-training methods, Masked Autoencoders (MAE) and SimMIM, the performance of ViT at high label noise rates improves significantly.
4. **Application of Co-Teaching Method**: The researchers also apply the Co-teaching label-noise learning method to ViT. For ViT without pre-training, Co-teaching is less effective than it is for ResNet18; with pre-training, ViT performs significantly better than its counterpart without pre-training.

In summary, the core objective of the paper is to evaluate the robustness of ViT to label noise in medical image classification and to demonstrate experimentally that appropriate self-supervised pre-training can significantly improve ViT's performance when labels are noisy.
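
To make the two mechanisms mentioned above more concrete, here is a minimal sketch of (a) injecting label noise at a chosen rate and (b) the small-loss sample selection at the heart of Co-teaching. It is an illustration under stated assumptions, not the authors' code: the summary does not say which noise model the paper uses, so symmetric (uniform) flipping is assumed here, and the function names `inject_symmetric_label_noise` and `small_loss_selection`, the `forget_rate` parameter, and the example values are all hypothetical.

```python
import random


def inject_symmetric_label_noise(labels, num_classes, noise_rate, seed=0):
    """Return a copy of `labels` in which a `noise_rate` fraction is flipped.

    Assumption: symmetric noise, i.e. each selected label is replaced by a
    class drawn uniformly from the *other* classes (requires num_classes >= 2).
    """
    if not 0.0 <= noise_rate <= 1.0:
        raise ValueError("noise_rate must be in [0, 1]")
    rng = random.Random(seed)
    noisy = list(labels)
    num_flips = round(noise_rate * len(noisy))
    # Sample distinct positions so exactly `num_flips` labels are corrupted.
    for idx in rng.sample(range(len(noisy)), num_flips):
        other_classes = [c for c in range(num_classes) if c != noisy[idx]]
        noisy[idx] = rng.choice(other_classes)
    return noisy


def small_loss_selection(losses, forget_rate):
    """Return indices of the samples with the smallest per-sample loss.

    In Co-teaching, two networks are trained side by side; in each
    mini-batch, each network picks its small-loss samples (treated as
    likely clean) and its peer is updated on that selection.
    `forget_rate` is the fraction of the batch to discard.
    """
    num_keep = int((1.0 - forget_rate) * len(losses))
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return order[:num_keep]


if __name__ == "__main__":
    clean = [i % 9 for i in range(1000)]  # illustrative labels for 9 classes
    noisy = inject_symmetric_label_noise(clean, num_classes=9, noise_rate=0.4)
    flipped = sum(a != b for a, b in zip(clean, noisy))
    print(f"{flipped} of {len(clean)} labels flipped")

    batch_losses = [0.9, 0.1, 0.5, 0.2]
    print(small_loss_selection(batch_losses, forget_rate=0.5))
```

Because every flipped label is forced to differ from the original, the realized noise rate equals the requested one (up to rounding). Other noise models, such as class-dependent flipping, would need a different sampling step.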