Feature Pyramid Vision Transformer for MedMNIST Classification Decathlon.

Jinwei Liu, Yan Li, Guitao Cao, Yong Liu, Wenming Cao
DOI: https://doi.org/10.1109/ijcnn55064.2022.9892282
2022-01-01
Abstract: MedMNIST is a medical image dataset designed to remove the need for medical background knowledge, but no existing model generalizes well across all of its sub-datasets. Because they model long-range relations inadequately, models based on convolutional neural networks (CNNs) cannot fully learn the information in images. Relying only on high-level features also limits generalization. These issues remain challenges for the MedMNIST Classification Decathlon. In this paper, we propose the Feature Pyramid Vision Transformer (FPViT), a strong alternative for the MedMNIST Classification Decathlon. FPViT exhibits enhanced feature learning and modeling capabilities by combining the merits of the residual network (ResNet) and the Vision Transformer (ViT). The Transformers in our model take the features extracted by ResNet as sequences to capture global context, which compensates for the inherent locality of convolution operations. Moreover, the feature pyramid in our model makes effective use of the multi-scale feature maps from the basic layers of ResNet. These multi-scale features, from low level to high level, make the model more adaptable. The final prediction is based on the multi-scale ViT and original ResNet heads. In experiments, FPViT achieves better classification and generalization on MedMNIST than state-of-the-art methods.
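The abstract describes two mechanical steps that a small example may make concrete: CNN feature maps are treated as token sequences for a Transformer, and predictions from several heads are combined. The following is a minimal sketch in plain Python, not the authors' implementation. The function names `feature_map_to_tokens` and `average_logits` are hypothetical, and averaging is only an assumed fusion rule, since the abstract does not say how the heads are combined.

```python
from typing import List

# A feature map is stored as nested lists indexed [channel][row][column].
FeatureMap = List[List[List[float]]]


def feature_map_to_tokens(fmap: FeatureMap) -> List[List[float]]:
    """Flatten a C x H x W feature map into H*W tokens of dimension C.

    Each spatial position becomes one token whose entries are the
    channel values at that position (row-major order).
    """
    channels = len(fmap)
    height = len(fmap[0])
    width = len(fmap[0][0])
    return [
        [fmap[c][y][x] for c in range(channels)]
        for y in range(height)
        for x in range(width)
    ]


def average_logits(head_logits: List[List[float]]) -> List[float]:
    """Average per-class logits across heads.

    Assumption: simple averaging stands in for whatever fusion
    the paper actually uses.
    """
    n_heads = len(head_logits)
    return [sum(values) / n_heads for values in zip(*head_logits)]


if __name__ == "__main__":
    # Toy 2-channel, 2x2 feature map standing in for one ResNet stage.
    fmap = [
        [[1.0, 2.0], [3.0, 4.0]],
        [[5.0, 6.0], [7.0, 8.0]],
    ]
    tokens = feature_map_to_tokens(fmap)
    print(len(tokens), "tokens of dimension", len(tokens[0]))

    # Hypothetical logits from two heads over two classes.
    print(average_logits([[1.0, 3.0], [3.0, 5.0]]))
```

In a feature pyramid, the same flattening would be applied to the feature map of each stage, so shallower stages (larger H and W) yield longer token sequences than deeper ones.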