Transformer-Based Disease Identification for Small-Scale Imbalanced Capsule Endoscopy Dataset

Long Bai,Liangyu Wang,Tong Chen,Yuanhao Zhao,Hongliang Ren
DOI: https://doi.org/10.3390/electronics11172747
IF: 2.9
2022-09-01
Electronics
Abstract:Vision Transformer (ViT) is emerging as a new leader in computer vision with its outstanding performance in many tasks (e.g., ImageNet-22k, JFT-300M). However, the success of ViT relies on pretraining on large datasets. It is difficult for us to use ViT to train from scratch on a small-scale imbalanced capsule endoscopic image dataset. This paper adopts a Transformer neural network with a spatial pooling configuration. Transfomer's self-attention mechanism enables it to capture long-range information effectively, and the exploration of ViT spatial structure by pooling can further improve the performance of ViT on our small-scale capsule endoscopy dataset. We trained from scratch on two publicly available datasets for capsule endoscopy disease classification, obtained 79.15% accuracy on the multi-classification task of the Kvasir-Capsule dataset, and 98.63% accuracy on the binary classification task of the Red Lesion Endoscopy dataset.
engineering, electrical & electronic,computer science, information systems,physics, applied
What problem does this paper attempt to address?