SPT-Swin: A Shifted Patch Tokenization Swin Transformer for Image Classification

Gazi Jannatul Ferdous, Khaleda Akhter Sathi, Md. Azad Hossain, M. Ali Akber Dewan
DOI: https://doi.org/10.1109/access.2024.3448304
IF: 3.9
2024-08-31
IEEE Access
Abstract: Transformer-based models such as the vision transformer (ViT) have recently been used extensively in computer vision tasks. However, the ViT needs an enormous dataset to achieve its superior performance, and the cost of computing self-attention between patches is quadratic in the number of patches. To address these two concerns, this paper proposes a novel shifted patch tokenization Swin transformer (SPT-Swin) for image classification. Shifted patch tokenization (SPT) compensates for data deficiency by increasing the number of data samples using the spatial information of the image patches, while the Swin transformer provides linear computational complexity by computing self-attention between patches within shifted windows. For model validation, the SPT-Swin framework is trained on the popular benchmark image datasets ImageNet-1K, CIFAR-10, and CIFAR-100, reaching classification accuracies of 89.45%, 95.67%, and 92.95%, respectively. Moreover, a comparative analysis against existing state-of-the-art models shows that classification performance improves by 7.05%, 4.14%, and 8.30% on ImageNet-1K, CIFAR-10, and CIFAR-100, respectively. The proposed SPT-based data augmentation technique, combined with the core Swin transformer model, could therefore serve as a data-efficient, linear-complexity model for future computer vision tasks.
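The abstract does not spell out the tokenization step, so the following is only a minimal sketch of one common formulation of shifted patch tokenization, not the authors' implementation. It assumes the image is shifted by half a patch in the four diagonal directions, the shifted copies are concatenated with the original along the channel axis, and the result is cut into non-overlapping patches. The function names `shift_image` and `shifted_patch_tokens`, the zero padding at the borders, and the omission of the usual layer normalization and learned linear projection are all assumptions made for illustration.

```python
import numpy as np


def shift_image(img, dy, dx):
    """Shift an (H, W, C) image by (dy, dx) pixels, zero-filling the vacated border."""
    h, w, _ = img.shape
    out = np.zeros_like(img)
    # Matching source/destination slices so the shift never wraps around.
    src_y = slice(max(0, -dy), min(h, h - dy))
    dst_y = slice(max(0, dy), min(h, h + dy))
    src_x = slice(max(0, -dx), min(w, w - dx))
    dst_x = slice(max(0, dx), min(w, w + dx))
    out[dst_y, dst_x] = img[src_y, src_x]
    return out


def shifted_patch_tokens(img, patch_size):
    """Return one flat token per patch, built from the image plus four shifted copies.

    Assumed formulation (hypothetical, for illustration): diagonal shifts of
    half a patch, channel-wise concatenation, then non-overlapping patches.
    """
    h, w, c = img.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image size must be a multiple of patch_size")
    s = patch_size // 2
    shifts = [(0, 0), (-s, -s), (-s, s), (s, -s), (s, s)]
    # Stack original and shifted copies along channels: (H, W, 5C).
    stacked = np.concatenate(
        [shift_image(img, dy, dx) for dy, dx in shifts], axis=-1
    )
    gh, gw = h // patch_size, w // patch_size
    # Split both spatial axes into (grid, within-patch) and group by patch.
    patches = stacked.reshape(gh, patch_size, gw, patch_size, 5 * c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * 5 * c)


if __name__ == "__main__":
    image = np.random.rand(32, 32, 3).astype(np.float32)  # CIFAR-sized input
    tokens = shifted_patch_tokens(image, patch_size=4)
    print(tokens.shape)
```

Under these assumptions each token carries five times as many channels as a plain patch, which is how the shifted copies inject extra spatial context from neighboring pixels before the tokens reach the windowed attention layers.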
Computer Science, Information Systems; Telecommunications; Engineering, Electrical & Electronic