Lightweight Vision Transformer for Small Data Sets

Wenxing Fang, Yanxu Su
DOI: https://doi.org/10.1109/yac59482.2023.10401745
2023-01-01
Abstract: With the rapid development of Transformers in computer vision, Transformer-based models have become highly competitive architectures in the field. Although Transformer variants have achieved steadily higher accuracy on image classification tasks, the training-set size and parameter count these models require have grown dramatically. On small datasets, such models overfit and generalize poorly, leading to low accuracy on the test set. We propose a new lightweight vision transformer (LVT) to address these issues. We reconstruct the backbone network, which learns the relationships between pixels through local window self-attention and global self-attention. We also use attention pooling to fuse the token sequences generated by the backbone network in a finer-grained way. We train LVT from scratch on the CIFAR-10 and CIFAR-100 datasets and compare it with a modern convolutional neural network. The experimental results show that LVT outperforms the modern convolutional neural network in both accuracy and efficiency. On the CIFAR-10 test set, LVT reaches an accuracy of 96.83%, which indicates that the model can effectively address the problems of training on small datasets and has broad application prospects.
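The abstract does not spell out how attention pooling is computed, so the following is only a minimal sketch of one common formulation, not the authors' implementation. It assumes the backbone emits a token sequence of shape (n, d) and that a single learned vector scores each token; the names `attention_pool`, `tokens`, `w`, and `b` are hypothetical, as are the sizes used in the usage lines.

```python
import numpy as np

def attention_pool(tokens, w, b=0.0):
    """Fuse a token sequence into one vector by a softmax-weighted sum.

    tokens: array of shape (n, d), the backbone output (assumed layout).
    w:      array of shape (d,), a scoring vector (learned in practice).
    Returns the pooled vector of shape (d,) and the weights of shape (n,).
    """
    scores = tokens @ w + b            # one scalar score per token
    scores = scores - scores.max()     # shift for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()  # softmax over the n tokens
    pooled = weights @ tokens          # weighted sum of the tokens
    return pooled, weights

# Usage with random stand-in values (illustrative sizes only).
rng = np.random.default_rng(0)
tokens = rng.normal(size=(64, 128))
w = rng.normal(size=128) * 0.02
pooled, weights = attention_pool(tokens, w)
```

With a zero scoring vector every token gets the same weight and the result reduces to plain average pooling, so this form can be read as a learnable generalization of averaging the tokens.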