Improved ViT via knowledge distillation on small datasets

Jun Wang, Weifeng Liu, Weishan Zhang, Baodi Liu
DOI: https://doi.org/10.1109/ICSP56322.2022.9965295
2022-01-01
Abstract: Transformer-based models have driven revolutionary advances in natural language processing (NLP) tasks, and their application to image classification has also achieved comparable results. A pure transformer applied to sequences of image patches can obtain performance comparable to state-of-the-art convolutional networks. However, high-performance vision transformers are pre-trained on hundreds of millions of images using large-scale infrastructure, which limits their application. In this work, focusing on transformer models for classification trained on small datasets (CIFAR-10 and CIFAR-100), we introduce a teacher-student strategy for transformers. It relies on a distillation token that ensures the student learns from the teacher through attention. Notably, when a convnet is used as the teacher network (we achieve around 90.0% accuracy on CIFAR-100), we obtain results comparable to convnets.
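The abstract does not spell out the training objective, so the following is only a minimal sketch of how a distillation token could be trained, assuming a hard-label scheme: the class-token head is supervised by the ground-truth label, the distillation-token head is supervised by the teacher's predicted label, and the two cross-entropy terms are averaged. The function names, the equal 0.5 weighting, and the use of plain Python lists for logits are illustrative assumptions, not details taken from the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cross_entropy(logits, target):
    """Cross-entropy of one sample against an integer class index."""
    return -log_softmax(logits)[target]


def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, label):
    """Hypothetical hard-label distillation objective (an assumption).

    cls_logits:     student output read from the class token
    dist_logits:    student output read from the distillation token
    teacher_logits: output of the (frozen) teacher, e.g. a convnet
    label:          ground-truth class index
    """
    # The teacher's argmax prediction acts as the target for the distillation head.
    teacher_label = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)
    loss_cls = cross_entropy(cls_logits, label)
    loss_dist = cross_entropy(dist_logits, teacher_label)
    # Equal weighting of the two terms is an assumed choice.
    return 0.5 * loss_cls + 0.5 * loss_dist
```

In this sketch the two heads receive different targets whenever the teacher disagrees with the ground truth, which is what lets the distillation token carry the teacher's inductive bias separately from the class token.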