Abstract: The groundbreaking performance of transformers in Natural Language Processing (NLP) tasks has led them to displace traditional Convolutional Neural Networks (CNNs), owing to the efficiency and accuracy achieved through the self-attention mechanism. This success has inspired researchers to explore the use of transformers in computer vision tasks to attain enhanced long-range semantic awareness. Vision transformers (ViTs) have excelled in various computer vision tasks because self-attention lets them capture long-distance dependencies. Contemporary ViTs such as Data-efficient image Transformers (DeiT) can effectively learn both global semantic information and local texture information from images, achieving performance comparable to traditional CNNs. However, this performance comes at a high computational cost due to their very large number of parameters, which hinders deployment on resource-limited devices such as smartphones, cameras, and drones. ViTs also require a large amount of training data to match benchmark CNN models. We therefore identify two key challenges in deploying ViTs on small form factor devices: the high computational requirements of large models and the need for extensive training data. To address both, we propose compressing large ViT models using Knowledge Distillation (KD), implemented in a data-free manner to circumvent limitations on data availability. In addition to classification tasks, we conducted experiments on object detection within the same environment. Based on our analysis, we found that data-free knowledge distillation is an effective method to overcome both issues, enabling the deployment of ViTs on resource-constrained devices.
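
As a rough illustration of the distillation objective mentioned above, the following is a minimal sketch of the generic soft-target KD loss: the KL divergence between temperature-softened teacher and student class distributions. It is not the implementation used in this work. The function names and the temperature value are assumptions chosen for illustration, and in a data-free setting the logits would come from synthesized inputs rather than real training images.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    The temperature default is an illustrative assumption, not a value
    taken from the abstract.
    """
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Hypothetical logits for a 3-class problem.
teacher = [1.0, 2.0, 3.0]
student = [3.0, 2.0, 1.0]
loss = distillation_loss(student, teacher)
```

The loss is zero when the student reproduces the teacher's logits exactly and grows as the two distributions diverge, which is what drives the smaller student model toward the larger teacher's behavior.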
