Abstract: In this paper, we introduce the big.LITTLE Vision Transformer, an architecture for efficient visual recognition. This dual-transformer system is composed of two distinct blocks: the big performance block, which has high capacity and substantial computational demands, and the LITTLE efficiency block, which has lower capacity and is designed for speed. The key innovation of our approach is its dynamic inference mechanism. When processing an image, the system estimates the importance of each token and routes it accordingly: essential tokens are processed by the high-performance big model, while less critical tokens are handled by the more efficient LITTLE model. This selective processing significantly reduces computational load without sacrificing overall performance, because detailed analysis is reserved for the most important information. To validate the big.LITTLE Vision Transformer, we conducted comprehensive experiments on image classification and the Segment Anything task. Our results demonstrate that the big.LITTLE architecture maintains high accuracy while achieving substantial computational savings, dynamically balancing the trade-off between performance and efficiency on large-scale visual recognition tasks. The success of our method underscores the potential of hybrid models for optimizing both computation and performance in visual recognition, paving the way for more practical and scalable deployment of advanced neural networks in real-world applications.
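The token routing described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how importance is scored or how many tokens go to each block, so the `scores` input, the `keep_ratio` parameter, the top-k selection rule, and the function names `route_tokens` and `big_little_step` are all assumptions made for illustration. The two blocks are passed in as plain callables standing in for the big and LITTLE transformer blocks.

```python
from typing import Callable, List, Sequence, Tuple

Token = List[float]
Block = Callable[[List[Token]], List[Token]]


def route_tokens(scores: Sequence[float], keep_ratio: float) -> Tuple[List[int], List[int]]:
    """Split token indices into (big, little) groups by descending score.

    Assumption: the top `keep_ratio` fraction of tokens counts as important.
    """
    if not 0.0 <= keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in [0, 1]")
    num_big = int(len(scores) * keep_ratio)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Sort each group so tokens keep their original relative order.
    return sorted(ranked[:num_big]), sorted(ranked[num_big:])


def big_little_step(
    tokens: List[Token],
    scores: Sequence[float],
    big_block: Block,
    little_block: Block,
    keep_ratio: float = 0.5,
) -> List[Token]:
    """Process important tokens with big_block and the rest with little_block."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    big_idx, little_idx = route_tokens(scores, keep_ratio)
    big_out = big_block([tokens[i] for i in big_idx])
    little_out = little_block([tokens[i] for i in little_idx])
    # Scatter both outputs back to the original token positions.
    merged: List[Token] = [[] for _ in tokens]
    for i, out in zip(big_idx, big_out):
        merged[i] = out
    for i, out in zip(little_idx, little_out):
        merged[i] = out
    return merged


# Hypothetical stand-ins for the two blocks, used only to make the sketch runnable.
def toy_big(group: List[Token]) -> List[Token]:
    return [[10.0 * x for x in tok] for tok in group]


def toy_little(group: List[Token]) -> List[Token]:
    return [list(tok) for tok in group]


tokens = [[1.0], [2.0], [3.0], [4.0]]
scores = [0.1, 0.9, 0.8, 0.2]
result = big_little_step(tokens, scores, toy_big, toy_little, keep_ratio=0.5)
```

With these toy inputs, the two highest-scoring tokens (positions 1 and 2) pass through the stand-in big block and the other two through the stand-in LITTLE block. The computational saving in such a scheme would come from the expensive block seeing only a fraction of the tokens.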

AdaViT: Adaptive Vision Transformers for Efficient Image Recognition

SAViT: Structure-Aware Vision Transformer Pruning Via Collaborative Optimization

AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition

EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention

Super Vision Transformer

Not All Images are Worth 16x16 Words: Dynamic Transformers for Efficient Image Recognition

ACC-ViT: Atrous Convolution's Comeback in Vision Transformers

Lite Vision Transformer with Enhanced Self-Attention

Towards Efficient Adversarial Training on Vision Transformers

big.LITTLE Vision Transformer for Efficient Visual Recognition

ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias

DAT++: Spatially Dynamic Vision Transformer with Deformable Attention

LF-ViT: Reducing Spatial Redundancy in Vision Transformer for Efficient Image Recognition

Not All Images Are Worth 16x16 Words: Dynamic Vision Transformers with Adaptive Sequence Length

Depth-Wise Convolutions in Vision Transformers for Efficient Training on Small Datasets

DctViT: Discrete Cosine Transform Meet Vision Transformers

MaxViT: Multi-Axis Vision Transformer

Vision Transformer with Sparse Scan Prior

MIA-Former: Efficient and Robust Vision Transformers via Multi-grained Input-Adaptation

EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm