Abstract: In this paper, we introduce the big.LITTLE Vision Transformer, an innovative architecture aimed at achieving efficient visual recognition. This dual-transformer system is composed of two distinct blocks: the big performance block, characterized by its high capacity and substantial computational demands, and the LITTLE efficiency block, designed for speed with lower capacity. The key innovation of our approach lies in its dynamic inference mechanism. When processing an image, our system determines the importance of each token and allocates it accordingly: essential tokens are processed by the high-performance big model, while less critical tokens are handled by the more efficient little model. This selective processing significantly reduces computational load without sacrificing the overall performance of the model, as it ensures that detailed analysis is reserved for the most important information. To validate the effectiveness of our big.LITTLE Vision Transformer, we conducted comprehensive experiments on image classification and the segment anything task. Our results demonstrate that the big.LITTLE architecture not only maintains high accuracy but also achieves substantial computational savings. Specifically, our approach enables the efficient handling of large-scale visual recognition tasks by dynamically balancing the trade-offs between performance and efficiency. The success of our method underscores the potential of hybrid models in optimizing both computation and performance in visual recognition tasks, paving the way for more practical and scalable deployment of advanced neural networks in real-world applications.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper aims to solve the problem of slow inference speed of the Vision Transformer (ViT) model in visual recognition tasks. Specifically:

1. **Background problems**:
   - Although ViT performs excellently in tasks such as image classification, image segmentation and object detection, its inference speed is slow, especially when dealing with large-scale visual recognition tasks.
   - The large-scale ViT model (such as ViT-Huge) may run at a speed lower than 2 FPS on high-performance GPUs, which severely limits its deployment in practical applications.
2. **Deficiencies of existing solutions**:
   - Some methods improve the inference speed through model distillation or reducing the precision of model parameters, but these methods usually sacrifice the model performance.
   - Other methods reduce the amount of calculation by pruning input tokens, but this may lead to the loss of image context information and affect the performance of downstream tasks, especially tasks that need to maintain the spatial structure (such as image segmentation).
3. **Solutions proposed in the paper**:
   - A new architecture, the big.LITTLE Vision Transformer (bLViT), is introduced, which consists of two different blocks: a high-performance block (P-Block) and an efficient block (E-Block).
   - Dynamic inference mechanism: when processing an image, the system makes an assignment according to the importance of each token. The important tokens are processed by the high-performance block, and the less important tokens are processed by the efficient block.
   - This selective processing significantly reduces the computational load while maintaining the overall performance of the model, ensuring that detailed analysis is only focused on the most important information.
4. **Experimental verification**:
   - The effectiveness of bLViT is verified through comprehensive experiments on image classification and image segmentation tasks.
   - The experimental results show that bLViT not only maintains high accuracy, but also achieves significant computational savings, especially in large-scale visual recognition tasks.

### Summary

The main contribution of the paper is to propose a new ViT architecture, the big.LITTLE Vision Transformer. Through the dynamic allocation of token processing methods, it effectively solves the trade-off problem between the inference speed and performance of the ViT model, providing a new solution for efficient deployment in practical applications.
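
The token allocation idea described above can be illustrated with a small sketch. This is not the paper's implementation: the text here does not say how importance scores are computed, what fraction of tokens goes to each block, or how the two blocks exchange information. The function name `route_tokens`, the `keep_ratio` parameter, and the two stand-in block callables are all hypothetical, chosen only to show the routing step (split by importance, process each group, restore the original token order so spatial structure is kept).

```python
from typing import Callable, List, Sequence

Token = List[float]
Block = Callable[[List[Token]], List[Token]]


def route_tokens(
    tokens: Sequence[Token],
    scores: Sequence[float],
    keep_ratio: float,
    big_block: Block,
    little_block: Block,
) -> List[Token]:
    """Route the highest-scoring tokens to `big_block`, the rest to `little_block`.

    Hypothetical sketch: `keep_ratio` is the assumed fraction of tokens sent
    to the high-capacity block. The output preserves the input token order.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0.0 <= keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in [0, 1]")

    n = len(tokens)
    num_big = round(keep_ratio * n)

    # Rank token indices by descending importance; ties go to the earlier token.
    order = sorted(range(n), key=lambda i: (-scores[i], i))
    big_idx = sorted(order[:num_big])
    little_idx = sorted(order[num_big:])

    # Each block sees only its own subset of tokens.
    big_out = big_block([tokens[i] for i in big_idx])
    little_out = little_block([tokens[i] for i in little_idx])

    # Scatter the results back to their original positions.
    merged: List[Token] = [[] for _ in range(n)]
    for i, tok in zip(big_idx, big_out):
        merged[i] = tok
    for i, tok in zip(little_idx, little_out):
        merged[i] = tok
    return merged


if __name__ == "__main__":
    # Stand-in blocks: real blocks would be transformer layers.
    def demo_big(ts: List[Token]) -> List[Token]:
        return [[x * 10 for x in t] for t in ts]

    def demo_little(ts: List[Token]) -> List[Token]:
        return [[-x for x in t] for t in ts]

    toks = [[1.0], [2.0], [3.0], [4.0]]
    importance = [0.1, 0.9, 0.4, 0.8]
    print(route_tokens(toks, importance, 0.5, demo_big, demo_little))
```

Because every token is still processed by one of the two blocks and written back to its original position, no token is dropped. This is the property the summary contrasts with token pruning, which can lose context needed by tasks such as segmentation.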

big.LITTLE Vision Transformer for Efficient Visual Recognition

AdaViT: Adaptive Vision Transformers for Efficient Image Recognition

Lite Vision Transformer with Enhanced Self-Attention

LF-ViT: Reducing Spatial Redundancy in Vision Transformer for Efficient Image Recognition

GhostViT: Expediting Vision Transformers Via Cheap Operations

Depth-Wise Convolutions in Vision Transformers for Efficient Training on Small Datasets

Super Vision Transformer

A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation

Glance-and-Gaze Vision Transformer

Efficient Visual Transformer by Learnable Token Merging

Not All Images are Worth 16x16 Words: Dynamic Transformers for Efficient Image Recognition

MAFormer: A transformer network with multi-scale attention fusion for visual recognition

Three things everyone should know about Vision Transformers

EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm

A novel dual-granularity lightweight transformer for vision tasks

Visformer: The Vision-friendly Transformer

Vision Transformers: From Semantic Segmentation to Dense Prediction

EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention

Vicinity Vision Transformer

Enhanced Vision Transformer with Dual-Dimensional Self-Attention for Image Recognition

ResT: an Efficient Transformer for Visual Recognition