Abstract: In this paper, we introduce the big.LITTLE Vision Transformer, an architecture aimed at efficient visual recognition. This dual-transformer system is composed of two distinct blocks: the big performance block, characterized by its high capacity and substantial computational demands, and the LITTLE efficiency block, designed for speed with lower capacity. The key innovation of our approach lies in its dynamic inference mechanism. When processing an image, our system determines the importance of each token and routes it accordingly: essential tokens are processed by the high-performance big block, while less critical tokens are handled by the more efficient LITTLE block. This selective processing significantly reduces computational load without sacrificing the overall performance of the model, because detailed analysis is reserved for the most important information. To validate the effectiveness of our big.LITTLE Vision Transformer, we conducted comprehensive experiments on image classification and segment anything tasks. Our results demonstrate that the big.LITTLE architecture not only maintains high accuracy but also achieves substantial computational savings. Specifically, our approach enables the efficient handling of large-scale visual recognition tasks by dynamically balancing the trade-off between performance and efficiency. The success of our method underscores the potential of hybrid models in optimizing both computation and performance in visual recognition tasks, paving the way for more practical and scalable deployment of advanced neural networks in real-world applications.
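The routing idea described in the abstract can be illustrated with a minimal sketch. The abstract does not say how token importance is computed, how many tokens go to each block, or what the blocks look like internally, so everything below is an assumption made for illustration: the names `route_tokens`, `big_block`, `little_block`, `big_little_layer`, and `relative_cost`, the top-k selection rule, the `keep_ratio` parameter, and the cost figures are all hypothetical and are not taken from the paper.

```python
def route_tokens(scores, keep_ratio):
    """Split token indices into (big, little) lists by importance score.

    Assumption: the highest-scoring fraction `keep_ratio` of tokens is
    treated as essential. The abstract does not specify the selection rule.
    """
    n_big = max(1, round(len(scores) * keep_ratio))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:n_big]), sorted(order[n_big:])


def big_block(token):
    # Hypothetical stand-in for the high-capacity, expensive block.
    return [2.0 * x for x in token]


def little_block(token):
    # Hypothetical stand-in for the low-capacity, cheap block.
    return [0.5 * x for x in token]


def big_little_layer(tokens, scores, keep_ratio=0.25):
    """Process each token with exactly one block, keeping token order."""
    big_idx, little_idx = route_tokens(scores, keep_ratio)
    out = [None] * len(tokens)
    for i in big_idx:
        out[i] = big_block(tokens[i])
    for i in little_idx:
        out[i] = little_block(tokens[i])
    return out, big_idx, little_idx


def relative_cost(keep_ratio, cost_big=1.0, cost_little=0.25):
    """Per-token cost relative to sending every token through the big block.

    The cost values are placeholders, not measurements from the paper.
    """
    return keep_ratio * cost_big + (1.0 - keep_ratio) * cost_little


if __name__ == "__main__":
    tokens = [[1.0, 1.0] for _ in range(4)]
    scores = [0.1, 0.9, 0.3, 0.7]  # made-up importance scores
    out, big_idx, little_idx = big_little_layer(tokens, scores, keep_ratio=0.5)
    print("big:", big_idx, "little:", little_idx)
    print("relative cost:", relative_cost(0.5))
```

Under these placeholder costs, lowering `keep_ratio` reduces the estimated compute linearly, which is the trade-off between performance and efficiency that the abstract refers to. The actual savings depend on the real block costs and the routing rule used in the paper.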

Gated Channel Transformation for Visual Recognition

CTFCD: Channel Transformer Based on Full Convolutional Decoder for Single Image Deraining

Linear Context Transform Block

Reliable or Deceptive? Investigating Gated Features for Smooth Visual Explanations in CNNs

Hierarchical Gate Network for Fine-Grained Visual Recognition

Competitive Inner-Imaging Squeeze and Excitation for Residual Network

Unified Normalization for Accelerating and Stabilizing Transformers

Channel Equilibrium Networks for Learning Deep Representation

Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition

Involution: Inverting the Inherence of Convolution for Visual Recognition

big.LITTLE Vision Transformer for Efficient Visual Recognition

Demystify Transformers & Convolutions in Modern Image Deep Networks

UniNet: Unified Architecture Search with Convolution, Transformer, and MLP

Vision Transformers: From Semantic Segmentation to Dense Prediction

SENetV2: Aggregated dense layer for channelwise and global representations

All You Need Is a Few Shifts: Designing Efficient Convolutional Neural Networks for Image Classification

Transform-Invariant Convolutional Neural Networks for Image Classification and Search

Conformer: Local Features Coupling Global Representations for Visual Recognition

Understanding Neural Networks Through Deep Visualization

Leveraging Batch Normalization for Vision Transformers

CSFNet: a compact and efficient convolution-transformer hybrid vision model