Abstract: Vision transformers have achieved impressive performance on many vision tasks. However, they may suffer from high redundancy when capturing local features in shallow layers. Local self-attention or early-stage convolutions are therefore used, but these sacrifice the capacity to capture long-range dependencies. A challenge then arises: can we obtain efficient and effective global context modeling at the early stages of a neural network? To address this issue, we draw inspiration from the design of superpixels, which reduce the number of image primitives in subsequent processing, and introduce super tokens into the vision transformer. Super tokens attempt to provide a semantically meaningful tessellation of visual content, thereby reducing the number of tokens in self-attention while preserving global modeling. Specifically, we propose a simple yet strong super token attention (STA) mechanism with three steps: the first samples super tokens from visual tokens via sparse association learning, the second performs self-attention on the super tokens, and the last maps them back to the original token space. STA decomposes vanilla global attention into multiplications of a sparse association map and a low-dimensional attention, leading to high efficiency in capturing global dependencies. Based on STA, we develop a hierarchical vision transformer. Extensive experiments demonstrate its strong performance on various vision tasks. In particular, without any extra training data or labels, it achieves 86.4% top-1 accuracy on ImageNet-1K with fewer than 100M parameters. It also achieves 53.9 box AP and 46.8 mask AP on the COCO detection task, and 51.9 mIoU on the ADE20K semantic segmentation task. Code is released at https://github.com/hhb072/STViT.
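
The three STA steps described in the abstract can be illustrated with a minimal sketch. This is not the released implementation: it assumes a 1-D token sequence instead of a 2-D feature map, uses no learned projections, initializes super tokens by averaging contiguous groups, and approximates "sparse association" by letting each token associate only with its own and adjacent super tokens. The names `super_token_attention`, `num_super`, and the helper functions are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats (-inf entries map to 0)."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[dot(row, col) for col in zip(*b)] for row in a]


def super_token_attention(tokens, num_super):
    """Simplified STA: tokens is an N x C nested list; returns (N x C output, N x M map)."""
    n, c = len(tokens), len(tokens[0])
    if n % num_super:
        raise ValueError("this sketch assumes num_super divides the token count")
    group = n // num_super
    scale = 1.0 / math.sqrt(c)

    # Assumption: initial super tokens are the means of contiguous token groups.
    supers = [
        [sum(tokens[i][d] for i in range(g * group, (g + 1) * group)) / group
         for d in range(c)]
        for g in range(num_super)
    ]

    # Step 1: sparse association. Each token scores only its own and
    # neighbouring super tokens; all other entries are masked to zero weight.
    assoc = []
    for i, tok in enumerate(tokens):
        own = i // group
        scores = [scale * dot(tok, s) if abs(g - own) <= 1 else float("-inf")
                  for g, s in enumerate(supers)]
        assoc.append(softmax(scores))
    # Re-sample super tokens as association-weighted averages of the tokens.
    assoc_t = [list(col) for col in zip(*assoc)]
    supers = [[v / sum(weights) for v in row]
              for weights, row in zip(assoc_t, matmul(assoc_t, tokens))]

    # Step 2: self-attention among the M super tokens (M x M, not N x N).
    attn = [softmax([scale * dot(s, t) for t in supers]) for s in supers]
    supers = matmul(attn, supers)

    # Step 3: map super tokens back to the original token space.
    return matmul(assoc, supers), assoc


tokens = [[float(i), float(i % 3)] for i in range(8)]
out, assoc = super_token_attention(tokens, num_super=4)
```

In this sketch the only dense attention matrix is M x M over super tokens, and each row of the N x M association map has at most three nonzero entries, which reflects the decomposition into a sparse association map and a low-dimensional attention.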

SweetTokenizer: Semantic-Aware Spatial-Temporal Tokenizer for Compact Visual Discretization

Making Vision Transformers Efficient from A Token Sparsification View

VidTok: A Versatile and Open-Source Video Tokenizer

OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation

Superpixel Tokenization for Vision Transformers: Preserving Semantic Integrity in Visual Tokens

Prune Spatio-temporal Tokens by Semantic-aware Temporal Accumulation

Image and Video Tokenization with Binary Spherical Quantization

Video Transformer based Video Quality Assessment with Spatiotemporally adaptive Token Selection and Assembly

Token Compensator: Altering Inference Cost of Vision Transformer without Re-Tuning

Efficient Vision Transformer via Token Merger

SVT: Supertoken Video Transformer for Efficient Video Understanding

SoftVQ-VAE: Efficient 1-Dimensional Continuous Tokenizer

Taming Scalable Visual Tokenizer for Autoregressive Image Generation

A Spitting Image: Modular Superpixel Tokenization in Vision Transformers

Efficient Video Transformers with Spatial-Temporal Token Selection

TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?

FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression

Sub-token ViT Embedding via Stochastic Resonance Transformers

Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention

Vision Transformer with Super Token Sampling

Dynamic Token-Pass Transformers for Semantic Segmentation