Abstract: Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance. However, their computational cost grows sharply with long token sequences, which are unavoidable in dense prediction tasks that require high-resolution input. Notably, dense prediction tasks such as semantic segmentation or object detection depend more on the contours and shapes of objects, while the texture inside objects is less informative. Motivated by this observation, we propose to apply adaptive resolution to different regions of the image according to their importance. Specifically, at an intermediate layer of the ViT, we use a spatial-aware density-based clustering algorithm to select representative tokens from the token sequence. Once the representative tokens are determined, we merge each remaining token into its closest representative token. Consequently, semantically similar tokens are merged to form low-resolution regions, while semantically irrelevant tokens are preserved independently as high-resolution regions. This strategy reduces the number of tokens, so subsequent layers process a shorter token sequence and run faster. We evaluate the proposed method on three different datasets and observe promising performance. For example, the "Segmenter ViT-L" model runs at 48% higher FPS without fine-tuning while maintaining performance. Our method can also be applied to accelerate fine-tuning. Experimental results show that it saves 52% of training time and achieves a 2.46× FPS speedup with only a 0.09% performance drop. The code is available at https://github.com/caddyless/ailurus/tree/main.
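The select-then-merge step described in the abstract can be illustrated with a small NumPy sketch. The abstract does not specify the clustering details, so everything below is an assumption made for illustration: the density-peaks style score (local density times distance to the nearest denser token), the way spatial distance is mixed into feature distance, mean pooling as the merge rule, and the names `merge_tokens`, `k`, and `spatial_weight`. It is not the authors' implementation.

```python
import numpy as np

def merge_tokens(tokens, grid_hw, num_reps, k=5, spatial_weight=0.5):
    """Hypothetical sketch: pick `num_reps` representative tokens with a
    spatially weighted density-peaks score, then mean-merge every token
    into its closest representative.

    tokens:  (N, C) array of token features from an intermediate layer.
    grid_hw: (H, W) patch grid with H * W == N and N > 1.
    Returns (merged, assign, reps): merged features (num_reps, C), the
    cluster index of every token (N,), and the representative indices.
    """
    h, w = grid_hw
    n, c = tokens.shape
    assert h * w == n and 1 <= num_reps <= n

    # Pairwise feature distance and pairwise grid distance, each scaled to [0, 1].
    feat = np.linalg.norm(tokens[:, None, :] - tokens[None, :, :], axis=-1)
    ys, xs = np.divmod(np.arange(n), w)
    pos = np.stack([ys, xs], axis=1).astype(float)
    spat = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    dist = feat / (feat.max() + 1e-12) + spatial_weight * spat / (spat.max() + 1e-12)

    # Local density from the k nearest neighbours (column 0 is the token itself).
    knn = np.sort(dist, axis=1)[:, 1:k + 1]
    density = np.exp(-(knn ** 2).mean(axis=1))

    # Distance to the nearest token with strictly higher density.
    higher = density[None, :] > density[:, None]
    delta = np.where(higher, dist, np.inf).min(axis=1)
    delta[np.isinf(delta)] = dist.max()  # densest token(s) have no denser neighbour

    # Representatives are dense tokens that sit far from any denser token.
    reps = np.argsort(-(density * delta))[:num_reps]

    # Assign every token to its closest representative; representatives keep themselves.
    assign = dist[:, reps].argmin(axis=1)
    assign[reps] = np.arange(num_reps)

    # Mean-pool the features of each cluster into one merged token.
    merged = np.zeros((num_reps, c))
    np.add.at(merged, assign, tokens)
    counts = np.bincount(assign, minlength=num_reps)
    merged /= counts[:, None]
    return merged, assign, reps
```

In this sketch, later layers would operate on `merged` (for example 4 tokens instead of 16 on a 4×4 grid), and `merged[assign]` is one plausible way to broadcast the results back to the full grid before a dense prediction head. Whether the original method restores resolution this way is not stated in the abstract.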

PredToken: Predicting Unknown Tokens and Beyond with Coarse-to-Fine Iterative Decoding

Object Recognition as Next Token Prediction

Token Sparsification for Faster Medical Image Segmentation

PredRNN: A Recurrent Neural Network for Spatiotemporal Predictive Learning

TokenUnify: Scalable Autoregressive Visual Pre-training with Mixture Token Prediction

TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?

High-Resolution Image Synthesis via Next-Token Prediction

Unleashing Transformers: Parallel Token Prediction with Discrete Absorbing Diffusion for Fast High-Resolution Image Generation from Vector-Quantized Codes

$ε$-VAE: Denoising as Visual Decoding

Expediting Large-Scale Vision Transformer for Dense Prediction without Fine-tuning

PredRNN: Recurrent Neural Networks for Predictive Learning using Spatiotemporal LSTMs

PrevPredMap: Exploring Temporal Modeling with Previous Predictions for Online Vectorized HD Map Construction

TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation

PredRNN++: Towards A Resolution of the Deep-in-Time Dilemma in Spatiotemporal Predictive Learning

Collaborative decoding of critical tokens for boosting factuality of large language models

Future Token Prediction -- Causal Language Modelling with Per-Token Semantic State Vector for Multi-Token Prediction

An Image is Worth 32 Tokens for Reconstruction and Generation

OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation

Generalized Decoding for Pixel, Image, and Language

AiluRus: A Scalable ViT Framework for Dense Prediction

Principles of Visual Tokens for Efficient Video Understanding