Abstract: This work presents Adaptive Local-then-Global Merging (ALGM), a token reduction method for semantic segmentation networks that use plain Vision Transformers. ALGM merges tokens in two stages: (1) in the first network layer, it merges similar tokens within a small local window, and (2) halfway through the network, it merges similar tokens across the entire image. This is motivated by an analysis in which we found that, in those situations, tokens with a high cosine similarity can likely be merged without a drop in segmentation quality. With extensive experiments across multiple datasets and network configurations, we show that ALGM not only significantly improves the throughput by up to 100%, but can also enhance the mean IoU by up to +1.1, thereby achieving a better trade-off between segmentation quality and efficiency than existing methods. Moreover, our approach is adaptive during inference, meaning that the same model can be used for optimal efficiency or accuracy, depending on the application. Code is available at https://tue-mps.github.io/ALGM.
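
The first stage described in the abstract (merging similar tokens within a small local window) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the 2x2 window, the 0.9 threshold, the "all pairs must be similar" criterion, and merging by averaging are all assumptions chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two token vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def local_merge(grid, window=2, threshold=0.9):
    """Merge tokens window by window (hypothetical helper, illustrative only).

    grid is an H x W nested list of token vectors. A window is collapsed
    into one averaged token only if every pair of tokens inside it has
    cosine similarity >= threshold (an assumed criterion).
    Returns (tokens, assignment), where assignment[r][c] is the index of
    the output token that now represents grid position (r, c).
    """
    height, width = len(grid), len(grid[0])
    tokens = []
    assignment = [[-1] * width for _ in range(height)]
    for r0 in range(0, height, window):
        for c0 in range(0, width, window):
            cells = [(r, c)
                     for r in range(r0, min(r0 + window, height))
                     for c in range(c0, min(c0 + window, width))]
            vecs = [grid[r][c] for r, c in cells]
            similar = all(cosine(vecs[i], vecs[j]) >= threshold
                          for i in range(len(vecs))
                          for j in range(i + 1, len(vecs)))
            if len(vecs) > 1 and similar:
                # Collapse the whole window into its mean token.
                tokens.append([sum(col) / len(vecs) for col in zip(*vecs)])
                for r, c in cells:
                    assignment[r][c] = len(tokens) - 1
            else:
                # Keep every token in the window as-is.
                for (r, c), vec in zip(cells, vecs):
                    tokens.append(list(vec))
                    assignment[r][c] = len(tokens) - 1
    return tokens, assignment


# Toy 2x4 grid: the left window is uniform, the right window is mixed.
a, b = [1.0, 0.0], [0.0, 1.0]
grid = [[a, a, a, b],
        [a, a, a, b]]
tokens, assignment = local_merge(grid)
```

In this toy example the uniform left window becomes a single token while the mixed right window keeps all four, so 8 tokens are reduced to 5. The `assignment` map records where each original position went, which is what a later step would need to restore the full token resolution.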

What problem does this paper attempt to address?

The paper addresses the following problem: when plain Vision Transformers (ViTs) are used for semantic segmentation, how can computational efficiency be improved by reducing the number of tokens processed, without degrading segmentation quality? To this end, the authors propose a new method, Adaptive Local-then-Global Merging (ALGM), that achieves a better balance between efficiency and segmentation quality.

### Problem Background

1. **Computational Complexity Problem**:
   - The computational complexity of ViT's multi-head self-attention mechanism increases quadratically with the number of input pixels. This is especially pronounced on high-resolution images and results in low computational efficiency.
2. **Limitations of Existing Methods**:
   - **Token Pruning**: Effective for image classification, but not suitable for semantic segmentation because a prediction is needed for every token.
   - **Token Pausing/Halting**: The discarded tokens are retained and aggregated at the end, but this may still lead to a decline in segmentation quality.
   - **Token Sharing/Merging**: Maintains segmentation quality, but introduces additional computational overhead and is applied only once in certain layers, which limits the efficiency improvement.
   - **Token Merging Applied to Image Classification**: When applied directly to semantic segmentation, it leads to a significant decline in segmentation quality.

### Research Motivation

Based on the above problems, the authors set two main objectives:

1. **Early Local Merging**: Merge redundant tokens in the early network layers without relying on pre-processing networks, while maintaining segmentation quality.
2. **Global Merging**: Apply global token merging in the intermediate layers to further improve efficiency without compromising segmentation quality.

### Solution

ALGM addresses these problems in the following ways:

1. **Local Merging**: In the first network layer, ALGM adopts a local merging strategy that merges similar tokens within small windows.
2. **Global Merging**: In the intermediate layers of the network, ALGM adopts a global merging mechanism to reduce global token redundancy.
3. **Dynamically Determine the Number of Merges**: The number of tokens to be merged is determined dynamically according to the semantic complexity of the image content.
4. **Restore the Original Token Resolution**: The original token resolution is restored before the final segmentation prediction.

### Main Contributions

1. Propose a general token-merging framework that combines local and global merging to improve the efficiency and segmentation quality of ViT-based semantic segmentation networks.
2. Analyze the similarity between intra-class and inter-class tokens within local windows and across network layers.
3. Explore the reasons why ALGM improves segmentation quality, including a better balance between frequent and rare categories in self-attention operations, and token denoising.

Through these improvements, ALGM not only significantly improves throughput (by up to 100%) but also increases the mean Intersection over Union (mean IoU) by up to +1.1, thus achieving a better trade-off between efficiency and segmentation quality.
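
The global merging, content-dependent merge count, and resolution-restoring steps listed above can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's code: the alternating split into two token sets, the 0.9 similarity threshold, averaging as the merge rule, and the names `global_merge` and `restore` are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two token vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def global_merge(tokens, threshold=0.9):
    """Threshold-based global merging (hypothetical helper, illustrative only).

    Tokens are split into set A (even indices) and set B (odd indices).
    Each A token is merged into its most similar B token if the cosine
    similarity is >= threshold, so the number of merges depends on the
    content rather than on a fixed budget.
    Returns (merged, mapping), where mapping[i] is the index in merged
    of the token that represents original token i.
    """
    a_idx = list(range(0, len(tokens), 2))
    b_idx = list(range(1, len(tokens), 2))
    groups = {j: [j] for j in b_idx}  # members absorbed by each B token
    kept_a = []
    for i in a_idx:
        best_j, best_sim = None, -1.0
        for j in b_idx:
            sim = cosine(tokens[i], tokens[j])
            if sim > best_sim:
                best_j, best_sim = j, sim
        if best_j is not None and best_sim >= threshold:
            groups[best_j].append(i)
        else:
            kept_a.append(i)
    merged, mapping = [], [None] * len(tokens)
    for i in kept_a:
        mapping[i] = len(merged)
        merged.append(list(tokens[i]))
    for j in b_idx:
        members = groups[j]
        for m in members:
            mapping[m] = len(merged)
        # Represent the group by the mean of its members.
        merged.append([sum(col) / len(members)
                       for col in zip(*(tokens[m] for m in members))])
    return merged, mapping


def restore(merged, mapping):
    """Copy each merged token back to every original position it represents."""
    return [list(merged[k]) for k in mapping]


tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
merged, mapping = global_merge(tokens)
restored = restore(merged, mapping)

# A uniform input merges more aggressively than a varied one.
uniform_merged, _ = global_merge([[1.0, 0.0]] * 6)
```

Here the four toy tokens are reduced to three, and `restore` returns a sequence of the original length so that a prediction exists for every position. With six identical tokens, every set-A token is merged and only three tokens remain, which illustrates how a threshold makes the reduction depend on image content.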

ALGM: Adaptive Local-then-Global Token Merging for Efficient Semantic Segmentation with Plain Vision Transformers
