ALGM: Adaptive Local-then-Global Token Merging for Efficient Semantic Segmentation with Plain Vision Transformers

Narges Norouzi, Svetlana Orlova, Daan de Geus, Gijs Dubbelman
2024-06-14
Abstract: This work presents Adaptive Local-then-Global Merging (ALGM), a token reduction method for semantic segmentation networks that use plain Vision Transformers. ALGM merges tokens in two stages: (1) In the first network layer, it merges similar tokens within a small local window and (2) halfway through the network, it merges similar tokens across the entire image. This is motivated by an analysis in which we found that, in those situations, tokens with a high cosine similarity can likely be merged without a drop in segmentation quality. With extensive experiments across multiple datasets and network configurations, we show that ALGM not only significantly improves the throughput by up to 100%, but can also enhance the mean IoU by up to +1.1, thereby achieving a better trade-off between segmentation quality and efficiency than existing methods. Moreover, our approach is adaptive during inference, meaning that the same model can be used for optimal efficiency or accuracy, depending on the application. Code is available at https://tue-mps.github.io/ALGM.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the following problem: when a plain Vision Transformer (ViT) is used for semantic segmentation, how can computational efficiency be improved by reducing the number of tokens processed, without degrading segmentation quality? To this end, the authors propose Adaptive Local-then-Global Merging (ALGM), which aims for a better balance between efficiency and segmentation quality.

### Problem Background

1. **Computational complexity**: The cost of ViT's multi-head self-attention grows quadratically with the number of input tokens, and therefore with the number of image pixels. This makes high-resolution images especially expensive to process.
2. **Limitations of existing methods**:
   - **Token pruning**: Effective for image classification, but unsuitable for semantic segmentation, where a prediction is needed for every token.
   - **Token pausing/halting**: The discarded tokens are retained and aggregated at the end, but segmentation quality may still decline.
   - **Token sharing/merging**: Maintains segmentation quality, but introduces additional computational overhead and is applied only once, in certain layers, which limits the efficiency gain.
   - **Token merging designed for image classification**: When applied directly to semantic segmentation, it causes a significant drop in segmentation quality.

### Research Motivation

Based on the problems above, the authors set two main objectives:

1. **Early local merging**: Merge redundant tokens in the early network layers, without relying on a pre-processing network, while maintaining segmentation quality.
2. **Global merging**: Apply global token merging in the intermediate layers to further improve efficiency without compromising segmentation quality.

### Solution

ALGM addresses these problems as follows:

1. **Local merging**: In the first network layer, ALGM merges similar tokens within small local windows.
2. **Global merging**: In the intermediate layers of the network, ALGM merges similar tokens across the entire image to reduce global token redundancy.
3. **Dynamic number of merges**: The number of tokens to merge is determined dynamically, according to the semantic complexity of the image content.
4. **Restoring the original token resolution**: Before the final segmentation prediction, the original token resolution is restored.

### Main Contributions

1. A general token-merging framework that combines local and global merging to improve the efficiency and segmentation quality of ViT-based semantic segmentation networks.
2. An analysis of the similarity between intra-class and inter-class tokens within local windows and across network layers.
3. An exploration of why ALGM improves segmentation quality, including a better balance between frequent and rare categories in the self-attention operations and token denoising.

With these improvements, ALGM not only increases throughput by up to 100% but also raises the mean Intersection over Union (mean IoU) by up to +1.1, achieving a better trade-off between efficiency and segmentation quality.
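
To make the local merging and resolution-restoring steps more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function names `local_merge` and `unmerge`, the 2×2 window, the threshold `tau = 0.9`, and the rule of comparing each token with the window mean are all assumptions chosen for illustration. Global merging is not shown.

```python
import numpy as np


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def local_merge(tokens, window=2, tau=0.9):
    """Merge the tokens of a window into one token when they are all similar.

    tokens: (H, W, C) grid of token embeddings; H and W divisible by `window`.
    Returns (merged, assignment), where `merged` has shape (N, C) and
    `assignment` is an (H, W) integer map from grid position to row of `merged`.
    Hypothetical rule: merge if every token's cosine similarity to the
    window mean is at least `tau`.
    """
    H, W, C = tokens.shape
    merged = []
    assignment = np.empty((H, W), dtype=int)
    for i in range(0, H, window):
        for j in range(0, W, window):
            block = tokens[i:i + window, j:j + window].reshape(-1, C)
            mean = block.mean(axis=0)
            sims = [cosine_similarity(t, mean) for t in block]
            if min(sims) >= tau:
                # Homogeneous window: keep a single averaged token.
                assignment[i:i + window, j:j + window] = len(merged)
                merged.append(mean)
            else:
                # Heterogeneous window: keep every token separately.
                for k, t in enumerate(block):
                    di, dj = divmod(k, window)
                    assignment[i + di, j + dj] = len(merged)
                    merged.append(t)
    return np.stack(merged), assignment


def unmerge(merged, assignment):
    """Restore the original (H, W, C) grid by copying merged tokens back."""
    return merged[assignment]


# Toy 4x4 grid: two uniform windows (mergeable) and two windows of
# mutually orthogonal one-hot tokens (not mergeable).
tokens = np.zeros((4, 4, 4))
tokens[:2, :2] = 1.0
tokens[2:, 2:] = 1.0
tokens[:2, 2:] = np.eye(4).reshape(2, 2, 4)
tokens[2:, :2] = np.eye(4).reshape(2, 2, 4)

merged, assignment = local_merge(tokens)
print(merged.shape[0])                      # 10 tokens instead of 16
print(unmerge(merged, assignment).shape)    # (4, 4, 4)
```

Because the number of merged tokens depends on how many windows pass the threshold, the reduction adapts to the image content: a homogeneous image yields fewer tokens than a cluttered one. Since self-attention cost is quadratic in the token count, going from 16 to 10 tokens in this toy case would cut that cost to roughly (10/16)² ≈ 39% of the original.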