AiluRus: A Scalable ViT Framework for Dense Prediction

Jin Li, Yaoming Wang, Xiaopeng Zhang, Bowen Shi, Dongsheng Jiang, Chenglin Li, Wenrui Dai, Hongkai Xiong, Qi Tian
2023-11-02
Abstract: Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance. However, the complexity of ViTs increases significantly when handling long token sequences, especially in dense prediction tasks that require high-resolution input. Notably, dense prediction tasks such as semantic segmentation or object detection place more emphasis on the contours or shapes of objects, while the texture inside objects is less informative. Motivated by this observation, we propose to apply adaptive resolution to different regions of the image according to their importance. Specifically, at the intermediate layer of the ViT, we use a spatial-aware density-based clustering algorithm to select representative tokens from the token sequence. Once the representative tokens are determined, we merge the other tokens into their closest representative token. Consequently, semantically similar tokens are merged together to form low-resolution regions, while semantically irrelevant tokens are preserved independently as high-resolution regions. This strategy effectively reduces the number of tokens, allowing subsequent layers to handle a shorter token sequence and thus run faster. We evaluate our proposed method on three different datasets and observe promising performance. For example, the "Segmenter ViT-L" model can be accelerated by 48% in FPS without fine-tuning, while maintaining its performance. Additionally, our method can be applied to accelerate fine-tuning as well. Experimental results demonstrate that we can save 52% of training time while accelerating FPS by 2.46 times with only a 0.09% performance drop. The code is available at https://github.com/caddyless/ailurus/tree/main.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the sharp increase in complexity that Vision Transformers (ViTs) incur when handling long token sequences in dense prediction tasks. Specifically:

- **Problem Background**: Although ViTs perform well on vision tasks, their computational complexity rises sharply with high-resolution inputs, especially in dense prediction tasks such as semantic segmentation or object detection.
- **Key Observation**: Dense prediction tasks focus more on the contours or shapes of objects, while the texture inside objects is less informative.
- **Solution**: The authors propose AiluRus, which accelerates ViTs through adaptive resolution. At an intermediate layer of the ViT, a spatial-aware density-based clustering algorithm selects representative tokens, and every other token is merged into its closest representative. This reduces the number of tokens, so the subsequent layers process a shorter sequence.
- **Effect Demonstration**: The method is evaluated on three different datasets and improves inference speed while maintaining performance. For example, the "Segmenter ViT-L" model gains 48% in FPS without fine-tuning. When applied during fine-tuning, the method saves 52% of training time while accelerating FPS by 2.46 times, with only a 0.09% performance drop.

In summary, the paper uses an adaptive resolution strategy to make ViTs more efficient on dense prediction tasks while preserving performance.
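The select-then-merge step described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation (see their repository for that). It assumes a density-peaks style selection rule, and it models "spatial-aware" by adding a weighted grid distance to the feature distance. The function name `merge_tokens` and the parameters `num_rep`, `k`, and `spatial_weight` are hypothetical, introduced only for illustration.

```python
import numpy as np


def merge_tokens(tokens, grid_hw, num_rep, k=4, spatial_weight=0.5):
    """Pick num_rep representative tokens and average the rest into them.

    tokens:  (N, D) token features laid out row-major on an H x W grid.
    Returns: merged (num_rep, D) and assign (N,), the cluster id of each token.
    Illustrative only: the scoring rule and spatial term are assumptions.
    """
    h, w = grid_hw
    n = tokens.shape[0]
    assert n == h * w and k < n and num_rep <= n

    # Pairwise feature distances, scaled to [0, 1].
    feat = np.linalg.norm(tokens[:, None, :] - tokens[None, :, :], axis=-1)
    feat = feat / (feat.max() + 1e-8)

    # Pairwise distances between normalized grid positions (assumed spatial term).
    ys, xs = np.divmod(np.arange(n), w)
    pos = np.stack([ys / max(h - 1, 1), xs / max(w - 1, 1)], axis=1)
    spat = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    dist = feat + spatial_weight * spat

    # Local density: large when the k nearest neighbours are close.
    knn = np.sort(dist, axis=1)[:, 1:k + 1]  # column 0 is the token itself
    rho = np.exp(-(knn ** 2).mean(axis=1))

    # Distance from each token to the nearest token of higher density.
    higher = rho[None, :] > rho[:, None]  # higher[i, j]: rho[j] > rho[i]
    delta = np.where(higher, dist, np.inf).min(axis=1)
    delta[np.isinf(delta)] = dist.max()  # densest token(s) have no such neighbour

    # Representatives are dense tokens that lie far from any denser token.
    reps = np.argsort(-(rho * delta))[:num_rep]

    # Assign every token to its closest representative.
    assign = dist[:, reps].argmin(axis=1)
    assign[reps] = np.arange(num_rep)  # each representative keeps its own cluster

    # Merge by averaging the tokens in each cluster.
    merged = np.zeros((num_rep, tokens.shape[1]))
    np.add.at(merged, assign, tokens)
    merged /= np.bincount(assign, minlength=num_rep)[:, None]
    return merged, assign


# Usage: 14 x 14 = 196 tokens reduced to 49 before the remaining layers.
rng = np.random.default_rng(0)
x = rng.normal(size=(196, 32))
merged, assign = merge_tokens(x, (14, 14), num_rep=49)
restored = merged[assign]  # one simple way to map back to the full grid
print(merged.shape, restored.shape)
```

Because self-attention cost grows quadratically with sequence length, layers that run on the 49 merged tokens instead of all 196 process a much smaller attention matrix. Indexing with `assign` shows one simple way to broadcast merged features back to every grid position for a dense output.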