Token Compensator: Altering Inference Cost of Vision Transformer without Re-Tuning

Shibo Jie, Yehui Tang, Jianyuan Guo, Zhi-Hong Deng, Kai Han, Yunhe Wang
2024-08-13
Abstract: Token compression expedites the training and inference of Vision Transformers (ViTs) by reducing the number of redundant tokens, e.g., pruning inattentive tokens or merging similar tokens. However, when applied to downstream tasks, these approaches suffer a significant performance drop when the compression degrees are mismatched between the training and inference stages, which limits the application of token compression to off-the-shelf trained models. In this paper, we propose a model arithmetic framework to decouple the compression degrees between the two stages. In advance, we perform an additional fast, parameter-efficient self-distillation stage on the pre-trained models to obtain a small plugin, called Token Compensator (ToCom), which describes the gap between models across different compression degrees. During inference, ToCom can be directly inserted into any downstream off-the-shelf model with any mismatched training and inference compression degrees to obtain universal performance improvements without further training. Experiments on over 20 downstream tasks demonstrate the effectiveness of our framework. On CIFAR100, fine-grained visual classification, and VTAB-1k, ToCom yields maximum improvements of 2.3%, 1.5%, and 2.0%, respectively, in the average performance of DeiT-B. Code: https://github.com/JieShibo/ToCom
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the performance degradation that occurs when a Vision Transformer (ViT) uses different token compression degrees during training and inference. Existing token compression methods lose significant accuracy when the two degrees are mismatched, which limits the use of off-the-shelf trained models in downstream tasks, especially when the computational cost must be adjusted dynamically.

#### Main problems:

1. **Performance degradation from mismatched compression degrees**: When existing token compression methods (such as pruning inattentive tokens or merging similar tokens) are applied to downstream tasks, performance drops significantly if the compression degrees used in training and inference do not match.
2. **Poor application flexibility**: To serve different compute budgets, a model usually has to be retrained for each compression degree, which incurs substantial training and storage costs.
3. **Difficulty adjusting computational cost dynamically**: In practice, the computational cost of a model may need to change with server load, and existing methods do not support this well.

#### Solutions:

The authors propose Token Compensator (ToCom), a lightweight plugin that compensates for the gap between compression degrees without retraining the downstream model, improving performance when the training and inference compression degrees differ.
Specifically:

- **Decoupling the compression degree**: ToCom decouples the compression degrees of the training and inference stages, so the degree can be changed at inference time while mitigating the performance drop caused by the mismatch.
- **Fast self-distillation**: A fast, parameter-efficient self-distillation stage on the pre-trained model produces a small plugin that describes the gap between models at different compression degrees.
- **Wide applicability**: ToCom can be inserted directly into any off-the-shelf downstream model, with no further training, and applies to a range of downstream tasks and token compression methods.

Together, these properties let off-the-shelf models retain more of their performance across compression degrees, improving the flexibility and efficiency of deploying ViT models.

### Summary

The main contribution of this paper is Token Compensator (ToCom), a framework that mitigates the performance degradation caused by mismatched compression degrees between the training and inference stages of Vision Transformers. ToCom makes the model more flexible and reduces the need for retraining, so a ViT can adapt to different computational budgets.
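The model arithmetic idea summarized above can be illustrated with a toy sketch. It assumes, purely for illustration, that a plugin is stored as a per-parameter difference between two copies of the pre-trained model adapted to different compression degrees, and that inserting it means adding that difference to a downstream model's weights. The names `weight_gap` and `apply_plugin`, the flat list-of-floats weight format, and the numbers are all hypothetical; the actual ToCom plugin is obtained by parameter-efficient self-distillation, and its exact form is not described in this summary.

```python
from typing import Dict, List

# Hypothetical weight format: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def weight_gap(source: Weights, target: Weights) -> Weights:
    """Per-parameter difference (target - source) between two models.

    Illustrative stand-in for a plugin that "describes the gap" between
    a model at a source compression degree and one at a target degree.
    """
    return {
        name: [t - s for s, t in zip(source[name], target[name])]
        for name in source
    }


def apply_plugin(model: Weights, plugin: Weights) -> Weights:
    """Add the plugin to a model's weights without any further training.

    Parameters that the plugin does not mention are left unchanged.
    """
    return {
        name: [w + d for w, d in zip(values, plugin.get(name, [0.0] * len(values)))]
        for name, values in model.items()
    }


# Toy numbers only: a pre-trained model at the source degree and the same
# model adapted to the target degree (hypothetical values).
pretrained_source = {"proj": [1.0, 2.0], "head": [0.5]}
pretrained_target = {"proj": [1.5, 1.0], "head": [0.5]}
plugin = weight_gap(pretrained_source, pretrained_target)

# A downstream model fine-tuned at the source degree, to be run at the
# target degree after the plugin is inserted.
downstream = {"proj": [3.0, 4.0], "head": [2.0]}
compensated = apply_plugin(downstream, plugin)
print(compensated)
```

The point of the sketch is that the plugin is computed once from the pre-trained model and then reused on a downstream model it has never seen, which is what allows the inference compression degree to change without retraining.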