Abstract: Token compression expedites the training and inference of Vision Transformers (ViTs) by reducing the number of redundant tokens, e.g., pruning inattentive tokens or merging similar tokens. However, when applied to downstream tasks, these approaches suffer a significant performance drop when the compression degrees are mismatched between the training and inference stages, which limits the application of token compression to off-the-shelf trained models. In this paper, we propose a model arithmetic framework to decouple the compression degrees between the two stages. In advance, we additionally perform a fast parameter-efficient self-distillation stage on the pre-trained models to obtain a small plugin, called Token Compensator (ToCom), which describes the gap between models across different compression degrees. During inference, ToCom can be directly inserted into any downstream off-the-shelf model with any mismatched training and inference compression degrees to acquire universal performance improvements without further training. Experiments on over 20 downstream tasks demonstrate the effectiveness of our framework. On CIFAR100, fine-grained visual classification, and VTAB-1k, ToCom yields improvements of up to 2.3%, 1.5%, and 2.0%, respectively, in the average performance of DeiT-B. Code: https://github.com/JieShibo/ToCom
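
The two kinds of token compression named in the abstract (pruning inattentive tokens and merging similar tokens) can be illustrated with a minimal NumPy sketch. This is a generic illustration, not the procedure of any specific method: the function names, the use of class-token attention as the importance score, and the merge-by-averaging rule are all assumptions made here for clarity.

```python
import numpy as np


def prune_tokens(tokens, cls_attention, keep):
    """Keep the `keep` patch tokens that receive the most class-token attention.

    tokens:        (num_tokens, dim) float array of patch tokens
    cls_attention: (num_tokens,) attention weight from the class token to each patch
    Returns the kept tokens and their original indices (in original order).
    Assumption: class-token attention is used as the importance score.
    """
    # argsort is ascending, so the last `keep` entries are the most attended tokens;
    # sorting the indices again preserves the original token order.
    top = np.sort(np.argsort(cls_attention)[-keep:])
    return tokens[top], top


def merge_most_similar_pair(tokens):
    """Merge the two most cosine-similar tokens into their mean (one token fewer).

    Assumption: tokens are non-zero float vectors; averaging is one simple merge rule.
    """
    unit = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # a token must not be paired with itself
    i, j = np.unravel_index(np.argmax(sim), sim.shape)
    i, j = min(i, j), max(i, j)
    merged = (tokens[i] + tokens[j]) / 2.0
    kept = np.delete(tokens, j, axis=0)  # j > i, so index i is still valid
    kept[i] = merged
    return kept
```

In both cases the "compression degree" corresponds to how many tokens are removed per layer, which is the quantity that can differ between training and inference.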

What problem does this paper attempt to address?

This paper addresses the performance degradation that occurs when a Vision Transformer (ViT) uses different token compression degrees during training and inference. When the two degrees are inconsistent, existing token compression methods lose a significant amount of accuracy. This limits the use of off-the-shelf trained models in downstream tasks, especially when the computational cost needs to be adjusted dynamically.

#### Main problems

1. **Performance degradation due to inconsistent compression degrees in the training and inference stages**: When existing token compression methods (such as pruning unimportant tokens or merging similar tokens) are applied to downstream tasks, a mismatch between the training and inference compression degrees leads to a significant decline in performance.
2. **Poor application flexibility**: To adapt to different computational resource requirements, the model usually has to be retrained for each compression degree, which incurs substantial training and storage costs.
3. **Challenges in dynamically adjusting computational costs**: In practical applications, it may be necessary to adjust the computational cost of the model dynamically according to the server load, and existing methods do not support this well.

#### Solutions

The authors propose a framework named Token Compensator (ToCom) to solve the above problems. ToCom is a lightweight plugin that compensates for the gap between different compression degrees without retraining, thereby improving the performance of the model when the training and inference compression degrees differ. Specifically:

- **Decoupling the compression degree**: ToCom decouples the compression degrees of the training and inference stages, so that the compression degree can be adjusted flexibly during inference while limiting the performance drop caused by the mismatch.
- **Fast self-distillation**: A fast, parameter-efficient self-distillation stage on the pre-trained model produces a small plugin that describes the gap between models at different compression degrees.
- **Wide applicability**: ToCom can be directly inserted into any downstream off-the-shelf model and is suitable for various downstream tasks and different token compression methods.

Through these improvements, ToCom enables trained models to maintain high performance under different compression degrees, thereby improving the application flexibility and efficiency of ViT models.

### Summary

The main contribution of this paper is Token Compensator (ToCom), a framework that addresses the performance degradation caused by inconsistent compression degrees in the training and inference stages of Vision Transformers. ToCom improves the flexibility of the model and reduces the need for retraining, enabling ViT models to better adapt to different computational resource requirements.
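
The model-arithmetic idea described above (a plugin that captures the gap between compression degrees and is added to a downstream model at inference time) can be sketched as plain weight arithmetic. The sketch below is only an illustration under stated assumptions: the low-rank form of the plugin, the decomposition into one delta per step between adjacent compression degrees, the sign convention, and all names (`step_deltas`, `gap`, `compensate`) are hypothetical and are not taken from the paper or its code. The random deltas stand in for what self-distillation on the pre-trained model would learn.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, rank = 8, 2  # toy sizes, chosen only for this example


def low_rank_delta(dim, rank, rng):
    """Hypothetical plugin for one step between adjacent compression degrees.

    Assumption: the plugin is a low-rank weight update (parameter-efficient),
    here filled with random values in place of learned ones.
    """
    down = rng.normal(size=(dim, rank)) * 0.01
    up = rng.normal(size=(rank, dim)) * 0.01
    return down @ up


# Assumption: compression degrees 0..3, with one delta per adjacent step.
step_deltas = [low_rank_delta(dim, rank, rng) for _ in range(3)]


def gap(source, target):
    """Weight gap from `source` degree to `target` degree.

    Assumption: steps are added when moving to a higher degree and
    subtracted when moving to a lower one.
    """
    lo, hi = sorted((source, target))
    total = sum(step_deltas[lo:hi], np.zeros((dim, dim)))
    return total if target >= source else -total


def compensate(w_downstream, source, target):
    """Adapt weights trained at `source` degree for inference at `target` degree."""
    return w_downstream + gap(source, target)
```

Because the plugin is obtained from the pre-trained model rather than from any particular downstream task, the same `step_deltas` would be reused for every downstream model in this sketch; only the `source` and `target` degrees change.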

Token Compensator: Altering Inference Cost of Vision Transformer without Re-Tuning

Multi-Dimension Compression of Feed-Forward Network in Vision Transformers

CAIT: Triple-Win Compression towards High Accuracy, Fast Inference, and Favorable Transferability For ViTs

DiffRate: Differentiable Compression Rate for Efficient Vision Transformers

Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers

Making Vision Transformers Efficient from A Token Sparsification View

COMCAT: Towards Efficient Compression and Customization of Attention-Based Vision Models

Efficient Vision Transformer via Token Merger

TPC-ViT: Token Propagation Controller for Efficient Vision Transformer

Efficient Online Inference of Vision Transformers by Training-Free Tokenization

UCC: A Unified Cascade Compression Framework for Vision Transformer Models

SAViT: Structure-Aware Vision Transformer Pruning Via Collaborative Optimization

Unified Visual Transformer Compression

Vote&Mix: Plug-and-Play Token Reduction for Efficient Vision Transformer

Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention

GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation

No Token Left Behind: Efficient Vision Transformer via Dynamic Token Idling

PPT: Token Pruning and Pooling for Efficient Vision Transformers

Make A Long Image Short: Adaptive Token Length for Vision Transformers

Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer

Make Your ViT-based Multi-view 3D Detectors Faster via Token Compression