General Compression Framework for Efficient Transformer Object Tracking

Lingyi Hong, Jinglun Li, Xinyu Zhou, Shilin Yan, Pinxue Guo, Kaixun Jiang, Zhaoyu Chen, Shuyong Gao, Wei Zhang, Hong Lu, Wenqiang Zhang
2024-09-26
Abstract: Transformer-based trackers have established a dominant role in the field of visual object tracking. While these trackers exhibit promising performance, their deployment on resource-constrained devices remains challenging due to inefficiencies. To improve inference efficiency and reduce computation cost, prior approaches have aimed either to design lightweight trackers or to distill knowledge from larger teacher models into more compact student trackers. However, these solutions often sacrifice accuracy for speed. Thus, we propose a general model compression framework for efficient transformer object tracking, named CompressTracker, to reduce the size of a pre-trained tracking model into a lightweight tracker with minimal performance degradation. Our approach features a novel stage division strategy that segments the transformer layers of the teacher model into distinct stages, enabling the student model to emulate each corresponding teacher stage more effectively. Additionally, we design a unique replacement training technique that randomly substitutes specific stages in the student model with those from the teacher model, as opposed to training the student model in isolation. Replacement training enhances the student model's ability to replicate the teacher model's behavior. To further encourage the student model to emulate the teacher model, we incorporate prediction guidance and stage-wise feature mimicking to provide additional supervision during the teacher model's compression process. Our framework CompressTracker is structurally agnostic, making it compatible with any transformer architecture. We conduct a series of experiments to verify the effectiveness and generalizability of CompressTracker. Our CompressTracker-4 with 4 transformer layers, which is compressed from OSTrack, retains about 96% of the original performance on LaSOT (66.1% AUC) while achieving a 2.17x speedup.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

The paper addresses the difficulty of deploying Transformer-based visual object trackers on resource-constrained devices. Although existing Transformer-based trackers achieve strong accuracy, their high computational cost and slow inference limit their use in practical applications. Existing methods improve efficiency either by designing lightweight trackers or by distilling the knowledge of large teacher models into smaller student models, but both routes often trade accuracy for speed. The authors therefore propose CompressTracker, a general model compression framework that compresses a pre-trained Transformer tracking model into a lightweight tracker with minimal performance degradation.

Specifically, CompressTracker relies on the following techniques:

1. **Stage Division Strategy**: The Transformer layers of the teacher model are divided into multiple stages, so that each stage of the student model can mimic the behavior of its corresponding teacher stage.
2. **Replacement Training Technique**: During training, specific stages of the student model are randomly replaced with the corresponding stages of the teacher model, rather than training the student in isolation. This enhances the student model's ability to replicate the teacher model's behavior.
3. **Prediction Guidance and Stage-wise Feature Mimicking**: The teacher model's predictions and stage-level feature representations provide additional supervision for the student model.

With these techniques, CompressTracker accelerates inference while keeping accuracy close to that of the teacher. Because the framework is structurally agnostic, it is compatible with any Transformer architecture. The experiments reported by the authors support its effectiveness and generalizability. For example, CompressTracker-4, a 4-layer model compressed from OSTrack, retains about 96% of the original performance on the LaSOT dataset (66.1% AUC) while achieving a 2.17x speedup.
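
The stage division, replacement training, and stage-wise feature mimicking described above can be illustrated with a small toy sketch. It is not the authors' implementation: every function name is hypothetical, scalar functions stand in for transformer layers, the 12-layer teacher and the replacement probability are assumptions chosen for the toy, and a squared difference is assumed as the mimicking loss because the summary does not specify one. Prediction guidance is left out.

```python
import random
from typing import Callable, List, Sequence, Tuple

# A "layer" here is just a scalar function; a real tracker would use
# transformer blocks operating on token tensors.
Layer = Callable[[float], float]
Stage = List[Layer]


def divide_into_stages(layers: Sequence[Layer], num_stages: int) -> List[Stage]:
    """Split layers into num_stages contiguous stages of equal size."""
    if num_stages <= 0 or len(layers) % num_stages != 0:
        raise ValueError("layer count must be a positive multiple of num_stages")
    size = len(layers) // num_stages
    return [list(layers[i * size:(i + 1) * size]) for i in range(num_stages)]


def run_stage(stage: Stage, x: float) -> float:
    """Apply every layer of one stage in order."""
    for layer in stage:
        x = layer(x)
    return x


def replacement_forward(
    teacher_stages: Sequence[Stage],
    student_stages: Sequence[Stage],
    x: float,
    p_replace: float,
    rng: random.Random,
) -> Tuple[float, List[bool]]:
    """Forward pass in which each student stage is swapped for the matching
    teacher stage with probability p_replace (assumed sampling scheme)."""
    used_teacher = []
    for t_stage, s_stage in zip(teacher_stages, student_stages):
        swap = rng.random() < p_replace
        x = run_stage(t_stage if swap else s_stage, x)
        used_teacher.append(swap)
    return x, used_teacher


def feature_mimicking_loss(
    teacher_stages: Sequence[Stage],
    student_stages: Sequence[Stage],
    x: float,
) -> float:
    """Sum of squared differences between teacher and student stage outputs
    (squared error is an assumption, not taken from the paper)."""
    loss, t_x, s_x = 0.0, x, x
    for t_stage, s_stage in zip(teacher_stages, student_stages):
        t_x = run_stage(t_stage, t_x)
        s_x = run_stage(s_stage, s_x)
        loss += (t_x - s_x) ** 2
    return loss


# Toy setup: an assumed 12-layer teacher and a 4-layer student, 4 stages each.
teacher_layers = [lambda x: x + 1.0 for _ in range(12)]
student_layers = [lambda x: x + 3.0 for _ in range(4)]
teacher_stages = divide_into_stages(teacher_layers, 4)  # 3 layers per stage
student_stages = divide_into_stages(student_layers, 4)  # 1 layer per stage

output, swaps = replacement_forward(
    teacher_stages, student_stages, 0.0, p_replace=0.5, rng=random.Random(0)
)
```

In this toy, each student stage already reproduces its teacher stage, so the output is the same whichever stages are swapped in and the mimicking loss is zero. In actual training the student would start out mismatched, and the loss would be minimized together with the tracking objective.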