Abstract: Transformer-based trackers have established a dominant role in visual object tracking. While these trackers exhibit promising performance, their deployment on resource-constrained devices remains challenging because of their high computation cost and slow inference. To improve inference efficiency and reduce computation cost, prior approaches have either designed lightweight trackers or distilled knowledge from larger teacher models into more compact student trackers. However, these solutions often sacrifice accuracy for speed. We therefore propose a general model compression framework for efficient transformer object tracking, named CompressTracker, which compresses a pre-trained tracking model into a lightweight tracker with minimal performance degradation. Our approach features a novel stage division strategy that segments the transformer layers of the teacher model into distinct stages, enabling the student model to emulate each corresponding teacher stage more effectively. We also design a replacement training technique that randomly substitutes specific stages in the student model with those from the teacher model, as opposed to training the student model in isolation. Replacement training enhances the student model's ability to replicate the teacher model's behavior. To further force the student model to emulate the teacher model, we incorporate prediction guidance and stage-wise feature mimicking to provide additional supervision while the teacher model is being compressed. Our framework CompressTracker is structurally agnostic, making it compatible with any transformer architecture. We conduct a series of experiments to verify the effectiveness and generalizability of CompressTracker. Our CompressTracker-4 with 4 transformer layers, which is compressed from OSTrack, retains about 96% performance on LaSOT (66.1% AUC) while achieving a 2.17x speedup.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

The paper addresses the inefficiency of deploying Transformer-based visual object tracking models on resource-constrained devices. Although existing Transformer-based trackers achieve strong accuracy, their high computational cost and low inference efficiency limit their use in practical applications. Existing methods try to improve inference efficiency and reduce computational cost by designing lightweight trackers or by distilling the knowledge of large teacher models into smaller student models, but they often trade accuracy for speed. The authors therefore propose CompressTracker, a general model compression framework that compresses pre-trained Transformer tracking models into lightweight trackers while minimizing performance degradation.

CompressTracker relies on the following techniques:

1. **Stage Division Strategy**: Dividing the Transformer layers of the teacher model into multiple stages, allowing the student model to more effectively mimic the behavior of each corresponding stage.
2. **Replacement Training Technique**: Randomly replacing specific stages of the student model with the corresponding stages of the teacher model during training, enhancing the student model's ability to replicate the teacher model's behavior.
3. **Prediction Guidance and Stage Feature Imitation**: Supervising the learning process of the student model through the predictions and feature representations of the teacher model, which provides additional supervision during compression.

Together, these techniques significantly accelerate model inference while maintaining high accuracy, and the framework can be applied to any Transformer architecture. The reported experiments show strong results on multiple benchmark datasets. For example, CompressTracker-4 retains about 96% of the original performance on the LaSOT dataset while achieving a 2.17x speedup.
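The three techniques can be illustrated with a small, framework-free sketch. It is not the authors' implementation: the function names, the per-stage coin flip with probability `p_replace`, the near-equal contiguous split, and the scalar "features" are all assumptions made for illustration. A real tracker would use transformer blocks and tensors, and would keep the teacher frozen.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Toy stand-in for a group of transformer layers (assumption: scalar in, scalar out).
Stage = Callable[[float], float]


def divide_into_stages(num_layers: int, num_stages: int) -> List[List[int]]:
    """Split layer indices 0..num_layers-1 into contiguous, near-equal stages."""
    if not 1 <= num_stages <= num_layers:
        raise ValueError("need 1 <= num_stages <= num_layers")
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for s in range(num_stages):
        size = base + (1 if s < extra else 0)  # earlier stages absorb the remainder
        stages.append(list(range(start, start + size)))
        start += size
    return stages


def sample_replacement_mask(
    num_stages: int, p_replace: float, rng: random.Random
) -> List[bool]:
    """Assumed sampling rule: each stage is swapped for the teacher's
    independently with probability p_replace (True = use the teacher stage)."""
    return [rng.random() < p_replace for _ in range(num_stages)]


def mixed_forward(
    x: float,
    teacher_stages: Sequence[Stage],
    student_stages: Sequence[Stage],
    use_teacher: Sequence[bool],
) -> Tuple[float, List[float]]:
    """Run a forward pass in which selected student stages are replaced
    by the corresponding teacher stages; also return per-stage outputs."""
    features = []
    for t_stage, s_stage, swap in zip(teacher_stages, student_stages, use_teacher):
        x = t_stage(x) if swap else s_stage(x)
        features.append(x)
    return x, features


def feature_mimic_loss(
    student_feats: Sequence[float], teacher_feats: Sequence[float]
) -> float:
    """Mean squared error between matching stage outputs (one possible
    choice of stage-wise feature-mimicking loss)."""
    if len(student_feats) != len(teacher_feats) or not student_feats:
        raise ValueError("feature lists must be non-empty and the same length")
    total = sum((s - t) ** 2 for s, t in zip(student_feats, teacher_feats))
    return total / len(student_feats)
```

For instance, `divide_into_stages(12, 4)` groups a 12-layer teacher into four stages of three layers each, so a 4-layer student can pair one layer with each teacher stage. Because other stages may come from the teacher during training, each student stage is trained inside a pipeline that partly reflects the teacher's behavior instead of in isolation.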

General Compression Framework for Efficient Transformer Object Tracking
