DMTrack: learning deformable masked visual representations for single object tracking

Omar Abdelaziz, Mohamed Shehata
DOI: https://doi.org/10.1007/s11760-024-03713-0
IF: 1.583
2024-12-06
Signal Image and Video Processing
Abstract: Single object tracking is still challenging because it requires localizing an arbitrary object in a sequence of frames, given only its appearance in the first frame of the sequence. Many trackers, especially those leveraging the Vision Transformer (ViT) backbone, have achieved superior performance. However, the gap between the performance metrics measured on the training data and those on the test data is still large. To alleviate this issue, we propose a deformable masking module for transformer-based trackers. The deformable masking module is injected after each layer of the ViT. First, it masks out complete vectors of the output representations of the ViT layer. After that, the masked representations are fed into a deformable convolution to reconstruct new, reliable representations. The output of the last layer of the ViT is modified by fusing it with all intermediate outputs of the deformable masking modules to produce a final robust attentional feature map. We extensively evaluate the performance of our model, named DMTrack, on seven different tracking benchmarks. Our model outperforms the previous state-of-the-art techniques by ( ) while having fewer parameters ( ). Moreover, our model matches the performance of models with far more parameters, indicating the effectiveness of our training strategy.
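The two steps of the deformable masking module described in the abstract (masking complete token vectors, then reconstructing them with a deformable convolution) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the mask ratio, the 3x3 kernel, the zero-valued mask fill, the random offsets (a trained model would normally predict them from the input), and all function names are assumptions made for illustration. The fusion of intermediate outputs with the last ViT layer is not shown, since the abstract does not specify how it is done.

```python
import numpy as np

def mask_token_vectors(tokens, mask_ratio, rng):
    """Zero out complete token vectors (whole rows) of an (N, C) array.
    Assumption: masked tokens are replaced by zeros."""
    n = tokens.shape[0]
    n_masked = int(round(mask_ratio * n))
    idx = rng.choice(n, size=n_masked, replace=False)
    out = tokens.copy()
    out[idx] = 0.0
    return out, idx

def bilinear_sample(fmap, y, x):
    """Sample an (H, W, C) map at fractional (y, x); outside the map counts as zero."""
    h, w, c = fmap.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    val = np.zeros(c)
    for yy, wy in ((y0, 1.0 - (y - y0)), (y0 + 1, y - y0)):
        for xx, wx in ((x0, 1.0 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yy < h and 0 <= xx < w:
                val += wy * wx * fmap[yy, xx]
    return val

def deformable_conv3x3(fmap, weight, offsets):
    """Naive 3x3 deformable convolution.
    fmap: (H, W, C_in); weight: (3, 3, C_in, C_out);
    offsets: (H, W, 3, 3, 2) holding a (dy, dx) shift per kernel tap."""
    h, w, _ = fmap.shape
    out = np.zeros((h, w, weight.shape[3]))
    for i in range(h):
        for j in range(w):
            for ki in range(3):
                for kj in range(3):
                    dy, dx = offsets[i, j, ki, kj]
                    # Regular grid position plus the per-tap offset.
                    sample = bilinear_sample(fmap, i + ki - 1 + dy, j + kj - 1 + dx)
                    out[i, j] += sample @ weight[ki, kj]
    return out

rng = np.random.default_rng(0)
H, W, C = 4, 4, 8                        # hypothetical 4x4 token grid, 8 channels
tokens = rng.standard_normal((H * W, C)) # stand-in for one ViT layer's output
masked, idx = mask_token_vectors(tokens, mask_ratio=0.25, rng=rng)

weight = 0.1 * rng.standard_normal((3, 3, C, C))
offsets = 0.5 * rng.standard_normal((H, W, 3, 3, 2))  # assumption: random, not learned
recon = deformable_conv3x3(masked.reshape(H, W, C), weight, offsets).reshape(H * W, C)
print(recon.shape)
```

With all offsets set to zero, `deformable_conv3x3` reduces to an ordinary zero-padded 3x3 convolution, which is a convenient sanity check for the sampling logic.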
Engineering, Electrical & Electronic; Imaging Science & Photographic Technology