EMTrack: Efficient Multimodal Object Tracking

Chang Liu, Ziqi Guan, Simiao Lai, Yang Liu, Huchuan Lu, Dong Wang
DOI: https://doi.org/10.1109/tcsvt.2024.3494725
IF: 5.859
2024-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Multimodal object tracking has received increasing attention, given the limited representation ability of the single RGB modality in certain challenging scenarios. Recent prompt-tuning techniques enable multimodal trackers to inherit knowledge from foundation models trained on large amounts of RGB tracking data and to achieve parameter-efficient training. However, few works focus on efficient inference for multimodal tracking that handles multiple RGB-X (RGB-Thermal, RGB-Depth, RGB-Event, etc.) tracking tasks simultaneously, especially on resource-limited devices such as CPUs. In this work, we propose an efficient multimodal tracker named EMTrack. EMTrack follows a concise and unified multimodal tracking framework with simple knowledge distillation. The RGB and auxiliary modality features are added after the patch-embedding layer for fusion, which brings the computational complexity of multimodal tracking close to that of single-modality tracking. Before the fusion operation, we introduce a modal-specific spatial modulation module that adaptively adjusts the features of each modality in the spatial dimension. Multiple modal-specific experts are adopted to capture information specific to each RGB-X tracking task, which helps a single unified model handle these tasks under joint training. EMTrack achieves competitive performance on various RGB-X tracking benchmarks while reaching a good balance of performance and speed on different platforms. In particular, on an Intel Core i9-10850K CPU, EMTrack runs at 29.1 fps, a real-time speed, with only 2.0G MACs of computation.
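The efficiency argument for additive fusion can be illustrated with a rough cost model. The sketch below is not the authors' code. It assumes a standard ViT-style encoder layer, and the token count and embedding width are hypothetical values chosen only for illustration. It compares the per-layer multiply-accumulate (MAC) count when the auxiliary-modality tokens are summed element-wise with the RGB tokens against a token-concatenation baseline, which is a swapped-in comparison point and not something the abstract describes.

```python
# Rough MAC estimate for one transformer encoder layer, used to compare
# two ways of fusing RGB tokens with auxiliary-modality (X) tokens.
# All sizes below are hypothetical and are NOT EMTrack's configuration.

def layer_macs(num_tokens: int, dim: int, mlp_ratio: int = 4) -> int:
    """Approximate multiply-accumulates of one standard ViT-style layer."""
    qkv_and_proj = 4 * num_tokens * dim * dim        # Q, K, V and output projections
    attention = 2 * num_tokens * num_tokens * dim    # QK^T and attention-weighted V
    mlp = 2 * num_tokens * dim * (mlp_ratio * dim)   # two linear layers of the MLP
    return qkv_and_proj + attention + mlp


def fused_token_count(rgb_tokens: int, x_tokens: int, mode: str) -> int:
    """Number of tokens entering the encoder for a given fusion mode."""
    if mode == "add":      # element-wise sum: shapes must match, count unchanged
        if rgb_tokens != x_tokens:
            raise ValueError("additive fusion needs equal token counts")
        return rgb_tokens
    if mode == "concat":   # token-wise concatenation: the sequence gets longer
        return rgb_tokens + x_tokens
    raise ValueError(f"unknown fusion mode: {mode}")


TOKENS, DIM = 320, 256  # hypothetical token count and embedding width

single = layer_macs(TOKENS, DIM)
added = layer_macs(fused_token_count(TOKENS, TOKENS, "add"), DIM)
concat = layer_macs(fused_token_count(TOKENS, TOKENS, "concat"), DIM)

print(f"single-modality: {single / 1e6:.1f} M MACs per layer")
print(f"additive fusion: {added / 1e6:.1f} M MACs per layer")
print(f"concat fusion:   {concat / 1e6:.1f} M MACs per layer")
```

Under this model, additive fusion leaves the encoder cost identical to the single-modality case, because the sequence length does not change. Concatenation more than doubles it, because the attention term grows quadratically with the number of tokens. The model ignores the extra patch-embedding, modulation, and expert modules, so it says nothing about the 2.0G MACs total reported in the abstract.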