Context-Guided Black-Box Attack for Visual Tracking

Xingsen Huang, Deshui Miao, Hongpeng Wang, Yaowei Wang, Xin Li
DOI: https://doi.org/10.1109/tmm.2024.3382473
IF: 7.3
2024-01-01
IEEE Transactions on Multimedia
Abstract: With the recent advancement of deep neural networks, visual tracking has achieved substantial progress in tracking accuracy. However, the robustness and security of tracking methods built on current deep models have not been thoroughly explored, even though both are critical for real-world applications. In this study, we propose a context-guided black-box attack method to investigate the robustness of recent advanced deep trackers against spatial and temporal interference. For spatial interference, the proposed algorithm generates adversarial target samples by mixing the information of the target object and the similar background regions around it in the embedded feature space of an encoder-decoder model, which evaluates the ability of trackers to handle background distractors. For temporal interference, we use the target state in the previous frame to generate the adversarial sample, which easily fools trackers that rely too heavily on tracking priors, such as the assumption that the appearance change and motion of a target object are small between two consecutive frames. We assess the proposed attack method under both CNN-based and transformer-based tracking frameworks on four diverse datasets: OTB100, VOT2018, GOT-10k, and LaSOT. The experimental results demonstrate that our approach substantially degrades the performance of all these deep trackers across these datasets, even in the black-box attack mode. This reveals the weak robustness of recent deep tracking methods against background distractors and prior dependencies.
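The spatial-interference idea in the abstract, mixing target and background information in the feature space of an encoder-decoder, can be illustrated with a minimal sketch. This is not the authors' implementation: the toy linear autoencoder, the convex mixing rule, and the names `make_toy_autoencoder`, `mix_in_feature_space`, and `alpha` are all assumptions introduced here for illustration. Because the toy model is linear and orthogonal, latent mixing reduces to plain pixel blending; a trained nonlinear encoder-decoder would not behave this way.

```python
import numpy as np


def make_toy_autoencoder(dim, seed=0):
    """Return (encode, decode) for a toy linear autoencoder.

    Assumption: a random orthogonal matrix stands in for a trained
    encoder-decoder, so decode(encode(x)) reproduces x exactly.
    """
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))

    def encode(patch):
        # Flatten the patch and project it into the latent space.
        return patch.reshape(-1) @ q

    def decode(latent, shape):
        # Project back to pixel space and restore the patch shape.
        return (latent @ q.T).reshape(shape)

    return encode, decode


def mix_in_feature_space(target, distractor, encode, decode, alpha=0.5):
    """Blend a target patch with a background distractor in latent space.

    Assumption: a convex combination with weight `alpha` is used as the
    mixing rule; the abstract does not specify how the mixing is done.
    """
    z_target = encode(target)
    z_distractor = encode(distractor)
    z_mix = (1.0 - alpha) * z_target + alpha * z_distractor
    # Decode and keep pixel values in the valid [0, 1] range.
    return np.clip(decode(z_mix, target.shape), 0.0, 1.0)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    target_patch = rng.random((8, 8))      # hypothetical target crop
    distractor_patch = rng.random((8, 8))  # hypothetical similar background crop
    encode, decode = make_toy_autoencoder(target_patch.size)
    adversarial = mix_in_feature_space(
        target_patch, distractor_patch, encode, decode, alpha=0.3
    )
    print(adversarial.shape)
```

In this sketch, `alpha` controls how much distractor information enters the sample: `alpha=0` returns the original target patch and `alpha=1` returns the distractor.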