TraSw: Tracklet-Switch Adversarial Attacks Against Multi-Object Tracking.

Delv Lin, Qi Chen, Chenyu Zhou, Kun He
DOI: https://doi.org/10.1016/j.asoc.2024.111860
IF: 8.7
2024-01-01
Applied Soft Computing
Abstract: Although Multi-Object Tracking (MOT) has made rapid progress, the robustness of MOT trackers has received little attention. Most existing MOT research focuses on pedestrian tracking, yet adversarial attacks on it have rarely been studied, which hinders work on improving the robustness of these systems. Attacking these systems is also challenging, because mature association algorithms have been designed to be robust against errors during tracking. In this work, we analyze the vulnerability of typical pedestrian MOT trackers and propose a novel adversarial attack method, Tracklet-Switch Attack (TraSA), against the complete tracking pipeline. By perturbing very few frames, TraSA can spoof advanced deep pedestrian trackers (i.e., FairMOT and ByteTrack), causing them to fail to track the targets in subsequent frames. Specifically, TraSA learns an effective perturbation generator that makes the tracker confuse intersecting trajectories by attacking very few frames; the error then persists to the end of the sequence without any further perturbation. Our method introduces two new losses: PushPull works on the re-identification (re-ID) branch to perturb pedestrian features, while CenterLeaping works on the detection branch to perturb the detection boxes of two approaching pedestrians so that their trajectories switch. We conduct extensive experiments on three typical MOT-Challenge datasets and two popular trackers to show the superiority of our method. TraSA achieves average attack success rates of 91.58%, 91.05%, and 95.65% on 2DMOT15, MOT17, and MOT20, respectively, outperforming the runner-up by 2.64%, 14.46%, and 22.67%. Meanwhile, it attacks fewer frames, 4.11 on average over all datasets, whereas other methods need at least 6.74 frames on average. Moreover, our method yields a much lower L2 distance.
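The abstract does not give the form of the PushPull loss, so the following is only a minimal sketch of what a push-pull style objective on re-ID embeddings might look like. It assumes cosine similarity as the matching score and a simple "own-tracklet similarity minus other-tracklet similarity" form; the function names `cosine` and `push_pull_loss` and the toy 2-D embeddings are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def push_pull_loss(feat_i, feat_j, track_i, track_j):
    """Hypothetical push-pull objective for two approaching targets i and j.

    feat_*  : re-ID embedding of each detection in the attacked frame
    track_* : appearance embedding the tracker stores for each tracklet

    An attacker would minimize this value: the "push" term lowers each
    detection's similarity to its own tracklet, and the "pull" term raises
    its similarity to the other tracklet. This is an assumed form, not the
    paper's exact loss.
    """
    push = cosine(feat_i, track_i) + cosine(feat_j, track_j)
    pull = cosine(feat_i, track_j) + cosine(feat_j, track_i)
    return push - pull


# Toy embeddings: tracklet a points along x, tracklet b along y.
track_a = [1.0, 0.0]
track_b = [0.0, 1.0]

# Unperturbed detections match their own tracklets, so the loss is high.
clean = push_pull_loss([1.0, 0.0], [0.0, 1.0], track_a, track_b)

# Fully swapped embeddings match the other tracklet, so the loss is low.
swapped = push_pull_loss([0.0, 1.0], [1.0, 0.0], track_a, track_b)
```

In this toy setting, a negative loss means each detection looks more like the other target's tracklet than its own, which is the condition under which an appearance-based association step would be expected to swap the two identities.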