Adaptive Temporal Grouping for Black-box Adversarial Attacks on Videos

Zhipeng Wei, Jingjing Chen, Hao Zhang, Linxi Jiang, Yu-Gang Jiang
DOI: https://doi.org/10.1145/3512527.3531411
2022-01-01
Abstract: Deep-learning-based video models, which achieve remarkable performance on action recognition tasks, have recently been shown to be vulnerable to adversarial samples, even those generated in the black-box setting. However, existing black-box attack methods are impractical for attacking video models in real-world applications because they require a large number of queries. To this end, we propose to boost the efficiency of black-box attacks on video recognition models. Although videos carry rich temporal information, adjacent frames contain redundant spatial information. This motivates us to introduce the adaptive temporal grouping (ATG) method, which groups video frames by the similarity of their features, extracted by an ImageNet-pretrained image model. By selecting one key-frame from each group, ATG lets any black-box attack method optimize the adversarial perturbation over the key-frames instead of all frames, with the estimated gradient of each key-frame shared with the other frames in its group. To balance the efficiency and precision of the estimated gradients, ATG adaptively adjusts the number of groups according to the magnitude of the current perturbation and the current query number. Through extensive experiments on the HMDB-51 and UCF-101 datasets, we demonstrate that ATG can reduce the number of queries by more than 10% for the targeted attack.
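The grouping, key-frame selection, and gradient-sharing steps described in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not specify the grouping rule, so several choices here are assumptions made only for illustration: groups are contiguous in time, boundaries are placed at the least-similar adjacent frame pairs (cosine similarity), the key-frame is the middle frame of each group, and the number of groups is fixed rather than adapted. The function names `group_frames`, `key_frames`, and `share_gradient` are hypothetical and do not come from the paper.

```python
import numpy as np


def group_frames(features, num_groups):
    """Split T frames into contiguous groups (assumed rule, not the paper's).

    features: array of shape (T, D), one feature vector per frame.
    Boundaries are placed after the (num_groups - 1) adjacent frame pairs
    with the lowest cosine similarity.
    """
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    # Cosine similarity between frame t and frame t + 1, shape (T - 1,).
    sim = np.sum(feats[:-1] * feats[1:], axis=1)
    # Indices of the least-similar adjacent pairs, turned into split points.
    cuts = np.sort(np.argsort(sim, kind="stable")[: num_groups - 1]) + 1
    return np.split(np.arange(len(features)), cuts)


def key_frames(groups):
    """Pick the middle frame of each group as its key-frame (an assumption)."""
    return [int(g[len(g) // 2]) for g in groups]


def share_gradient(key_grads, groups, num_frames):
    """Copy each key-frame's estimated gradient to every frame in its group."""
    full = np.zeros((num_frames,) + key_grads.shape[1:], dtype=key_grads.dtype)
    for grad, g in zip(key_grads, groups):
        full[g] = grad
    return full


# Toy example: 6 frames with 2-D features; frames 0-2 and 3-5 look alike.
features = np.array(
    [[1.0, 0.0], [1.0, 0.1], [1.0, 0.05], [0.0, 1.0], [0.1, 1.0], [0.05, 1.0]]
)
groups = group_frames(features, num_groups=2)
keys = key_frames(groups)
# Stand-in for gradients estimated by a black-box attack on the key-frames.
key_grads = np.array([[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]])
full_grads = share_gradient(key_grads, groups, num_frames=len(features))
```

In this sketch a black-box attack would only need to estimate gradients for the two key-frames, and the remaining four frames reuse those estimates. In practice the features would come from a pretrained image model, and the number of groups would be adjusted during the attack as the abstract describes.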