Abstract: Recent years have witnessed increasing interest in adversarial attacks on images, while adversarial attacks on videos have seldom been explored. In this paper, we propose DeepSAVA, a sparse adversarial attack strategy on videos. Our model aims to add a small, human-imperceptible perturbation to the key frame of the input video to fool the classifier. To carry out an effective attack that mirrors real-world scenarios, our algorithm integrates spatial transformation perturbations into the frame. Instead of using the lp norm to gauge the disparity between the perturbed frame and the original frame, we employ the structural similarity index (SSIM), which has been established as a more suitable metric for quantifying image alterations resulting from spatial perturbations. We employ a unified optimisation framework to combine spatial transformation with additive perturbation, thereby attaining a more potent attack. We design an effective and novel optimisation scheme that alternates between Bayesian Optimisation (BO), which identifies the most critical frame in a video, and stochastic gradient descent (SGD) based optimisation, which produces both the additive and the spatially transformed perturbations. Doing so enables DeepSAVA to perform a very sparse attack on videos, maintaining human imperceptibility while still achieving state-of-the-art performance in terms of both attack success rate and adversarial transferability. Furthermore, building upon the strong perturbations produced by DeepSAVA, we design a novel adversarial training framework to improve the robustness of video classification models. Our extensive experiments on various types of deep neural networks and video datasets confirm the superiority of DeepSAVA in terms of attacking performance and efficiency. Compared with the baseline techniques, DeepSAVA achieves the best performance in generating adversarial videos for three distinct video classifiers. Remarkably, it achieves a fooling rate ranging from 99.5% to 100% for the I3D model while perturbing just a single frame. Additionally, DeepSAVA demonstrates favourable transferability across various time series models. The proposed adversarial training strategy is also shown empirically to train more robust video classifiers than state-of-the-art adversarial training with a projected gradient descent (PGD) adversary.
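To make the two central ideas of the abstract concrete (perturbing a single key frame, and measuring the change with SSIM rather than an lp norm), the following is a minimal sketch and not the DeepSAVA implementation. It assumes a simplified SSIM computed from whole-frame statistics instead of the usual sliding window, a random additive perturbation instead of an optimised additive and spatial one, and a fixed key-frame index where the paper selects the frame with BO. The names `global_ssim`, `perturb_key_frame`, `video`, `key_idx` and `delta` are hypothetical.

```python
import numpy as np


def global_ssim(x, y, data_range=1.0):
    """Simplified SSIM from whole-image statistics (assumption: no sliding window)."""
    c1 = (0.01 * data_range) ** 2  # stabiliser for the luminance term
    c2 = (0.03 * data_range) ** 2  # stabiliser for the contrast/structure term
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def perturb_key_frame(video, key_idx, delta):
    """Return a copy of `video` with an additive perturbation on one frame only."""
    adv = video.copy()
    adv[key_idx] = np.clip(adv[key_idx] + delta, 0.0, 1.0)
    return adv


rng = np.random.default_rng(0)
video = rng.random((8, 16, 16))  # hypothetical clip: 8 grayscale frames in [0, 1]
key_idx = 3                      # hypothetical key frame (the paper chooses it with BO)
delta = 0.05 * rng.standard_normal((16, 16))  # random stand-in for an optimised perturbation

adv = perturb_key_frame(video, key_idx, delta)

# Per-frame similarity: only the key frame should differ from the original.
scores = [global_ssim(video[t], adv[t]) for t in range(len(video))]
changed = int(np.any(adv != video, axis=(1, 2)).sum())
print("frames changed:", changed)
print("SSIM of key frame:", scores[key_idx])
```

In a full attack, the similarity score would enter the optimisation objective as a constraint or penalty so that the perturbed key frame stays perceptually close to the original; here it is only evaluated after the fact.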
