SVASTIN: Sparse Video Adversarial Attack via Spatio-Temporal Invertible Neural Networks

Yi Pan, Jun-Jie Huang, Zihan Chen, Wentao Zhao, Ziyue Wang
2024-06-04
Abstract: Robust and imperceptible adversarial video attack is challenging due to the spatial and temporal characteristics of videos. Existing video adversarial attack methods mainly take a gradient-based approach and generate adversarial videos with noticeable perturbations. In this paper, we propose a novel Sparse Adversarial Video Attack via Spatio-Temporal Invertible Neural Networks (SVASTIN) to generate adversarial videos through spatio-temporal feature space information exchanging. It consists of a Guided Target Video Learning (GTVL) module to balance the perturbation budget and optimization speed and a Spatio-Temporal Invertible Neural Network (STIN) module to perform spatio-temporal feature space information exchanging between a source video and the target feature tensor learned by the GTVL module. Extensive experiments on UCF-101 and Kinetics-400 demonstrate that our proposed SVASTIN can generate adversarial examples with higher imperceptibility than state-of-the-art methods while achieving a higher fooling rate. Code is available at https://github.com/Brittany-Chen/SVASTIN.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem this paper addresses is how to generate adversarial video attacks against deep neural networks (DNNs) that are both robust and imperceptible. Existing video adversarial attack methods are mainly gradient-based and produce adversarial videos with noticeable perturbations, which makes it difficult to achieve a high attack success rate and visual imperceptibility at the same time.

### Main problems:
1. **Spatial and temporal characteristics**: Videos have both spatial and temporal dimensions, which makes generating robust and effective adversarial videos challenging.
2. **Limitations of existing methods**: Existing video adversarial attack methods often produce clearly visible perturbations, which harm both the concealment and the success rate of the attack.

### Solutions:
To address these problems, the authors propose **Sparse Video Adversarial Attack via Spatio-Temporal Invertible Neural Networks (SVASTIN)**, which consists of two modules:
1. **Guided Target Video Learning (GTVL) module**:
   - Balances the perturbation budget against optimization speed.
   - Learns a target feature tensor that guides the generation of adversarial videos.
2. **Spatio-Temporal Invertible Neural Network (STIN) module**:
   - Exchanges information in spatio-temporal feature space between the source video and the target feature tensor, using a 3D discrete wavelet transform (3D-DWT) and spatio-temporal affine coupling blocks (ST-ACB) to capture and process spatio-temporal information.
   - Constrains perturbations to the high-frequency coefficients of the 3D-DWT, which improves the imperceptibility of the adversarial videos.
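
The idea of restricting perturbations to high-frequency 3D-DWT coefficients can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a one-level orthonormal Haar wavelet, a single-channel toy clip of shape (T, H, W), random noise standing in for the learned perturbation, and the convention that every subband except `LLL` counts as high-frequency. The helper names (`haar_split`, `haar_merge`, `dwt3d`, `idwt3d`) are hypothetical.

```python
import numpy as np

def haar_split(x, axis):
    """One-level orthonormal Haar step along `axis` (length must be even)."""
    even = np.take(x, np.arange(0, x.shape[axis], 2), axis=axis)
    odd = np.take(x, np.arange(1, x.shape[axis], 2), axis=axis)
    return (even + odd) / np.sqrt(2), (even - odd) / np.sqrt(2)

def haar_merge(low, high, axis):
    """Inverse of haar_split: rebuild and interleave even/odd samples."""
    even = (low + high) / np.sqrt(2)
    odd = (low - high) / np.sqrt(2)
    stacked = np.stack([even, odd], axis=axis + 1)
    shape = list(low.shape)
    shape[axis] *= 2
    return stacked.reshape(shape)

def dwt3d(video):
    """One-level 3D Haar DWT of a (T, H, W) array.

    Returns a dict of 8 subbands keyed 'LLL' ... 'HHH'; the i-th letter
    says whether axis i was low-pass (L) or high-pass (H) filtered.
    """
    bands = {"": video}
    for axis in range(3):
        nxt = {}
        for key, arr in bands.items():
            low, high = haar_split(arr, axis)
            nxt[key + "L"] = low
            nxt[key + "H"] = high
        bands = nxt
    return bands

def idwt3d(bands):
    """Inverse of dwt3d: merge subbands axis by axis, last axis first."""
    for axis in (2, 1, 0):
        bands = {
            key: haar_merge(bands[key + "L"], bands[key + "H"], axis)
            for key in {k[:-1] for k in bands}
        }
    return bands[""]

rng = np.random.default_rng(0)
video = rng.random((8, 16, 16))          # toy clip, assumed grayscale
bands = dwt3d(video)

# Assumption: perturb every subband except the low-frequency 'LLL' one.
# Random noise is only a stand-in for a learned perturbation.
perturbed = {
    k: v if k == "LLL" else v + 0.01 * rng.standard_normal(v.shape)
    for k, v in bands.items()
}
adv = idwt3d(perturbed)
```

Because the transform is invertible, re-analysing `adv` returns the same `LLL` subband as the clean clip, so the coarse, low-frequency content of the video is untouched and only fine detail changes.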
### Experimental results:
On the Kinetics-400 and UCF-101 datasets, the adversarial videos generated by SVASTIN achieve a higher fooling rate and are more visually imperceptible than those of existing methods.

### Summary:
By combining the GTVL and STIN modules, the paper generates adversarial videos that are both robust and imperceptible, improving on existing methods in visual quality and attack effectiveness.
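
The invertibility that affine coupling blocks provide can be shown with a small sketch. It is a generic two-branch affine coupling step, not the paper's ST-ACB: the functions `phi`, `rho`, and `eta` are hypothetical fixed stand-ins for what would normally be learned sub-networks, and the two inputs are toy arrays standing in for the source-video branch and the target-feature branch.

```python
import numpy as np

# Hypothetical stand-ins for learned sub-networks (assumption, not from the paper).
def phi(z):
    return 0.1 * np.tanh(z)

def rho(z):
    return 0.5 * np.tanh(z)   # bounded log-scale keeps exp() well behaved

def eta(z):
    return 0.1 * np.sin(z)

def coupling_forward(x_src, x_tgt):
    """Mix information between the two branches in an invertible way."""
    y_src = x_src + phi(x_tgt)                        # additive update
    y_tgt = x_tgt * np.exp(rho(y_src)) + eta(y_src)   # affine update
    return y_src, y_tgt

def coupling_inverse(y_src, y_tgt):
    """Exactly undo coupling_forward by reversing the two updates."""
    x_tgt = (y_tgt - eta(y_src)) * np.exp(-rho(y_src))
    x_src = y_src - phi(x_tgt)
    return x_src, x_tgt

rng = np.random.default_rng(0)
x_src = rng.standard_normal((4, 8, 8))   # toy source-branch features
x_tgt = rng.standard_normal((4, 8, 8))   # toy target-branch features

y_src, y_tgt = coupling_forward(x_src, x_tgt)
rec_src, rec_tgt = coupling_inverse(y_src, y_tgt)
```

Each update changes one branch using only the other, so the inverse is available in closed form regardless of how complex `phi`, `rho`, and `eta` are; no information is lost when the branches exchange features.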