ReToMe-VA: Recursive Token Merging for Video Diffusion-based Unrestricted Adversarial Attack

Ziyi Gao, Kai Chen, Zhipeng Wei, Tingshu Mou, Jingjing Chen, Zhiyu Tan, Hao Li, Yu-Gang Jiang
2024-08-10
Abstract: Recent diffusion-based unrestricted attacks generate imperceptible adversarial examples with high transferability compared to previous unrestricted attacks and restricted attacks. However, existing works on diffusion-based unrestricted attacks are mostly focused on images yet are seldom explored in videos. In this paper, we propose the Recursive Token Merging for Video Diffusion-based Unrestricted Adversarial Attack (ReToMe-VA), which is the first framework to generate imperceptible adversarial video clips with higher transferability. Specifically, to achieve spatial imperceptibility, ReToMe-VA adopts a Timestep-wise Adversarial Latent Optimization (TALO) strategy that optimizes perturbations in diffusion models' latent space at each denoising step. TALO offers iterative and accurate updates to generate more powerful adversarial frames. TALO can further reduce memory consumption in gradient computation. Moreover, to achieve temporal imperceptibility, ReToMe-VA introduces a Recursive Token Merging (ReToMe) mechanism by matching and merging tokens across video frames in the self-attention module, resulting in temporally consistent adversarial videos. ReToMe concurrently facilitates inter-frame interactions into the attack process, inducing more diverse and robust gradients, thus leading to better adversarial transferability. Extensive experiments demonstrate the efficacy of ReToMe-VA, particularly in surpassing state-of-the-art attacks in adversarial transferability by more than 14.16% on average.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses a gap in diffusion-based unrestricted adversarial attacks: existing methods focus mainly on images, and video has seldom been explored. Although diffusion-based unrestricted attacks can generate imperceptible and highly transferable adversarial examples, most of them are limited to static images and have not been fully applied to video. The paper therefore proposes a new framework, Recursive Token Merging for Video Diffusion-based Unrestricted Adversarial Attack (ReToMe-VA), which aims to generate adversarial video clips that are imperceptible both spatially and temporally and that transfer better across models. Specifically, the paper addresses the following key issues:
1. **Spatial imperceptibility**: The adversarial frames should stay visually close to the original frames. To this end, the paper introduces the **Timestep-wise Adversarial Latent Optimization (TALO)** strategy, which optimizes the perturbation in the diffusion model's latent space at each denoising step, giving iterative and accurate updates.
2. **Temporal imperceptibility**: Perturbing each frame separately leads to discontinuous motion and inconsistency between frames. For this reason, the paper proposes the **Recursive Token Merging (ReToMe)** mechanism. By matching and merging tokens across frames in the self-attention module, it makes the frames share consistent features, which yields temporally consistent adversarial videos.
3. **Memory consumption and computational efficiency**: Computing gradients across the whole denoising process results in high memory usage. TALO performs only one gradient calculation at each timestep, which significantly reduces memory consumption.
4. **Adversarial transferability**: ReToMe brings inter-frame interaction into the attack process, making the gradients more diverse and robust and thereby improving the ability of the adversarial examples to deceive different target models.
In summary, the main objective of the paper is an effective framework that generates adversarial video clips that are both imperceptible and highly transferable, addressing the shortcomings of existing methods for video adversarial attacks.
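The idea of matching and merging tokens across frames can be illustrated with a minimal sketch of a single merge step. This is not the paper's implementation: the function names (`merge_frame_tokens`, `unmerge`), the use of cosine similarity, the averaging rule, and the merge budget `r` are all assumptions chosen for illustration, in the style of generic token merging. The recursive schedule that ReToMe applies across frames is not shown.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def merge_frame_tokens(src, dst, r):
    """Merge the r src tokens that best match a dst token into dst.

    src, dst: lists of token vectors from two frames (hypothetical layout).
    Returns (kept_src, merged_dst, assignment), where assignment maps each
    merged src index to the dst index it was averaged into.
    """
    # Find the most similar dst token for every src token.
    best = []
    for i, s in enumerate(src):
        sims = [cosine(s, d) for d in dst]
        j = max(range(len(dst)), key=sims.__getitem__)
        best.append((sims[j], i, j))
    # Keep only the r most similar (src, dst) pairs.
    best.sort(key=lambda t: t[0], reverse=True)
    assignment = {i: j for _, i, j in best[:r]}
    # Average each dst token with the src tokens assigned to it.
    sums = [list(d) for d in dst]
    counts = [1] * len(dst)
    for i, j in assignment.items():
        sums[j] = [a + b for a, b in zip(sums[j], src[i])]
        counts[j] += 1
    merged_dst = [[v / c for v in s] for s, c in zip(sums, counts)]
    kept_src = [s for i, s in enumerate(src) if i not in assignment]
    return kept_src, merged_dst, assignment

def unmerge(kept_src, merged_dst, assignment, n_src):
    """Restore n_src src tokens; merged ones copy their shared dst token."""
    kept = iter(kept_src)
    return [
        list(merged_dst[assignment[i]]) if i in assignment else next(kept)
        for i in range(n_src)
    ]

# Toy example: 3 tokens in one frame, 2 tokens in another, merge budget r=2.
frame_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
frame_b = [[2.0, 0.0], [0.0, 3.0]]
kept, merged, assign = merge_frame_tokens(frame_a, frame_b, r=2)
restored = unmerge(kept, merged, assign, len(frame_a))
```

After unmerging, the merged tokens of `frame_a` carry the same vectors as their matches in `frame_b`, so attention computed on them sees shared features in both frames. That sharing is the intuition behind the temporal consistency described above.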