Abstract: Recent diffusion-based unrestricted attacks generate imperceptible adversarial examples with higher transferability than previous unrestricted attacks and restricted attacks. However, existing work on diffusion-based unrestricted attacks focuses mostly on images and has seldom explored videos. In this paper, we propose the Recursive Token Merging for Video Diffusion-based Unrestricted Adversarial Attack (ReToMe-VA), which is the first framework to generate imperceptible adversarial video clips with higher transferability. Specifically, to achieve spatial imperceptibility, ReToMe-VA adopts a Timestep-wise Adversarial Latent Optimization (TALO) strategy that optimizes perturbations in the diffusion model's latent space at each denoising step. TALO offers iterative and accurate updates to generate more powerful adversarial frames, and it further reduces memory consumption in gradient computation. Moreover, to achieve temporal imperceptibility, ReToMe-VA introduces a Recursive Token Merging (ReToMe) mechanism that matches and merges tokens across video frames in the self-attention module, resulting in temporally consistent adversarial videos. ReToMe also brings inter-frame interactions into the attack process, inducing more diverse and robust gradients and thus better adversarial transferability. Extensive experiments demonstrate the efficacy of ReToMe-VA, particularly in surpassing state-of-the-art attacks in adversarial transferability by more than 14.16% on average.
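The timestep-wise idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the linear "classifier" `w`, the shrinking `toy_denoise_step`, the sign-gradient update, and all parameter values are assumptions chosen only so the example runs with the Python standard library. It shows the latent being nudged once per denoising step, using a gradient taken at the current step only, so no history of earlier steps has to be kept.

```python
import math

def toy_denoise_step(z):
    # Hypothetical stand-in for one denoising step of a diffusion model.
    return [0.9 * v for v in z]

def toy_loss_and_grad(z, w, label):
    # Toy linear classifier: binary cross-entropy of the true label,
    # with its analytic gradient with respect to the latent z.
    score = sum(wi * zi for wi, zi in zip(w, z))
    p = 1.0 / (1.0 + math.exp(-score))
    loss = -(label * math.log(p) + (1 - label) * math.log(1 - p))
    grad = [(p - label) * wi for wi in w]
    return loss, grad

def sign(x):
    return (x > 0) - (x < 0)

def timestep_wise_attack(z_T, w, label, num_steps=10, step_size=0.05):
    z = list(z_T)
    for _ in range(num_steps):
        # One gradient computation per step, at the current latent only.
        _, grad = toy_loss_and_grad(z, w, label)
        # Move the latent in the direction that increases the loss.
        z = [zi + step_size * sign(gi) for zi, gi in zip(z, grad)]
        z = toy_denoise_step(z)
    return z

def clean_denoise(z_T, num_steps=10):
    z = list(z_T)
    for _ in range(num_steps):
        z = toy_denoise_step(z)
    return z

w = [1.0, -2.0, 0.5]        # assumed classifier weights
z_T = [0.3, -0.1, 0.2]      # assumed starting latent
clean_loss, _ = toy_loss_and_grad(clean_denoise(z_T), w, 1)
adv_loss, _ = toy_loss_and_grad(timestep_wise_attack(z_T, w, 1), w, 1)
```

In this toy setting the attacked latent ends with a higher classification loss than the cleanly denoised one, while each update only ever touches the current latent and its gradient.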

What problem does this paper attempt to address?

The paper addresses a gap in diffusion-based adversarial attacks: existing work focuses mainly on images, and videos remain largely unexplored. Although diffusion-based unrestricted attacks can generate imperceptible and highly transferable adversarial examples, most of these methods are limited to static images and have not been fully applied to video. The paper therefore proposes a new framework, Recursive Token Merging for Video Diffusion-based Unrestricted Adversarial Attack (ReToMe-VA), which aims to generate adversarial video clips that are imperceptible both spatially and temporally and that transfer better across models. Specifically, the paper addresses the following key issues:

1. **Spatial imperceptibility**: Adversarial frames should stay visually close to the original frames. The paper introduces the **Timestep-wise Adversarial Latent Optimization (TALO)** strategy, which optimizes the perturbation in the diffusion model's latent space at each denoising step, giving iterative and accurate updates.
2. **Temporal imperceptibility**: Perturbing each frame separately leads to discontinuous motion and temporal inconsistency. The paper therefore proposes the **Recursive Token Merging (ReToMe)** mechanism. By matching and merging tokens across frames, the self-attention module extracts consistent features, which yields temporally consistent adversarial videos.
3. **Memory consumption and computational efficiency**: Computing gradients through the entire denoising process results in high memory usage. The TALO strategy performs only one gradient calculation at each timestep, which significantly reduces memory consumption.
4. **Adversarial transferability**: The ReToMe mechanism promotes inter-frame interaction, making the gradient updates more diverse and robust and thereby improving the ability of the adversarial examples to deceive different target models.

In summary, the main objective of the paper is an effective framework that generates adversarial video examples that are both imperceptible and highly transferable, addressing the shortcomings of existing video adversarial attacks.
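The cross-frame "match and merge" step described above can be sketched as follows. This is a minimal illustration, not the paper's ReToMe algorithm: it performs a single match-and-merge pass between two frames using cosine similarity and averaging, and it does not reproduce the recursive scheduling that the method's name refers to. The function names `merge_tokens` and `unmerge_tokens`, the parameter `r`, and the toy token values are all hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)

def merge_tokens(src, dst, r):
    """Merge the r source-frame tokens that best match a destination token."""
    matches = []
    for i, s in enumerate(src):
        # Best-matching destination token for this source token.
        j = max(range(len(dst)), key=lambda k: cosine(s, dst[k]))
        matches.append((cosine(s, dst[j]), i, j))
    matches.sort(reverse=True)
    merged_pairs = {i: j for _, i, j in matches[:r]}

    # Average each destination token with the source tokens merged into it.
    sums = [list(d) for d in dst]
    counts = [1] * len(dst)
    for i, j in merged_pairs.items():
        sums[j] = [a + b for a, b in zip(sums[j], src[i])]
        counts[j] += 1
    merged_dst = [[v / c for v in s] for s, c in zip(sums, counts)]

    kept_src = [src[i] for i in range(len(src)) if i not in merged_pairs]
    # Attention would run on this shorter, shared token sequence.
    return kept_src + merged_dst, merged_pairs

def unmerge_tokens(tokens, merged_pairs, n_src):
    """Restore per-frame token lists; merged source tokens copy their match."""
    n_kept = n_src - len(merged_pairs)
    kept, dst = iter(tokens[:n_kept]), tokens[n_kept:]
    src = [dst[merged_pairs[i]] if i in merged_pairs else next(kept)
           for i in range(n_src)]
    return src, dst

frame_a = [[1.0, 0.0], [0.0, 1.0]]   # toy tokens of one frame
frame_b = [[2.0, 0.0], [1.0, 1.0]]   # toy tokens of another frame
tokens, pairs = merge_tokens(frame_a, frame_b, r=1)
restored_a, restored_b = unmerge_tokens(tokens, pairs, n_src=len(frame_a))
```

After unmerging, the merged token of `frame_a` carries the same value as its match in `frame_b`, which is the sense in which the two frames end up sharing features in this sketch.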
