Abstract: Text-to-video diffusion models have made remarkable advancements. Driven by their ability to generate temporally coherent videos, research on zero-shot video editing using these foundation models has expanded rapidly. To enhance editing quality, structural controls are frequently employed in video editing. Among these techniques, cross-attention mask control stands out for its effectiveness and efficiency. However, when cross-attention masks are naively applied to video editing, they can introduce artifacts such as blurring and flickering. Our experiments uncover a critical factor overlooked in previous video editing research: cross-attention masks are not consistently clear but vary with model structure and denoising timestep. To address this issue, we propose Mask Matching Cost (MMC), a metric that quantifies this variability, and FreeMask, a method for selecting optimal masks tailored to specific video editing tasks. Using MMC-selected masks, we further improve the masked fusion mechanism across the full set of attention features, i.e., the temporal, cross-, and self-attention modules. Our approach can be seamlessly integrated into existing zero-shot video editing frameworks and improves their performance, requiring no control assistance or parameter fine-tuning while enabling adaptive decoupling of unedited semantic layouts with mask precision control. Extensive experiments demonstrate that FreeMask achieves superior semantic fidelity, temporal consistency, and editing quality compared to state-of-the-art methods.

### What problems does this paper attempt to solve?

This paper addresses problems that arise when attention masks are used in zero-shot video editing. Specifically, the authors identify the following key issues:

1. **Limitations of existing methods**:
   - Current methods tend to introduce artifacts such as blurring and flickering when cross-attention masks are applied to video editing. These artifacts reduce editing quality.
   - Existing research overlooks the fact that cross-attention masks are not always clear and consistent, but vary with the model structure and the denoising timestep.
2. **Lack of effective quantitative indicators**:
   - There is no effective way to quantify mask quality and select the best attention mask for a specific video editing task.
3. **Deficiencies in the fusion mechanism**:
   - Existing feature fusion mechanisms struggle to find the optimal fusion ratio between the features of the source video and those of the edited video, resulting in either structural distortion or an output nearly identical to the source video.

To solve these problems, the authors propose FreeMask. Its main contributions include:

- **Mask Matching Cost (MMC)**: a metric based on mIoU that quantifies the accuracy of attention masks at different layers and timesteps. With MMC, masks suited to a specific editing task can be selected more accurately.
- **Optimized masked fusion mechanism**: using the MMC-selected masks, the masked fusion mechanism across the attention features (the temporal, cross-, and self-attention modules) is improved, which raises editing quality.
- **No additional control or parameter fine-tuning**: FreeMask can be seamlessly integrated into existing zero-shot video editing frameworks without extra control assistance or parameter fine-tuning, achieving adaptive decoupling of unedited semantic layouts with mask precision control.

Overall, this paper rethinks the importance of attention masks and proposes a new method, FreeMask, to improve the quality and consistency of zero-shot video editing.
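
To make the mask-selection idea concrete, here is a minimal Python sketch. It is not the paper's implementation: the summary above only says that MMC is based on mIoU, so the cost used here (`1 - mIoU`, with mIoU taken as the mean per-frame IoU against a reference mask) is an assumption. The function names, the `(layer, timestep)` keys, and the layer labels are hypothetical and serve only as illustration. The `masked_fusion` helper shows the generic idea of keeping source features outside the mask and edited features inside it, not the specific fusion rule used in FreeMask.

```python
from typing import Dict, List, Sequence, Tuple

Mask = Sequence[Sequence[int]]      # one frame: rows of 0/1 values
Key = Tuple[str, int]               # hypothetical (layer name, timestep) key


def iou(pred: Mask, ref: Mask) -> float:
    """Intersection over union of two binary masks of equal shape."""
    inter = union = 0
    for pred_row, ref_row in zip(pred, ref):
        for p, r in zip(pred_row, ref_row):
            inter += 1 if (p and r) else 0
            union += 1 if (p or r) else 0
    # Two empty masks are treated as a perfect match.
    return 1.0 if union == 0 else inter / union


def mean_iou(pred_frames: Sequence[Mask], ref_frames: Sequence[Mask]) -> float:
    """Mean per-frame IoU over a clip (assumed reading of 'mIoU')."""
    if not pred_frames or len(pred_frames) != len(ref_frames):
        raise ValueError("clips must be non-empty and have the same length")
    scores = [iou(p, r) for p, r in zip(pred_frames, ref_frames)]
    return sum(scores) / len(scores)


def matching_cost(pred_frames: Sequence[Mask], ref_frames: Sequence[Mask]) -> float:
    """Assumed cost: 1 - mIoU, so lower means a better-matching mask."""
    return 1.0 - mean_iou(pred_frames, ref_frames)


def select_mask(
    candidates: Dict[Key, Sequence[Mask]], ref_frames: Sequence[Mask]
) -> Tuple[Key, float]:
    """Return the candidate key with the lowest cost, together with that cost."""
    costs = {key: matching_cost(frames, ref_frames) for key, frames in candidates.items()}
    best = min(costs, key=costs.get)
    return best, costs[best]


def masked_fusion(
    source: Sequence[Sequence[float]],
    edited: Sequence[Sequence[float]],
    mask: Mask,
) -> List[List[float]]:
    """Take edited features where mask is 1 and source features elsewhere."""
    return [
        [e if m else s for s, e, m in zip(src_row, edit_row, mask_row)]
        for src_row, edit_row, mask_row in zip(source, edited, mask)
    ]


# Toy example: one 2x2 frame and two candidate attention masks.
reference = [[[0, 1], [0, 1]]]
candidates = {
    ("up_block", 10): [[[0, 1], [0, 1]]],    # matches the reference exactly
    ("down_block", 40): [[[1, 1], [0, 0]]],  # overlaps in only one pixel
}
best_key, best_cost = select_mask(candidates, reference)
fused = masked_fusion([[1, 2], [3, 4]], [[9, 9], [9, 9]], reference[0])
```

In this toy setup the first candidate is selected because its mask coincides with the reference. In practice the candidates would be binarized attention maps gathered from different layers and denoising timesteps, and the reference mask would have to come from some external source, which this sketch leaves unspecified.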

FreeMask: Rethinking the Importance of Attention Masks for Zero-Shot Video Editing
