FreeMask: Rethinking the Importance of Attention Masks for Zero-Shot Video Editing

Lingling Cai, Kang Zhao, Hangjie Yuan, Yingya Zhang, Shiwei Zhang, Kejie Huang
2024-10-01
Abstract: Text-to-video diffusion models have made remarkable advancements. Driven by their ability to generate temporally coherent videos, research on zero-shot video editing using these fundamental models has expanded rapidly. To enhance editing quality, structural controls are frequently employed in video editing. Among these techniques, cross-attention mask control stands out for its effectiveness and efficiency. However, when cross-attention masks are naively applied to video editing, they can introduce artifacts such as blurring and flickering. Our experiments uncover a critical factor overlooked in previous video editing research: cross-attention masks are not consistently clear but vary with model structure and denoising timestep. To address this issue, we propose the metric Mask Matching Cost (MMC), which quantifies this variability, and FreeMask, a method for selecting optimal masks tailored to specific video editing tasks. Using MMC-selected masks, we further improve the masked fusion mechanism within comprehensive attention features, e.g., temporal, cross-, and self-attention modules. Our approach can be seamlessly integrated into existing zero-shot video editing frameworks with better performance, requiring no control assistance or parameter fine-tuning but enabling adaptive decoupling of unedited semantic layouts with mask precision control. Extensive experiments demonstrate that FreeMask achieves superior semantic fidelity, temporal consistency, and editing quality compared to state-of-the-art methods.
Computer Vision and Pattern Recognition, Multimedia
What problem does this paper attempt to address?
This paper addresses problems that arise when attention masks are used in zero-shot video editing. Specifically, the authors identify the following key issues:

1. **Limitations of existing methods**:
   - Current methods tend to introduce artifacts such as blurring and flickering when cross-attention masks are applied to video editing, which reduces editing quality.
   - Existing research overlooks that cross-attention masks are not always clear and consistent; they vary with model structure and denoising timestep.
2. **Lack of effective quantitative indicators**:
   - There is no effective way to quantify mask quality and select the best attention mask for a specific video editing task.
3. **Deficiencies in the fusion mechanism**:
   - Existing feature fusion mechanisms struggle to find the optimal fusion ratio between source-video and edited-video features, resulting either in structural distortion or in an output nearly identical to the source video.

To solve these problems, the authors propose FreeMask. Its main contributions are:

- **Mask Matching Cost (MMC)**: A metric based on mIoU (mean Intersection over Union) that quantifies the accuracy of attention masks at different layers and timesteps. With MMC, masks suited to a specific editing task can be selected more accurately.
- **Optimized mask fusion mechanism**: Using the MMC-selected masks, the masked fusion mechanism across comprehensive attention features (temporal, cross-, and self-attention modules) is improved, which raises editing quality.
- **No additional control or parameter fine-tuning**: FreeMask can be integrated seamlessly into existing zero-shot video editing frameworks without control assistance or parameter fine-tuning, while enabling adaptive decoupling of unedited semantic layouts with mask precision control.
Overall, this paper rethinks the importance of attention masks and proposes a new method, FreeMask, to improve the quality and consistency of zero-shot video editing.
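The two ideas summarized above, scoring candidate attention masks by layer and timestep and then fusing features under the chosen mask, can be illustrated with a minimal sketch. This is not the paper's MMC formula or fusion rule, neither of which is given in this summary. It assumes a reference segmentation mask is available, scores each candidate by a plain frame-averaged IoU after min-max thresholding, and uses a hard per-pixel switch for fusion. The function names, the layer names `up_block` and `mid_block`, and the 0.5 threshold are all hypothetical.

```python
from typing import Dict, List, Tuple

Grid = List[List[float]]      # one H x W attention map or feature map
Binary = List[List[int]]      # one H x W binary mask (0 or 1)
Key = Tuple[str, int]         # (layer name, denoising timestep), hypothetical labels


def binarize(attn: Grid, threshold: float = 0.5) -> Binary:
    """Min-max normalize an attention map, then threshold it (assumed scheme)."""
    flat = [v for row in attn for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero on constant maps
    return [[1 if (v - lo) / span >= threshold else 0 for v in row] for row in attn]


def iou(pred: Binary, ref: Binary) -> float:
    """Intersection over Union of two binary masks of equal shape."""
    pairs = [(p, r) for pr, rr in zip(pred, ref) for p, r in zip(pr, rr)]
    inter = sum(p & r for p, r in pairs)
    union = sum(p | r for p, r in pairs)
    return inter / union if union else 1.0


def select_mask(candidates: Dict[Key, List[Grid]],
                ref_frames: List[Binary],
                threshold: float = 0.5) -> Tuple[Key, float]:
    """Pick the (layer, timestep) whose masks best match the reference on average."""
    scores = {}
    for key, frames in candidates.items():
        per_frame = [iou(binarize(f, threshold), r) for f, r in zip(frames, ref_frames)]
        scores[key] = sum(per_frame) / len(per_frame)
    best = max(scores, key=scores.get)
    return best, scores[best]


def masked_fusion(source: Grid, edited: Grid, mask: Binary) -> Grid:
    """Keep edited features inside the mask and source features outside it."""
    return [[e if m else s for s, e, m in zip(sr, er, mr)]
            for sr, er, mr in zip(source, edited, mask)]


# Toy example: one 2x2 frame, two candidate attention maps.
ref = [[[0, 0], [1, 1]]]
candidates = {
    ("up_block", 10): [[[0.1, 0.2], [0.9, 0.8]]],   # matches the reference
    ("mid_block", 40): [[[0.9, 0.1], [0.2, 0.8]]],  # diffuse, poor match
}
best_key, best_score = select_mask(candidates, ref)
best_mask = binarize(candidates[best_key][0])
fused = masked_fusion([[1.0, 1.0], [1.0, 1.0]], [[5.0, 5.0], [5.0, 5.0]], best_mask)
```

In this toy case the `up_block` candidate is selected, and the fused map keeps source values in the top row and edited values in the bottom row. A real pipeline would operate on tensors from the attention modules rather than nested lists.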