Abstract:Diffusion-based generative models have recently shown remarkable image and video editing capabilities. However, local video editing, particularly removal of small attributes like glasses, remains a challenge. Existing methods either alter the videos excessively, generate unrealistic artifacts, or fail to perform the requested edit consistently throughout the video. In this work, we focus on consistent and identity-preserving removal of glasses in videos, using it as a case study for consistent local attribute removal in videos. Due to the lack of paired data, we adopt a weakly supervised approach and generate synthetic imperfect data, using an adjusted pretrained diffusion model. We show that despite data imperfection, by learning from our generated data and leveraging the prior of pretrained diffusion models, our model is able to perform the desired edit consistently while preserving the original video content. Furthermore, we exemplify the generalization ability of our method to other local video editing tasks by applying it successfully to facial sticker-removal. Our approach demonstrates significant improvement over existing methods, showcasing the potential of leveraging synthetic data and strong video priors for local video editing tasks.

What problem does this paper attempt to address?

### What problem does this paper attempt to solve? This paper aims to solve the problem of consistent and identity - preserving glasses removal in videos. Specifically, the author focuses on how to consistently remove glasses in videos while keeping the identity of the person and the original content unchanged. This task is challenging because existing methods either overly change the video content and generate unrealistic artifacts, or are unable to perform the required edits consistently throughout the video. #### Main Challenges 1. **Difficulties in Local Video Editing**: Unlike static images, there are motion blurs and complex pose changes in video frames, which make local editing (such as removing small attributes) more difficult. 2. **Lack of Paired Data**: There are no readily available paired datasets that can be used to train a model for directly removing glasses. 3. **Consistency Requirements**: Edits in videos need to maintain temporal consistency, that is, glasses should be removed consistently in all frames and other parts of the content should not be affected. #### Solutions To address these challenges, the author proposes a new method named V - LASIK, and the main steps include: 1. **Synthetic Data Generation**: Due to the lack of paired data, the author uses a pre - trained diffusion model to generate paired video frames with and without glasses. In this way, they create a weakly - supervised dataset. 2. **Model Fine - Tuning**: Based on the generated synthetic data, fine - tune an image - to - image diffusion model to learn how to remove glasses from video frames while maintaining identity and other details. 3. **Video Editing Pipeline**: Combine the trained model with a pre - trained motion module to generate temporally consistent glasses - free videos. #### Technical Highlights - **Cross - Frame Attention Mechanism**: To deal with the problems of glasses reflection and occlusion, the author introduces a cross - frame attention mechanism, allowing the model to aggregate information from multiple reference frames. - **Inside - Out Normalization (ION)**: To solve the problem of inconsistent colors between the generated results and the original frames, the author proposes a new normalization method to make the statistical features inside and outside the mask area more matched. Through these technical means, V - LASIK can significantly improve the performance of existing methods in the video glasses - removal task, achieving better consistency and realism. ### Summary The core problem of this paper is to solve the problem of consistent and identity - preserving glasses removal in videos, and proposes a method that combines synthetic data generation, model fine - tuning and video editing pipeline to achieve this goal.

V-LASIK: Consistent Glasses-Removal from Videos Using Synthetic Data

Temporally Consistent Object Editing in Videos using Extended Attention

Weakly Supervised Glasses Removal

Magic Fixup: Streamlining Photo Editing by Watching Dynamic Videos

Warped Diffusion: Solving Video Inverse Problems with Image Diffusion Models

Automatic Eyeglasses Removal from Face Images

Stitch it in Time: GAN-Based Facial Editing of Real Videos

Semantically Consistent Video Inpainting with Conditional Diffusion Models

SeeClear: Semantic Distillation Enhances Pixel Condensation for Video Super-Resolution

Diffusion Video Autoencoders: Toward Temporally Consistent Face Video Editing via Disentangled Video Encoding

Video ControlNet: Towards Temporally Consistent Synthetic-to-Real Video Translation Using Conditional Image Diffusion Models

VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing

I2VEdit: First-Frame-Guided Video Editing via Image-to-Video Diffusion Models

A Latent Transformer for Disentangled Face Editing in Images and Videos

Detecting and Removing Visual Distractors for Video Aesthetic Enhancement

Long-Term Temporally Consistent Unpaired Video Translation from Simulated Surgical 3D Data

StableVideo: Text-driven Consistency-aware Diffusion Video Editing