Replace Anyone in Videos

Xiang Wang, Changxin Gao, Yuehuan Wang, Nong Sang

2024-09-30

Abstract: Recent advancements in controllable human-centric video generation, particularly with the rise of diffusion models, have demonstrated considerable progress. However, achieving precise and localized control over human motion, e.g., replacing or inserting individuals into videos while exhibiting desired motion patterns, remains challenging. In this work, we propose the ReplaceAnyone framework, which focuses on localizing and manipulating human motion in videos with diverse and intricate backgrounds. Specifically, we formulate this task as an image-conditioned pose-driven video inpainting paradigm, employing a unified video diffusion architecture that facilitates image-conditioned pose-driven video generation and inpainting within masked video regions. Moreover, we introduce diverse mask forms involving regular and irregular shapes to avoid shape leakage and allow granular local control. Additionally, we implement a two-stage training methodology, initially training an image-conditioned pose-driven video generation model, followed by joint training of the video inpainting within masked areas. In this way, our approach enables seamless replacement or insertion of characters while maintaining the desired pose motion and reference appearance within a single framework. Experimental results demonstrate the effectiveness of our method in generating realistic and coherent video content.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses how to replace or insert human characters in localized regions of dynamic video scenes while preserving the desired pose motion and the reference appearance. Existing methods face three challenges:

1. **Precise control**: Existing controllable human video generation methods struggle to control local human motion precisely against complex, dynamic backgrounds.
2. **Seamless fusion**: When a new character is inserted or an existing one is replaced, the new content must blend seamlessly with the background and leave the unmasked area unchanged.
3. **Pose and appearance consistency**: The inserted or replaced character must appear at the specified position, match the visual features of the reference image, and follow the motion given by the driving pose sequence.

To address these challenges, the authors propose **ReplaceAnyone**, a framework that performs local replacement and insertion of human characters in videos through a unified image-conditioned pose-driven video inpainting paradigm. Its main contributions are:

- **Unified framework**: A single framework integrates image-conditioned pose-driven video generation and masked video inpainting, so both tasks are handled together.
- **Diverse mask forms**: Multiple mask forms (precise masks, rectangular-boundary masks, dilated masks, and mixed masks) avoid shape-information leakage and enable fine-grained local control.
- **Two-stage training strategy**: An image-conditioned pose-driven video generation model is trained first, and the video inpainting task is then trained jointly on top of it, which reduces training difficulty and improves model performance.

Together, these components let ReplaceAnyone manipulate human characters locally in dynamic video scenes, which the authors present as evidence of its potential for video synthesis applications.
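The mask forms named above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the square dilation kernel, the `radius` parameter, and the reading of "mixed masks" as a random choice among the other three forms are all assumptions made for illustration. The point is only to show why coarser masks hide the exact silhouette of the person being replaced.

```python
import numpy as np


def bounding_box_mask(mask):
    """Rectangular-boundary mask: fill the tight bounding box of the precise mask."""
    out = np.zeros_like(mask)
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return out  # nothing to cover
    out[ys.min():ys.max() + 1, xs.min():xs.max() + 1] = 1
    return out


def dilate_mask(mask, radius):
    """Dilated mask: grow the region by `radius` pixels using a square kernel."""
    h, w = mask.shape
    padded = np.pad(mask, radius, mode="constant")
    out = np.zeros_like(mask)
    # OR together every shifted copy of the mask inside the square window.
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out


def sample_mask(precise, rng, radius=2):
    """Mixed form (assumed here): pick one of the three forms at random."""
    choice = rng.integers(3)
    if choice == 0:
        return precise.copy()
    if choice == 1:
        return bounding_box_mask(precise)
    return dilate_mask(precise, radius)


# Toy "silhouette": a small cross on an 8x8 frame.
precise = np.zeros((8, 8), dtype=np.uint8)
precise[2:5, 3] = 1
precise[3, 2:6] = 1

box = bounding_box_mask(precise)
dilated = dilate_mask(precise, 1)
mixed = sample_mask(precise, np.random.default_rng(0))
```

Every variant covers the precise silhouette, but the rectangular and dilated forms also cover surrounding pixels, so the mask outline alone no longer reveals the original body shape. Multiplying a frame by `1 - mask` would then give the masked input in which only the covered region has to be regenerated.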


OAW-GAN: Occlusion-Aware Warping GAN for Unified Human Video Synthesis

Video Inpainting for Largely Occluded Moving Human

InVi: Object Insertion In Videos Using Off-the-Shelf Diffusion Models

ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning

ReVideo: Remake a Video with Motion and Content Control

Replacement of Facial Parts in Images.

Human Motion Transfer from Poses in the Wild

SAVE: Protagonist Diversification with Structure Agnostic Video Editing

DreaMoving: A Human Video Generation Framework based on Diffusion Models

Disentangling Foreground and Background Motion for Enhanced Realism in Human Video Generation

Video-based Characters

Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation

Video Inpainting of Complex Scenes

Do as I Do: Pose Guided Human Motion Copy

Video Editing with Temporal, Spatial and Appearance Consistency

VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence

Context-Aware Talking-Head Video Editing

MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers

Customize-A-Video: One-Shot Motion Customization of Text-to-Video Diffusion Models