Replace Anyone in Videos

Xiang Wang, Changxin Gao, Yuehuan Wang, Nong Sang
2024-09-30
Abstract: Recent advancements in controllable human-centric video generation, particularly with the rise of diffusion models, have demonstrated considerable progress. However, achieving precise and localized control over human motion, e.g., replacing or inserting individuals into videos while exhibiting desired motion patterns, remains challenging. In this work, we propose the ReplaceAnyone framework, which focuses on localizing and manipulating human motion in videos with diverse and intricate backgrounds. Specifically, we formulate this task as an image-conditioned pose-driven video inpainting paradigm, employing a unified video diffusion architecture that facilitates image-conditioned pose-driven video generation and inpainting within masked video regions. Moreover, we introduce diverse mask forms involving regular and irregular shapes to avoid shape leakage and allow granular local control. Additionally, we implement a two-stage training methodology, initially training an image-conditioned pose-driven video generation model, followed by joint training of the video inpainting within masked areas. In this way, our approach enables seamless replacement or insertion of characters while maintaining the desired pose motion and reference appearance within a single framework. Experimental results demonstrate the effectiveness of our method in generating realistic and coherent video content.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the problem of accurately replacing or inserting human characters in a local region of a dynamic video scene while preserving the desired pose motion and the reference appearance. Existing methods face three challenges:

1. **Precise control**: Existing controllable human video generation methods struggle to control local human motion precisely against complex, dynamic backgrounds.
2. **Seamless fusion**: When a new character is inserted into a video or replaces an existing one, the new content must blend seamlessly with the background and must not alter the content of the unmasked area.
3. **Pose and appearance consistency**: The inserted or replaced character must appear at the specified position, follow the visual features of the reference image, and perform the motion given by the driving pose sequence.

To address these challenges, the authors propose a framework named **ReplaceAnyone**, which performs local replacement and insertion of human characters in videos through a unified image-conditioned pose-driven video inpainting paradigm. Its main contributions are:

- **Unified framework**: Integrates image-conditioned pose-driven video generation and masked video inpainting, so that a single model handles both tasks.
- **Diverse mask forms**: Designs multiple mask forms (such as precise masks, rectangular-boundary masks, dilated masks, and mixed masks) to avoid shape-information leakage and achieve fine-grained local control.
- **Two-stage training strategy**: First trains an image-conditioned pose-driven video generation model, then jointly trains the video inpainting task on top of it, which reduces training difficulty and improves model performance.
Together, these components allow ReplaceAnyone to replace or insert human characters in dynamic video scenes within a single framework, and the paper's experiments report realistic and coherent generated video.
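To make the mask forms and the "unmasked area stays unchanged" requirement more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names (`bounding_box_mask`, `dilate_mask`, `sample_mask`, `composite`), the square dilation kernel, the `max_radius` parameter, and the reading of "mixed masks" as a random choice among the other forms are all hypothetical. The compositing step is the generic inpainting convention of keeping original pixels outside the mask; the paper summary does not say how ReplaceAnyone enforces this internally.

```python
import numpy as np


def bounding_box_mask(mask):
    """Rectangular-boundary mask: fill the tight bounding box of the person."""
    out = np.zeros_like(mask)
    ys, xs = np.nonzero(mask)
    if ys.size == 0:  # no person in this frame
        return out
    out[ys.min():ys.max() + 1, xs.min():xs.max() + 1] = 1
    return out


def dilate_mask(mask, radius):
    """Dilated mask: grow the person mask by `radius` pixels.

    Uses a (2*radius+1) x (2*radius+1) square neighborhood (an assumption;
    any structuring element would serve the same purpose).
    """
    h, w = mask.shape
    padded = np.pad(mask, radius, mode="constant")
    out = np.zeros_like(mask)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out


def sample_mask(mask, rng, max_radius=8):
    """Mixed masks, interpreted here as picking one mask form at random."""
    choice = int(rng.integers(3))
    if choice == 0:
        return mask.copy()                      # precise mask
    if choice == 1:
        return bounding_box_mask(mask)          # rectangular-boundary mask
    radius = int(rng.integers(1, max_radius + 1))
    return dilate_mask(mask, radius)            # dilated mask


def composite(original, generated, mask):
    """Keep original pixels outside the mask, generated pixels inside it."""
    inside = mask[..., None].astype(bool)       # broadcast over color channels
    return np.where(inside, generated, original)


# Usage on a toy 8x8 frame with a small cross-shaped "person".
person = np.zeros((8, 8), dtype=np.uint8)
person[2:6, 3] = 1
person[3, 2:5] = 1

rng = np.random.default_rng(0)
train_mask = sample_mask(person, rng)

frame = np.zeros((8, 8, 3), dtype=np.float32)      # stand-in for a video frame
fake_output = np.ones((8, 8, 3), dtype=np.float32)  # stand-in for model output
result = composite(frame, fake_output, train_mask)
```

The coarser forms (bounding box, dilation) hide the exact silhouette of the original person, which is one plausible way to read the summary's point about avoiding shape-information leakage: the model cannot infer the new character's outline from the mask boundary alone.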