AMG: Avatar Motion Guided Video Generation

Zhangsihao Yang, Mengyi Shan, Mohammad Farazi, Wenhui Zhu, Yanxi Chen, Xuanzhao Dong, Yalin Wang
2024-09-03
Abstract: The task of human video generation has gained significant attention with the advancement of deep generative models. Generating realistic videos with human movements is inherently challenging, due to the intricacies of human body topology and sensitivity to visual artifacts. The extensively studied 2D media generation methods take advantage of massive human media datasets, but struggle with 3D-aware control, whereas 3D avatar-based approaches, while offering more freedom in control, lack photorealism and cannot be harmonized seamlessly with background scenes. We propose AMG, a method that combines 2D photorealism and 3D controllability by conditioning video diffusion models on controlled renderings of 3D avatars. We additionally introduce a novel data processing pipeline that reconstructs and renders human avatar movements from dynamic camera videos. AMG is the first method that enables multi-person diffusion video generation with precise control over camera positions, human motions, and background style. We also demonstrate through extensive evaluation that it outperforms existing human video generation methods conditioned on pose sequences or driving videos in terms of realism and adaptability.
Computer Vision and Pattern Recognition, Artificial Intelligence, Graphics
What problem does this paper attempt to address?
This paper addresses how to achieve high controllability and photorealistic quality at the same time when generating realistic human videos. Existing 2D methods can exploit large-scale human media datasets but are limited in 3D-aware control, while 3D methods offer more degrees of freedom for control but lack photorealism and are difficult to integrate seamlessly with background scenes. The paper therefore proposes AMG (Avatar Motion Guided Video Generation), which aims to combine the realism of 2D methods with the controllability of 3D methods, generating multi-person videos with a conditional video diffusion model while precisely controlling camera positions, human motions, and background styles.

### Main Contributions

1. **Proposing a new training paradigm**: This paradigm combines the advantages of 2D pre-trained generative models and 3D animatable avatars, achieving high realism and fine-grained controllability.
2. **Introducing a data collection pipeline**: This pipeline extracts 3D motion and camera information from 2D human videos and renders the corresponding 3D avatars as conditional signals. The research team plans to release the processed dataset.
3. **Proposing an efficient fine-tuning method**: This method adapts pre-trained text-to-video models by efficiently fine-tuning video-conditional parameters.
4. **Verifying the effectiveness, robustness, and superiority of the method against multiple baselines**: Controllability and photorealistic quality are evaluated using CLIP scores and motion accuracy.

### Method Overview

1. **Data Generation**:
   - **Extracting SMPL Poses**: Use TRACE to detect, track, and extract SMPL parameters from videos.
   - **Generating Appearance Descriptions**: Use LLaVA to generate text prompts that describe the appearance of characters in videos.
   - **Animating 3DGS**: Use HumanGaussian to generate 3D avatars and drive these avatars according to the extracted SMPL parameters.
   - **Generating Scene Descriptions**: Use LLaVA to generate text prompts that describe video scenes.
2. **Video-conditional Model Fine-tuning**:
   - **Constructing a Conditional Diffusion Model**: Based on ModelScopeT2V, introduce an additional condition \( z_a = E(V_a) \) and fine-tune the model through LoRA.
   - **Fine-tuning Strategy**: Use the pre-trained ModelScopeT2V to initialize weights, freeze non-LoRA components, and only update LoRA components.
3. **Inference**:
   - **Generating Interactive Actions**: Use InterGen to generate SMPL actions for multi-person interactions.
   - **Rendering 3D Avatars**: Generate 3D avatars according to the generated actions and appearance prompts, and control the camera to render detailed actions.
   - **Generating Videos**: Use the rendered motion frames as conditions to generate realistic videos that match the appearance and actions.

### Experimental Results

- **Qualitative Evaluation**: Demonstrates the model's ability to generate novel actions outside the scope of the training data, control camera movements, and modify backgrounds.
- **Quantitative Evaluation**: The method is evaluated using CLIP scores and motion fidelity scores, and the results show that AMG is superior to existing methods in terms of controllability and realism.

In conclusion, by combining the advantages of 2D and 3D methods, this paper proposes a new video generation method, AMG, which achieves a high degree of controllability while generating realistic videos.
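The fine-tuning strategy in the method overview (initialize from pre-trained weights, freeze the non-LoRA components, update only the LoRA components) can be illustrated with a toy example. The sketch below is not the authors' implementation and does not use ModelScopeT2V. It is a minimal pure-Python linear layer under the standard LoRA formulation, y = W x + (alpha / r) B A x. The class name `LoRALinear`, the helper `matvec`, the squared-error loss, and all hyperparameters are assumptions made for illustration.

```python
import random


def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


class LoRALinear:
    """Toy layer y = W x + (alpha / r) * B (A x); W is frozen (hypothetical helper)."""

    def __init__(self, W, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = [row[:] for row in W]  # frozen pre-trained weight, never updated
        # A starts small and random, B starts at zero, so the layer
        # initially reproduces the pre-trained output exactly.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.W, x)
        low_rank = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * l for b, l in zip(base, low_rank)]

    def loss(self, x, target):
        """Squared error 0.5 * ||forward(x) - target||^2 (assumed toy objective)."""
        return 0.5 * sum((y - t) ** 2 for y, t in zip(self.forward(x), target))

    def sgd_step(self, x, target, lr=0.05):
        """One gradient step on the toy loss, updating A and B only."""
        h = matvec(self.A, x)  # rank-dimensional activation A x
        err = [y - t for y, t in zip(self.forward(x), target)]  # dL/dy
        # dL/dB[i][k] = scale * err[i] * h[k]
        grad_B = [[self.scale * e * h_k for h_k in h] for e in err]
        # dL/dh[k] = scale * sum_i B[i][k] * err[i];  dL/dA[k][j] = dL/dh[k] * x[j]
        grad_h = [
            self.scale * sum(self.B[i][k] * err[i] for i in range(len(err)))
            for k in range(len(h))
        ]
        grad_A = [[g * x_j for x_j in x] for g in grad_h]
        self.B = [[b - lr * g for b, g in zip(rb, rg)] for rb, rg in zip(self.B, grad_B)]
        self.A = [[a - lr * g for a, g in zip(ra, rg)] for ra, rg in zip(self.A, grad_A)]


# Usage: adapt an identity "pre-trained" layer toward a new target.
W_pretrained = [[1.0, 0.0], [0.0, 1.0]]
layer = LoRALinear(W_pretrained, rank=1)
x, target = [1.0, 2.0], [2.0, 2.0]
initial_output = layer.forward(x)  # equals W x because B is zero
initial_loss = layer.loss(x, target)
for _ in range(200):
    layer.sgd_step(x, target)
final_loss = layer.loss(x, target)
```

Because only `A` and `B` change, the pre-trained weight `W` stays intact and the adapter can be stored or discarded separately. In a real setup the same idea would be applied to selected layers of the diffusion model using an autograd framework rather than hand-written gradients.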