Abstract: Recent advancements in human video synthesis have enabled the generation of high-quality videos through the application of stable diffusion models. However, existing methods predominantly concentrate on animating solely the human element (the foreground) guided by pose information, while leaving the background entirely static. Contrary to this, in authentic, high-quality videos, backgrounds often dynamically adjust in harmony with foreground movements, eschewing stagnancy. We introduce a technique that concurrently learns both foreground and background dynamics by segregating their movements using distinct motion representations. Human figures are animated leveraging pose-based motion, capturing intricate actions. Conversely, for backgrounds, we employ sparse tracking points to model motion, thereby reflecting the natural interaction between foreground activity and environmental changes. Training on real-world videos enhanced with this innovative motion depiction approach, our model generates videos exhibiting coherent movement in both foreground subjects and their surrounding contexts. To further extend video generation to longer sequences without accumulating errors, we adopt a clip-by-clip generation strategy, introducing global features at each step. To ensure seamless continuity across these segments, we ingeniously link the final frame of a produced clip with input noise to spawn the succeeding one, maintaining narrative flow. Throughout the sequential generation process, we infuse the feature representation of the initial reference image into the network, effectively curtailing any cumulative color inconsistencies that may otherwise arise. Empirical evaluations attest to the superiority of our method in producing videos that exhibit harmonious interplay between foreground actions and responsive background dynamics, surpassing prior methodologies in this regard.

What problem does this paper attempt to address?

The paper aims to address two major issues in the field of video generation:

1. **Separation and Fusion of Dynamic Background and Foreground**: Existing video generation methods typically focus on animating the foreground (such as people) using pose information, while ignoring the dynamic changes in the background. This results in generated videos where the background appears overly static and lacks realism. The paper addresses this by separating the motion representations of the foreground and background. Specifically, foreground motion is modeled using pose information, while background motion is captured using sparse tracking points. This enables the model to learn natural interactions between the foreground and background, generating videos with harmonious foreground actions and responsive background dynamics.

2. **Accumulated Error Problem in Long Sequence Video Generation**: To generate longer video sequences without accumulating errors, the paper proposes a clip-by-clip generation strategy. The video is divided into multiple short clips that are generated sequentially, with global features introduced at each step. In practice, the model combines the last frame of the previous clip with input noise to generate the next clip, ensuring the coherence and smoothness of the entire video. Additionally, to maintain visual consistency, the feature representation of the initial reference image is injected throughout the generation process, which limits cumulative color inconsistency.

In summary, this research improves the quality and realism of video generation by separately modeling the motions of the foreground and background, and by proposing a framework for long video generation. It addresses the two major challenges of static backgrounds and accumulated errors present in existing methods.
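The clip-by-clip strategy described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `generate_long_video`, `make_toy_denoiser`, and the `Frame` type are hypothetical, frames are toy lists of floats rather than latent tensors, and "linking" the previous clip's last frame with the noise is assumed here to mean prepending it to the noise sequence. The denoiser is passed in as a callable because the abstract does not specify the network.

```python
import random
from typing import Callable, List, Optional, Sequence

Frame = List[float]  # toy stand-in for one latent frame (assumption)


def generate_long_video(
    reference_features: Frame,
    pose_clips: Sequence[Sequence[Frame]],
    track_clips: Sequence[Sequence[Frame]],
    denoise_clip: Callable[..., List[Frame]],
    seed: int = 0,
) -> List[Frame]:
    """Generate a long video one clip at a time (hypothetical sketch)."""
    rng = random.Random(seed)
    dim = len(reference_features)
    video: List[Frame] = []
    prev_last: Optional[Frame] = None
    for poses, tracks in zip(pose_clips, track_clips):
        # Fresh Gaussian noise, one toy frame per pose in this clip.
        noise = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in poses]
        # Assumed linking step: prepend the previous clip's last frame.
        model_input = noise if prev_last is None else [prev_last] + noise
        # The same reference features are supplied at every step, standing in
        # for the global feature injection that limits color drift.
        clip = denoise_clip(model_input, poses, tracks, reference_features)
        video.extend(clip)
        prev_last = clip[-1]
    return video


def make_toy_denoiser(log: List[int]) -> Callable[..., List[Frame]]:
    """Placeholder denoiser that only blends the anchor with the reference."""
    def toy(model_input, poses, tracks, reference_features):
        log.append(len(model_input))  # record input length for inspection
        has_anchor = len(model_input) == len(poses) + 1
        anchor = model_input[0] if has_anchor else reference_features
        frame = [0.5 * (a + r) for a, r in zip(anchor, reference_features)]
        return [list(frame) for _ in poses]
    return toy


call_log: List[int] = []
reference = [1.0, 2.0]
pose_clips = [[[0.0]] * 3, [[0.0]] * 3]    # two clips of three poses each
track_clips = [[[0.0]] * 3, [[0.0]] * 3]   # matching sparse-track inputs
video = generate_long_video(
    reference, pose_clips, track_clips, make_toy_denoiser(call_log)
)
```

In this sketch the foreground (pose) and background (sparse track) conditions stay separate arguments, mirroring the disentangled motion representations, and only the first clip is generated without an anchor frame.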

Disentangling Foreground and Background Motion for Enhanced Realism in Human Video Generation
