Abstract: Recent advances in human video synthesis have enabled the generation of high-quality videos using stable diffusion models. However, existing methods mostly animate only the human (the foreground) under pose guidance and leave the background entirely static. In real, high-quality videos, by contrast, the background often moves in harmony with the foreground rather than remaining still. We introduce a technique that learns foreground and background dynamics jointly by separating their movements with distinct motion representations. Human figures are animated with pose-based motion, which captures intricate actions. For backgrounds, we model motion with sparse tracking points, reflecting the natural interaction between foreground activity and environmental changes. Trained on real-world videos enhanced with this motion representation, our model generates videos with coherent movement in both the foreground subjects and their surroundings. To extend video generation to longer sequences without accumulating errors, we adopt a clip-by-clip generation strategy that introduces global features at each step. To ensure seamless continuity across clips, we link the final frame of each generated clip with the input noise to generate the next one. Throughout the sequential generation process, we inject the feature representation of the initial reference image into the network, which curbs the cumulative color inconsistencies that may otherwise arise. Empirical evaluations show that our method outperforms prior approaches in producing videos with a harmonious interplay between foreground actions and responsive background dynamics.
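The clip-by-clip strategy described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `encode_reference`, `toy_denoiser`, and `generate_long_video` are hypothetical names, the "denoiser" is a trivial stand-in for a diffusion model, linking the last frame with the noise is assumed here to mean concatenation along the time axis, and conditioning the first clip on the reference image itself is also an assumption. Pose and tracking-point conditioning are omitted for brevity.

```python
import numpy as np

def encode_reference(reference_image):
    """Hypothetical reference encoder: per-channel mean color, shape (C,)."""
    return reference_image.mean(axis=(0, 1))

def toy_denoiser(model_input, ref_feat):
    """Placeholder for a diffusion model (assumption, not the paper's network).

    model_input[0] is the conditioning frame; model_input[1:] is the noise.
    Returns a clip with the same length as the noise.
    """
    cond_frame, noise = model_input[0], model_input[1:]
    # Blend the conditioning frame with the global reference feature so that
    # colors are pulled back toward the reference at every step.
    return 0.5 * cond_frame + 0.5 * ref_feat + 0.01 * noise

def generate_long_video(reference_image, num_clips, clip_len, denoiser, seed=0):
    rng = np.random.default_rng(seed)
    h, w, c = reference_image.shape
    ref_feat = encode_reference(reference_image)  # computed once, reused per clip
    prev_last = reference_image  # assumption: first clip starts from the reference
    clips = []
    for _ in range(num_clips):
        noise = rng.standard_normal((clip_len, h, w, c))
        # Link the previous clip's final frame with the fresh input noise.
        model_input = np.concatenate([prev_last[None], noise], axis=0)
        clip = denoiser(model_input, ref_feat)
        clips.append(clip)
        prev_last = clip[-1]  # carry the last frame into the next clip
    return np.concatenate(clips, axis=0)

reference = np.full((4, 4, 3), 0.5)
video = generate_long_video(reference, num_clips=3, clip_len=5, denoiser=toy_denoiser)
print(video.shape)
```

In this sketch the reference feature is computed once and passed to every clip, which mirrors the abstract's point that a global feature injected at each step keeps color drift from accumulating across clips.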

Deep Video Generation, Prediction and Completion of Human Action Sequences

OAW-GAN: Occlusion-Aware Warping GAN for Unified Human Video Synthesis

Action2video: Generating Videos of Human 3D Actions

Pose Guided Human Video Generation

FutureHuman3D: Forecasting Complex Long-Term 3D Human Behavior from Video Observations

Deep Gesture Video Generation with Learning on Regions of Interest

Adaptive Hierarchical Motion-Focused Model for Video Prediction

Action-conditioned video data improves predictability

Deep Generative Modelling of Human Reach-and-Place Action

Disentangling Foreground and Background Motion for Enhanced Realism in Human Video Generation

Conditional Temporal Variational AutoEncoder for Action Video Prediction

Long-Term Human Video Generation of Multiple Futures Using Poses

Music Conditioned Generation for Human-Centric Video

Human Motion Transfer from Poses in the Wild

Pose-guided Generative Adversarial Net for Novel View Action Synthesis

Human Action Generation with Generative Adversarial Networks

Do As I Do: Pose Guided Human Motion Copy

Learning a Generative Model for Multi‐Step Human‐Object Interactions from Videos

3D Human motion anticipation and classification

MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance