Abstract: Human image animation generates videos from a character photo under user control, unlocking potential for video and movie production. While recent approaches yield impressive results using high-quality training data, the inaccessibility of these datasets hampers fair and transparent benchmarking. Moreover, these approaches prioritize 2D human motion and overlook the significance of camera motion in videos, leading to limited control and unstable video generation. To demystify the training data, we present HumanVid, the first large-scale high-quality dataset tailored for human image animation, which combines crafted real-world and synthetic data. For the real-world data, we compile a vast collection of copyright-free real-world videos from the internet. A carefully designed rule-based filtering strategy ensures that only high-quality videos are included, resulting in a collection of 20K human-centric videos in 1080P resolution. Human and camera motion are annotated using a 2D pose estimator and a SLAM-based method. For the synthetic data, we gather 2,300 copyright-free 3D avatar assets to augment existing available 3D assets. Notably, we introduce a rule-based camera trajectory generation method, enabling the synthetic pipeline to incorporate diverse and precise camera motion annotation, which can rarely be found in real-world data. To verify the effectiveness of HumanVid, we establish a baseline model named CamAnimate, short for Camera-controllable Human Animation, that considers both human and camera motions as conditions. Through extensive experimentation, we demonstrate that this simple baseline, trained on HumanVid, achieves state-of-the-art performance in controlling both human pose and camera motions, setting a new benchmark. Code and data will be publicly available at https://github.com/zhenzhiwang/HumanVid/.

EditHuman: Fine-Grained Text-Driven Human Video Editing

UniEdit: A Unified Tuning-Free Framework for Video Motion and Appearance Editing

UniHuman: A Unified Model for Editing Human Images in the Wild

InstructHumans: Editing Animated 3D Human Textures with Instructions

Edit-Your-Motion: Space-Time Diffusion Decoupling Learning for Video Motion Editing

DeCo: Decoupled Human-Centered Diffusion Video Editing with Motion Consistency

DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and View-Change Human-Centric Video Editing

Temporally Consistent Object Editing in Videos using Extended Attention

Context-Aware Talking-Head Video Editing

AvatarStudio: Text-driven Editing of 3D Dynamic Human Head Avatars

MagicStick: Controllable Video Editing via Control Handle Transformations

EVA: Zero-shot Accurate Attributes and Multi-Object Video Editing

UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation

A Robust Interactive Facial Animation Editing System

Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models

Editing like Humans: A Contextual, Multimodal Framework for Automated Video Editing

Zero-shot Text-driven Physically Interpretable Face Editing

HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation

HumanRefiner: Benchmarking Abnormal Human Generation and Refining with Coarse-to-fine Pose-Reversible Guidance

Neural Animation and Reenactment of Human Actor Videos