Abstract: Human image animation involves generating videos from a character photo, allowing user control and unlocking potential for video and movie production. While recent approaches yield impressive results using high-quality training data, the inaccessibility of these datasets hampers fair and transparent benchmarking. Moreover, these approaches prioritize 2D human motion and overlook the significance of camera motion in videos, leading to limited control and unstable video generation. To demystify the training data, we present HumanVid, the first large-scale high-quality dataset tailored for human image animation, which combines curated real-world and synthetic data. For the real-world data, we compile a large collection of copyright-free videos from the internet. A carefully designed rule-based filtering strategy retains only high-quality videos, resulting in a collection of 20K human-centric videos in 1080P resolution. Human and camera motion are annotated using a 2D pose estimator and a SLAM-based method. For the synthetic data, we gather 2,300 copyright-free 3D avatar assets to augment existing 3D assets. Notably, we introduce a rule-based camera trajectory generation method, enabling the synthetic pipeline to incorporate diverse and precise camera motion annotations, which can rarely be found in real-world data. To verify the effectiveness of HumanVid, we establish a baseline model named CamAnimate, short for Camera-controllable Human Animation, that takes both human and camera motion as conditions. Through extensive experiments, we demonstrate that training this simple baseline on HumanVid achieves state-of-the-art performance in controlling both human pose and camera motion, setting a new benchmark. Code and data will be publicly available at https://github.com/zhenzhiwang/HumanVid/.
