ViMo: Generating Motions from Casual Videos

Liangdong Qiu, Chengxing Yu, Yanran Li, Zhao Wang, Haibin Huang, Chongyang Ma, Di Zhang, Pengfei Wan, Xiaoguang Han

2024-08-13

Abstract: Although humans have the innate ability to imagine multiple possible actions from videos, it remains an extraordinary challenge for computers due to the intricate camera movements and montages. Most existing motion generation methods rely predominantly on manually collected motion datasets, usually tediously sourced from motion capture (Mocap) systems or multi-view cameras, unavoidably resulting in a limited size that severely undermines their generalizability. Inspired by recent advances in diffusion models, we probe a simple and effective way to capture motions from videos and propose a novel Video-to-Motion-Generation framework (ViMo) which could leverage the immense trove of untapped video content to produce abundant and diverse 3D human motions. Distinct from prior work, our videos could be more casual, including complicated camera movements and occlusions. Striking experimental results demonstrate that the proposed model could generate natural motions even for videos where rapid movements, varying perspectives, or frequent occlusions might exist. We also show that this work could enable three important downstream applications, such as generating dancing motions according to arbitrary music and source video style. Extensive experimental results prove that our model offers an effective and scalable way to generate diverse and realistic motions. Code and demos will be public soon.

Computer Vision and Pattern Recognition, Multimedia

What problem does this paper attempt to address?

The paper primarily aims to address the following issues:

1. **Generating diverse and realistic 3D human motions**: Existing motion generation methods mostly rely on manually collected motion datasets, which are typically sourced from expensive motion capture systems or multi-view cameras, resulting in limited dataset sizes and insufficient generalization capabilities.
2. **Generating motions using video resources**: Compared to professional motion capture data, video resources are more abundant and diverse. However, the complexity and variability of video content (such as complex camera movements and occlusions) make it challenging to extract high-quality 3D motions from videos.
3. **Expanding the diversity and flexibility of motion generation**: Most existing motion generation methods are constrained by the specific domains and limited categories in the training datasets, making it difficult to generate diverse motions.

To address the above issues, the paper proposes a new framework called ViMo (Video-to-Motion Generation), which can generate diverse and realistic 3D human motions directly from "casual" videos containing complex camera movements and occlusions. Specifically, ViMo adopts a diffusion model approach, which can generate infinitely complex and realistic 3D motions consistent with the input video content context without explicitly estimating the camera position. Additionally, the paper demonstrates the potential of ViMo in three important downstream applications: building large-scale motion datasets, achieving dance style transfer with a few examples, and video-guided motion completion.

In summary, this research aims to expand the diversity and realism of motion generation by utilizing abundant video resources through a simple and effective method, thereby addressing the limitations of existing methods.
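To make the idea of a diffusion model conditioned on video content more concrete, here is a minimal sketch of standard DDPM-style ancestral sampling. It is not ViMo's implementation: the summary above does not describe the network, the motion representation, or the noise schedule, so `toy_denoiser`, `make_schedule`, `sample_motion`, the flat list used for the motion vector, and the linear schedule are all assumptions chosen for illustration.

```python
import math
import random


def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumption) and cumulative products of alpha."""
    betas = [beta_start + (beta_end - beta_start) * i / (num_steps - 1)
             for i in range(num_steps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, prod = [], 1.0
    for a in alphas:
        prod *= a
        alpha_bars.append(prod)
    return betas, alphas, alpha_bars


def toy_denoiser(x_t, t, video_features):
    """Hypothetical stand-in for a trained noise-prediction network.

    A real model would be a neural network conditioned on video features;
    this placeholder treats the offset from the features as the noise.
    """
    return [x - v for x, v in zip(x_t, video_features)]


def sample_motion(video_features, num_steps=50, seed=0):
    """Run the reverse diffusion loop, starting from Gaussian noise."""
    rng = random.Random(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    # Start from pure noise with the same length as the conditioning vector.
    x = [rng.gauss(0.0, 1.0) for _ in video_features]
    for t in reversed(range(num_steps)):
        eps_hat = toy_denoiser(x, t, video_features)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        # No noise is added at the final step (t == 0).
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0
        x = [(xi - coef * ei) / math.sqrt(alphas[t]) + sigma * rng.gauss(0.0, 1.0)
             for xi, ei in zip(x, eps_hat)]
    return x
```

Because sampling starts from random noise, different seeds yield different motion vectors for the same conditioning input, which is the property that lets a diffusion model propose multiple plausible motions for one video.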
