MIMO: Controllable Character Video Synthesis with Spatial Decomposed Modeling

Yifang Men,Yuan Yao,Miaomiao Cui,Liefeng Bo

2024-09-24

Abstract: Character video synthesis aims to produce realistic videos of animatable characters within lifelike scenes. It is a fundamental problem in computer vision and graphics. 3D methods typically require multi-view captures for per-case training, which severely limits their applicability to modeling arbitrary characters in a short time. Recent 2D methods break this limitation via pre-trained diffusion models, but they struggle with pose generality and scene interaction. To this end, we propose MIMO, a novel framework that not only synthesizes character videos with controllable attributes (i.e., character, motion and scene) provided by simple user inputs, but also achieves scalability to arbitrary characters, generality to novel 3D motions, and applicability to interactive real-world scenes within a single framework. The core idea is to encode the 2D video into compact spatial codes, taking into account the inherently 3D nature of the scenes that video captures. Concretely, we lift the 2D frame pixels into 3D using monocular depth estimators, and decompose the video clip into three spatial components (i.e., main human, underlying scene, and floating occlusion) in hierarchical layers based on the 3D depth. These components are further encoded into a canonical identity code, a structured motion code and a full scene code, which are used as control signals for the synthesis process. This spatially decomposed modeling enables flexible user control, complex motion expression, and 3D-aware synthesis for scene interactions. Experimental results demonstrate the effectiveness and robustness of the proposed method.
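The "lifting" step mentioned in the abstract can be illustrated with standard pinhole back-projection. This is only a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify a camera model, so the intrinsics `fx`, `fy`, `cx`, `cy` and the function name `backproject_depth` are hypothetical, and the depth map is assumed to come from some monocular depth estimator with smaller values meaning closer to the camera.

```python
import numpy as np

def backproject_depth(depth, fx, fy, cx, cy):
    """Lift every pixel (u, v) with depth z to a 3D camera-space point.

    Standard pinhole model (an assumption, not taken from the paper):
        X = (u - cx) * z / fx,  Y = (v - cy) * z / fy,  Z = z
    """
    h, w = depth.shape
    # u indexes columns, v indexes rows; both arrays have shape (h, w).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)  # shape (h, w, 3)

# Toy input: a flat surface 2 units from the camera.
depth = np.full((4, 6), 2.0)
points = backproject_depth(depth, fx=100.0, fy=100.0, cx=3.0, cy=2.0)
```

The pixel at the principal point maps to a point on the optical axis, and every other pixel spreads out in proportion to its depth.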

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses the problem of controllable character video synthesis. The authors propose a new framework, MIMO, that gives the user control over three attributes (character, motion, and scene) within a single framework, and that handles arbitrary characters, novel 3D motions, and interactive real-world scenes. Existing 3D methods require multi-view captures for per-case training, which limits their ability to model arbitrary characters in a short time. Recent 2D methods remove that limitation by using pre-trained diffusion models, but they struggle with pose generality and scene interaction. MIMO addresses these issues by encoding the 2D video into compact spatial codes that account for the inherently 3D nature of the captured scene. Concretely, MIMO lifts 2D frame pixels into 3D using monocular depth estimators and decomposes each video clip into three hierarchical layers based on 3D depth: the main human, the underlying scene, and floating occlusions. These components are then encoded into a canonical identity code, a structured motion code, and a full scene code, which serve as control signals during synthesis. This design allows flexible user control over video attributes, expression of complex motions, and 3D-aware synthesis for scene interactions. The authors report that experiments demonstrate the method's effectiveness and robustness.
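The depth-based split into three layers described above can be sketched as follows. This is an illustrative toy under stated assumptions, not the paper's pipeline: the function name `decompose_by_depth`, the precomputed `human_mask`, and the rule "non-human pixels nearer than the nearest human pixel count as occlusion" are all hypothetical choices made for the example, and smaller depth is assumed to mean closer to the camera.

```python
import numpy as np

def decompose_by_depth(frame, depth, human_mask):
    """Split one frame into human, occlusion, and scene layers.

    frame:      (h, w, c) image array
    depth:      (h, w) depth map, smaller = closer (assumption)
    human_mask: (h, w) boolean mask of the main human (assumed given)
    """
    human_depth = depth[human_mask]
    if human_depth.size == 0:
        raise ValueError("human_mask selects no pixels")

    # Illustrative rule: anything non-human that sits in front of the
    # nearest human pixel is treated as a floating occlusion.
    occlusion_mask = (~human_mask) & (depth < human_depth.min())
    # Everything else is the underlying scene.
    scene_mask = ~(human_mask | occlusion_mask)

    masks = {"human": human_mask, "occlusion": occlusion_mask, "scene": scene_mask}
    # Zero out pixels that do not belong to each layer.
    return {name: frame * mask[..., None] for name, mask in masks.items()}

# Toy 2x3 frame: the human occupies the middle column at depth 2,
# one nearer pixel (depth 1) occludes, the rest is background at depth 5.
frame = np.ones((2, 3, 3))
depth = np.array([[5.0, 2.0, 5.0],
                  [1.0, 2.0, 5.0]])
human_mask = np.array([[False, True, False],
                       [False, True, False]])
layers = decompose_by_depth(frame, depth, human_mask)
```

Because the three masks are disjoint and cover the whole frame, summing the layers reproduces the input frame, which is a convenient sanity check for any decomposition rule substituted here.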

Video-based Characters

MovieCharacter: A Tuning-Free Framework for Controllable Character Video Synthesis

COMD: Training-free Video Motion Transfer with Camera-Object Motion Disentanglement

CFSynthesis: Controllable and Free-view 3D Human Video Synthesis

I2VControl: Disentangled and Unified Video Motion Synthesis Control

Real-Time Neural Character Rendering with Pose-Guided Multiplane Images

ViMo: Generating Motions from Casual Videos

A Unified 3D Human Motion Synthesis Model Via Conditional Variational Auto-Encoder

Motion Control for Enhanced Complex Action Video Generation

Synthesizing Moving People with 3D Control

Video^M: Multi-video Synopsis

VideoComposer: Compositional Video Synthesis with Motion Controllability

Animate3D: Animating Any 3D Model with Multi-view Video Diffusion

MIMOSA: Human-AI Co-Creation of Computational Spatial Audio Effects on Videos

Stereo-Talker: Audio-driven 3D Human Synthesis with Prior-Guided Mixture-of-Experts

Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling

MIMO Is All You Need : A Strong Multi-In-Multi-Out Baseline for Video Prediction

Transferring of Speech Movements from Video to 3D Face Space

MOSO: Decomposing MOtion, Scene and Object for Video Prediction

ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis