Abstract: Estimating 3D motion from 2D observations is a long-standing research challenge. Prior work typically requires training on datasets containing ground truth 3D motions, limiting their applicability to activities well-represented in existing motion capture data. This dependency particularly hinders generalization to out-of-distribution scenarios or subjects where collecting 3D ground truth is challenging, such as complex athletic movements or animal motion. We introduce MVLift, a novel approach to predict global 3D motion -- including both joint rotations and root trajectories in the world coordinate system -- using only 2D pose sequences for training. Our multi-stage framework leverages 2D motion diffusion models to progressively generate consistent 2D pose sequences across multiple views, a key step in recovering accurate global 3D motion. MVLift generalizes across various domains, including human poses, human-object interactions, and animal poses. Despite not requiring 3D supervision, it outperforms prior work on five datasets, including those methods that require 3D supervision.

What problem does this paper attempt to address?

The core problem this paper addresses is **estimating global 3D motion from 2D observations**, especially for scenes or subjects (such as complex athletic movements or animal motion) that lack 3D ground-truth data. Specifically, the paper proposes MVLift, a method that predicts global 3D motion, including joint rotations and root trajectories, using only 2D pose sequences for training.

### Problem Background

Prior methods typically train on datasets containing ground-truth 3D motion, which limits them to activities that are well represented in existing motion-capture data. This dependence particularly hinders generalization to scenes or subjects where 3D ground truth is hard to collect, such as complex athletic movements or animal motion.

### Solution

MVLift adopts a multi-stage framework that uses 2D motion diffusion models to progressively generate consistent 2D pose sequences across multiple views, a key step in recovering accurate global 3D motion. The steps are:

1. **2D Motion Diffusion Model under Epipolar Constraint**: Train an epipolar-constraint-based diffusion model to generate 2D pose sequences that satisfy epipolar constraints.
2. **Multi-view 2D Motion Optimization**: Jointly optimize the multi-view 2D sequences to enforce the geometric relationships between views and keep the motion realistic.
3. **Synthetic Multi-view 2D Data Generation**: Use the optimized multi-view 2D sequences to recover realistic 3D motion through a 2D reprojection objective, and generate strictly consistent multi-view 2D sequences.
4. **Multi-view 2D Motion Diffusion Model**: Train a specialized diffusion model on the synthesized multi-view 2D dataset to directly generate cross-view-consistent 2D sequences.

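The epipolar constraint mentioned in step 1 can be illustrated with a small geometric check. The sketch below is not MVLift's implementation: it assumes two calibrated pinhole views with identity intrinsics, a hypothetical relative pose (`R`, `t`), and a single hypothetical joint position, and it only shows what "consistent across views" means for one 2D joint pair. For such a pair, the residual `x2^T E x1` with `E = [t]_x R` is zero when both 2D joints are projections of the same 3D point.

```python
import math

def rot_y(theta):
    """Rotation about the vertical (y) axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v equals t x v."""
    return [[0.0, -t[2], t[1]], [t[2], 0.0, -t[0]], [-t[1], t[0], 0.0]]

def project(X):
    """Pinhole projection with identity intrinsics (an assumption)."""
    return [X[0] / X[2], X[1] / X[2], 1.0]

def essential(R, t):
    """Essential matrix E = [t]_x R for the pose X2 = R X1 + t."""
    return matmul(skew(t), R)

def epipolar_residual(x1, x2, E):
    """Algebraic epipolar error x2^T E x1 (zero for a consistent pair)."""
    Ex1 = matvec(E, x1)
    return sum(x2[i] * Ex1[i] for i in range(3))

# Hypothetical setup: view 2 is view 1 rotated 45 degrees and shifted.
R = rot_y(math.pi / 4)
t = [-2.0, 0.0, 1.0]
E = essential(R, t)

# One hypothetical 3D joint, expressed in the frame of view 1.
joint_view1 = [0.3, -0.5, 4.0]
joint_view2 = [a + b for a, b in zip(matvec(R, joint_view1), t)]

x1 = project(joint_view1)
x2 = project(joint_view2)
# A 2D joint that does not correspond to the same 3D point.
x2_bad = [x2[0], x2[1] + 0.05, 1.0]

print("consistent pair residual:  ", epipolar_residual(x1, x2, E))
print("inconsistent pair residual:", epipolar_residual(x1, x2_bad, E))
```

A residual of this kind, summed over joints and frames, is one plausible way to score how well two generated 2D sequences agree geometrically; the loss actually used in the paper may differ.
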
### Main Contributions

- Proposes MVLift, a framework that estimates global 3D motion from single-view 2D pose sequences without any 3D training data.
- Shows how multi-view consistency can be established progressively through 2D motion diffusion, offering a new perspective on 3D motion estimation.
- Reports that MVLift outperforms prior work on five datasets, including methods that rely on 3D supervision.

### Summary

The paper's main objective is to estimate global 3D motion from 2D observations when 3D ground-truth data is unavailable. MVLift achieves this through a multi-stage framework built on 2D motion diffusion models, and it generalizes across domains including human poses, human-object interactions, and animal poses.

Lifting Motion to the 3D World via 2D Diffusion

Forecasting Distillation: Enhancing 3D Human Motion Prediction with Guidance Regularization

Motion Diffusion-Guided 3D Global HMR from a Dynamic Camera

Executing Your Commands Via Motion Diffusion in Latent Space.

MAS: Multi-view Ancestral Sampling for 3D motion generation using 2D diffusion

Avatars Grow Legs: Generating Smooth Human Motion from Sparse Tracking Inputs with Diffusion Model

Synthesizing Moving People with 3D Control

BoDiffusion: Diffusing Sparse Observations for Full-Body Human Motion Synthesis

E-Motion: Future Motion Simulation via Event Sequence Diffusion

RMD: A Simple Baseline for More General Human Motion Generation via Training-free Retrieval-Augmented Motion Diffuse

A Unified Diffusion Framework for Scene-aware Human Motion Estimation from Sparse Signals

MDMP: Multi-modal Diffusion for supervised Motion Predictions with uncertainty

TransFusion: A Practical and Effective Transformer-based Diffusion Model for 3D Human Motion Prediction

DiverseMotion: Towards Diverse Human Motion Generation Via Discrete Diffusion

MotionDiffuser: Controllable Multi-Agent Motion Prediction using Diffusion

Realistic Human Motion Generation with Cross-Diffusion Models

M2Diffuser: Diffusion-based Trajectory Optimization for Mobile Manipulation in 3D Scenes

Human Joint Kinematics Diffusion-Refinement for Stochastic Motion Prediction

Learning Human Motion from Monocular Videos via Cross-Modal Manifold Alignment

Sim2real transfer learning for 3D human pose estimation: motion to the rescue

Lift3D: Zero-Shot Lifting of Any 2D Vision Model to 3D