Lifting Motion to the 3D World via 2D Diffusion

Jiaman Li, C. Karen Liu, Jiajun Wu
2024-11-28
Abstract: Estimating 3D motion from 2D observations is a long-standing research challenge. Prior work typically requires training on datasets containing ground truth 3D motions, limiting their applicability to activities well-represented in existing motion capture data. This dependency particularly hinders generalization to out-of-distribution scenarios or subjects where collecting 3D ground truth is challenging, such as complex athletic movements or animal motion. We introduce MVLift, a novel approach to predict global 3D motion -- including both joint rotations and root trajectories in the world coordinate system -- using only 2D pose sequences for training. Our multi-stage framework leverages 2D motion diffusion models to progressively generate consistent 2D pose sequences across multiple views, a key step in recovering accurate global 3D motion. MVLift generalizes across various domains, including human poses, human-object interactions, and animal poses. Despite not requiring 3D supervision, it outperforms prior work on five datasets, including those methods that require 3D supervision.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The core problem this paper addresses is **estimating global 3D motion from 2D observations**, especially for scenes or subjects (such as complex athletic movements or animal motion) that lack 3D ground-truth data. Specifically, the paper proposes MVLift, a method that predicts global 3D motion, including joint rotations and root trajectories in the world coordinate system, using only 2D pose sequences for training.

### Problem Background

Prior methods typically rely on datasets containing ground-truth 3D motion for training, which limits their applicability to activities that are well represented in existing motion-capture data. This dependence particularly hinders generalization to scenes or subjects where 3D ground truth is difficult to collect, such as complex athletic movements or animal motion.

### Solution

MVLift adopts a multi-stage framework that uses 2D motion diffusion models to progressively generate consistent 2D pose sequences across multiple views, a key step in recovering accurate global 3D motion. The stages are:

1. **2D Motion Diffusion Model under Epipolar Constraints**: Train a diffusion model that generates 2D pose sequences following epipolar constraints.
2. **Multi-view 2D Motion Optimization**: Jointly optimize the multi-view 2D sequences so that they respect the geometric relationships between views while the motion stays realistic.
3. **Synthetic Multi-view 2D Data Generation**: Use the optimized multi-view 2D sequences to recover realistic 3D motion through a 2D reprojection objective, and generate strictly consistent multi-view 2D sequences.
4. **Multi-view 2D Motion Diffusion Model**: Train a dedicated diffusion model on the synthesized multi-view 2D dataset to directly generate cross-view-consistent 2D sequences.
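
The two geometric ideas named in the steps above, the epipolar constraint between views and the recovery of 3D joints from consistent multi-view 2D poses, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the camera intrinsics, the four cameras placed around the subject, the 17-joint random "pose", and all function names are assumptions made for illustration. It also substitutes plain linear (DLT) triangulation for the optimization-based 2D reprojection objective that the summary describes.

```python
import numpy as np


def projection_matrix(K, R, t):
    """Pinhole projection P = K [R | t] (3x4)."""
    return K @ np.hstack([R, np.asarray(t, dtype=float).reshape(3, 1)])


def project(P, X):
    """Project (J, 3) world points to (J, 2) pixel coordinates."""
    Xh = np.hstack([X, np.ones((len(X), 1))])
    x = Xh @ P.T
    return x[:, :2] / x[:, 2:3]


def skew(v):
    """Cross-product matrix [v]_x."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])


def fundamental_from_projections(P1, P2):
    """F with x2^T F x1 = 0, built as F = [e2]_x P2 P1^+."""
    _, _, Vt = np.linalg.svd(P1)
    C1 = Vt[-1]          # camera centre of view 1 (null vector of P1)
    e2 = P2 @ C1         # epipole in view 2
    return skew(e2) @ P2 @ np.linalg.pinv(P1)


def epipolar_distance(F, x1, x2):
    """Pixel distance from each x2 to the epipolar line of its x1."""
    ones = np.ones((len(x1), 1))
    lines = np.hstack([x1, ones]) @ F.T      # l2 = F x1, one row per joint
    x2h = np.hstack([x2, ones])
    return np.abs(np.sum(lines * x2h, axis=1)) / np.linalg.norm(lines[:, :2], axis=1)


def triangulate(Ps, xs):
    """Linear (DLT) triangulation: xs is (V, J, 2), returns (J, 3)."""
    n_joints = xs.shape[1]
    X = np.zeros((n_joints, 3))
    for j in range(n_joints):
        rows = []
        for P, x in zip(Ps, xs[:, j]):
            rows.append(x[0] * P[2] - P[0])
            rows.append(x[1] * P[2] - P[1])
        _, _, Vt = np.linalg.svd(np.stack(rows))
        X[j] = Vt[-1][:3] / Vt[-1][3]
    return X


def rot_y(deg):
    """Rotation about the vertical (y) axis."""
    a = np.deg2rad(deg)
    return np.array([[np.cos(a), 0.0, np.sin(a)],
                     [0.0, 1.0, 0.0],
                     [-np.sin(a), 0.0, np.cos(a)]])


# Hypothetical setup: four cameras around the subject, 5 units away.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
Ps = [projection_matrix(K, rot_y(d), [0.0, 0.0, 5.0]) for d in (0, 90, 180, 270)]

rng = np.random.default_rng(0)
joints_3d = rng.uniform(-1.0, 1.0, size=(17, 3))             # stand-in for one pose
views_2d = np.stack([project(P, joints_3d) for P in Ps])     # (4, 17, 2)

F01 = fundamental_from_projections(Ps[0], Ps[1])
print("max epipolar distance (px):", epipolar_distance(F01, views_2d[0], views_2d[1]).max())
print("max triangulation error:", np.abs(triangulate(Ps, views_2d) - joints_3d).max())
```

When the 2D poses in all views come from the same 3D pose, the epipolar distances are numerically zero and triangulation recovers the joints. Inconsistent views would produce large epipolar distances, which is the kind of violation the multi-view optimization stage is described as reducing.
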
### Main Contributions

- Propose MVLift, a new framework that estimates global 3D motion from single-view 2D pose sequences without any 3D training data.
- Demonstrate how to progressively establish multi-view consistency through 2D motion diffusion, offering a new perspective on 3D motion estimation.
- Show experimentally that MVLift outperforms prior work on five datasets, including methods that require 3D supervision.

### Summary

The paper's main objective is to estimate global 3D motion from 2D observations when 3D ground-truth data is unavailable. MVLift achieves this through a multi-stage framework built on 2D motion diffusion models, and it generalizes across domains including human poses, human-object interactions, and animal poses.