Abstract: Human motion capture from monocular videos has made significant progress in recent years. However, modern approaches often produce temporal artifacts, e.g., in the form of jittery motion, and struggle to achieve smooth and physically plausible motions. Explicitly integrating physics, in the form of internal forces and exterior torques, helps alleviate these artifacts. Current state-of-the-art approaches make use of an automatic PD controller to predict torques and reaction forces in order to re-simulate the input kinematics, i.e., the joint angles of a predefined skeleton. However, due to imperfect physical models, these methods often require simplifying assumptions and extensive preprocessing of the input kinematics to achieve good performance. To this end, we propose a novel method to selectively incorporate the physics models with the kinematics observations in an online setting, inspired by a neural Kalman-filtering approach. We develop a control loop as a meta-PD controller to predict internal joint torques and external reaction forces, followed by a physics-based motion simulation. A recurrent neural network is introduced to realize a Kalman filter that attentively balances the kinematics input and simulated motion, resulting in an optimal-state dynamics prediction. We show that this filtering step is crucial to provide an online supervision that helps balance the shortcomings of the respective input motions, thus being important not only for capturing accurate global motion trajectories but also for producing physically plausible human poses. The proposed approach excels in the physics-based human pose estimation task and demonstrates the physical plausibility of the predictive dynamics, compared to the state of the art. The code is available at https://github.com/cuongle1206/OSDCap

What problem does this paper attempt to address?

### What problem does this paper attempt to solve?

This paper addresses two problems in human motion capture from monocular videos: the temporal artifacts (such as jitter) that existing methods often produce, and the difficulty of achieving smooth, physically plausible motion. Specifically, modern methods often cannot fully eliminate these artifacts, so the generated motions do not look realistic or natural. To overcome this, the authors propose OSDCap (Optimal-state Dynamics Estimation for Physics-based Human Motion Capture from Videos), which combines kinematics observations with physics simulation to estimate the optimal-state dynamics in an online manner. By introducing a learnable Kalman filter and an inertial bias matrix, OSDCap captures the global motion trajectory more accurately and generates physically plausible poses.

#### Main challenges:

1. **Temporal artifacts**: The motions generated by existing methods usually contain temporal artifacts such as jitter.
2. **Physical plausibility**: Existing video-based human motion capture methods have difficulty ensuring the physical plausibility of motion.
3. **Noise processing**: 3D pose estimates from monocular videos are usually noisy, which reduces prediction accuracy.

#### Solutions:

OSDCap addresses these problems through the following steps:

1. **Dynamics simulation**: Use a meta-PD controller to predict joint torques and external reaction forces, and then perform physics-based motion simulation.
2. **Adaptive filtering**: Introduce a recurrent neural network that implements a Kalman filter, balancing the kinematics input against the simulated motion to produce an optimal-state dynamics prediction.
3. **Inertial bias matrix**: Predict an inertial bias matrix that corrects the initial inertia matrix and reduces the error caused by simplifying the human body structure.

Through these improvements, OSDCap improves the accuracy of motion capture while keeping the motion physically plausible, building a bridge between computer vision and complex human motion modeling.

### Formula summary

- **Newton's equations of motion**: \[ M(q)\ddot{q}=\tau + \lambda - h(q,\dot{q}) \] where \(M(q)\in\mathbb{R}^{(6 + 3N)\times(6 + 3N)}\) is the inertia matrix, \(\ddot{q}\in\mathbb{R}^{6+3N}\) is the acceleration, \(\tau\in\mathbb{R}^{6+3N}\) is the internal joint torque, \(\lambda\in\mathbb{R}^{6+3N}\) is the external force, and \(h(q,\dot{q})\in\mathbb{R}^{6+3N}\) collects the gravity, Coriolis, and centrifugal terms.
- **Kalman filter update**: \[ q_{t + 1|t+1}=q_{t+1|t}+K_t(C\hat{q}_t - Hq_{t+1|t}) \] where \(K_t\) is the Kalman gain matrix, and \(C\) and \(H\) are the adaptation matrix and the observation matrix, respectively.
- **PD controller predicts joint torque**: \[ \tau_t=\kappa_P(q_{t+1|t+1}-q_{t+1|t})+\kappa_D\dot{q}_{t|t} \]
- **External force estimation**: \[ \lambda_t = 2\sum_{c = 1}^{J_c}\rho_c^tf_c^t \]

Together, these formulas form the core of the OSDCap algorithm.
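
To make the interplay of the formulas above more concrete, here is a minimal sketch of one simulate-then-filter loop for a single degree of freedom. It is not the OSDCap implementation: the function names, the fixed scalar gain `K`, the constant inertia `M`, the semi-implicit Euler integrator, and the textbook PD law (with a negative damping term and the observation as target) are all assumptions made for illustration. In the paper, the gain and the PD parameters are predicted by neural networks, and the state has \(6+3N\) dimensions.

```python
def forward_dynamics(M, tau, lam, h):
    """Solve M * qdd = tau + lam - h for the acceleration (scalar case)."""
    return (tau + lam - h) / M


def simulate_step(q, qd, M, tau, lam, h, dt):
    """Semi-implicit Euler: update velocity first, then position."""
    qdd = forward_dynamics(M, tau, lam, h)
    qd_next = qd + dt * qdd
    q_next = q + dt * qd_next
    return q_next, qd_next


def kalman_update(q_pred, q_obs, K, C=1.0, H=1.0):
    """q_{t+1|t+1} = q_{t+1|t} + K * (C * q_obs - H * q_{t+1|t})."""
    return q_pred + K * (C * q_obs - H * q_pred)


def pd_torque(q_target, q, qd, kp, kd):
    """Textbook PD law (an assumption, not the paper's meta-PD controller)."""
    return kp * (q_target - q) - kd * qd


def track(observations, dt=0.01, M=1.0, kp=400.0, kd=40.0, K=0.2):
    """Blend noisy kinematic observations with a simulated state.

    All default values are hypothetical and chosen only for the toy example.
    External force (lam) and bias term (h) are set to zero for simplicity.
    """
    q, qd = observations[0], 0.0
    filtered = [q]
    for q_obs in observations[1:]:
        tau = pd_torque(q_obs, q, qd, kp, kd)            # torque toward the observation
        q_pred, qd = simulate_step(q, qd, M, tau, 0.0, 0.0, dt)  # physics prediction
        q = kalman_update(q_pred, q_obs, K)              # balance prediction and observation
        filtered.append(q)
    return filtered


if __name__ == "__main__":
    noisy = [0.0, 0.12, 0.18, 0.33, 0.38, 0.52]
    print(track(noisy))
```

With `K = 0` the loop trusts the simulation alone, and with `K = 1` (and `C = H = 1`) it reproduces the observations exactly; a learned, time-varying gain lets the filter move between these two extremes frame by frame.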

Optimal-state Dynamics Estimation for Physics-based Human Motion Capture from Videos

Physics-Guided Human Motion Capture with Pose Probability Modeling

Real-time Physics-based Motion Capture with Sparse Sensors

Physics-based Human Motion Estimation and Synthesis from Videos

Physical Inertial Poser (PIP): Physics-aware Real-time Human Motion Tracking from Sparse Inertial Sensors

D&D: Learning Human Dynamics from Dynamic Camera

Contact and Human Dynamics from Monocular Video

Neural MoCon: Neural Motion Control for Physically Plausible Human Motion Capture

Leveraging depth cameras and wearable pressure sensors for full-body kinematics and dynamics capture

PhysPT: Physics-aware Pretrained Transformer for Estimating Human Dynamics from Monocular Videos

Kinematics Modeling Network for Video-based Human Pose Estimation

Efficient Human Motion Reconstruction from Monocular Videos with Physical Consistency Loss

DragPoser: Motion Reconstruction from Variable Sparse Tracking Signals via Latent Space Optimization

AddBiomechanics Dataset: Capturing the Physics of Human Motion at Scale

Accurate Real-Time Joint Torque Estimation for Dynamic Prediction of Human Locomotion

SimPoE: Simulated Character Control for 3D Human Pose Estimation

PhysMotion: Physics-Grounded Dynamics From a Single Image

DROP: Dynamics Responses from Human Motion Prior and Projective Dynamics

Physics-Based Object 6D-Pose Estimation during Non-Prehensile Manipulation