Abstract: We present a solution to egocentric 3D body pose estimation from monocular images captured from downward-looking fish-eye cameras installed on the rim of a head-mounted VR device. This unusual viewpoint leads to images with unique visual appearance, with severe self-occlusions and perspective distortions that result in drastic differences in resolution between lower and upper body. We propose an encoder-decoder architecture with a novel multi-branch decoder designed to account for the varying uncertainty in 2D predictions. The quantitative evaluation, on synthetic and real-world datasets, shows that our strategy leads to substantial improvements in accuracy over state-of-the-art egocentric approaches. To tackle the lack of labelled data we also introduced a large photo-realistic synthetic dataset. xR-EgoPose offers high-quality renderings of people with diverse skin tones, body shapes and clothing, performing a range of actions. Our experiments show that the high variability in our new synthetic training corpus leads to good generalization to real-world footage and to state-of-the-art results on real-world datasets with ground truth. Moreover, an evaluation on the Human3.6M benchmark shows that the performance of our method is on par with top-performing approaches on the more classic problem of 3D human pose from a third-person viewpoint.

### What problem does this paper attempt to solve?

This paper addresses egocentric (first-person) 3D human pose estimation from fisheye cameras installed on head-mounted virtual reality (VR) devices. Specifically, it focuses on how to accurately estimate the positions and rotations of human joints from images captured by a downward-looking monocular fisheye camera.

#### Main challenges

1. **Strong perspective distortion**: Due to the fisheye lens and the very short distance between the camera and the face, the image has strong radial distortion, and the resolution difference between the upper and lower body is extremely large.
2. **Severe self-occlusion**: In the first-person view, parts of the body are easily occluded, especially the lower body, which places high demands on spatial reasoning.
3. **Data scarcity**: First-person-view 3D human pose estimation is a relatively new research field, and publicly available annotated datasets are very limited.
4. **Natural ambiguity**: There is an inherent ambiguity when lifting 2D joint positions to 3D.

#### Solutions

To solve these problems, the authors propose a new encoder-decoder architecture that includes a multi-branch decoder specifically designed to handle the varying uncertainty of the 2D joint predictions. In addition, to address the shortage of training data, the authors introduce a large-scale, photo-realistic synthetic dataset, xR-EgoPose, which contains 383,000 frames of high-quality rendered images covering a variety of skin colors, body types, clothing, backgrounds and lighting conditions.
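To make the multi-branch idea concrete, here is a minimal NumPy sketch of an encoder followed by two decoder branches. It is an illustration under stated assumptions, not the authors' network: the class name `MultiBranchAutoEncoder`, the single linear layer per block, and the sizes (16 joints, 8x8 heat maps, a 20-dimensional latent vector) are all hypothetical. The choice of a 3D pose branch and a heat-map reconstruction branch follows the loss terms listed in the formula section of this summary; the rotation branch is omitted for brevity.

```python
import numpy as np


class MultiBranchAutoEncoder:
    """Toy encoder with two decoder branches (all layer sizes are hypothetical)."""

    def __init__(self, num_joints=16, hm_size=8, latent_dim=20, seed=0):
        rng = np.random.default_rng(seed)
        self.num_joints = num_joints
        hm_dim = num_joints * hm_size * hm_size
        # One linear layer per block keeps the sketch short; a real model is deeper.
        self.w_enc = rng.normal(0.0, 0.01, (hm_dim, latent_dim))
        self.w_pose = rng.normal(0.0, 0.01, (latent_dim, num_joints * 3))
        self.w_hm = rng.normal(0.0, 0.01, (latent_dim, hm_dim))

    def forward(self, heatmaps):
        """Map heat maps of shape (batch, joints, H, W) to (pose, rebuilt heat maps)."""
        batch = heatmaps.shape[0]
        flat = heatmaps.reshape(batch, -1)
        latent = np.maximum(flat @ self.w_enc, 0.0)  # ReLU encoder
        # Branch 1: regress 3D joint positions from the latent vector.
        pose = (latent @ self.w_pose).reshape(batch, self.num_joints, 3)
        # Branch 2: rebuild the input heat maps from the same latent vector.
        hm_rec = (latent @ self.w_hm).reshape(heatmaps.shape)
        return pose, hm_rec


model = MultiBranchAutoEncoder()
dummy = np.random.default_rng(1).random((2, 16, 8, 8))
pose, hm_rec = model.forward(dummy)
print(pose.shape, hm_rec.shape)  # (2, 16, 3) (2, 16, 8, 8)
```

Because both branches read the same latent vector, asking the second branch to reproduce the input heat maps pushes the encoder to keep the spread of each heat map, not just its peak location.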
#### Experimental results

Quantitative and qualitative evaluations on synthetic and real-world datasets show that the method is substantially more accurate than existing egocentric pose estimation methods, and that it is on par with top-performing approaches on the standard third-person-view 3D human pose estimation task (Human3.6M).

### Formula representation

The loss functions and error terms used in the paper are as follows:

1. **Loss function for 2D pose detection**:

   \[ L_{2D}=\text{mse}(HM,\hat{HM}) \]

   where \(HM\) is the ground-truth heat map and \(\hat{HM}\) is the predicted heat map.

2. **Overall loss function of the auto-encoder**:

   \[ L_{AE}=\lambda_p(\|P - \hat{P}\|_2+W(P,\hat{P}))+\lambda_r\|\hat{R}-r(\hat{P})\|_2+\lambda_{hm}\|\hat{HM}-\tilde{HM}\|_2 \]

   where:
   - \(P\) is the ground-truth 3D pose;
   - \(\hat{P}\) is the predicted 3D pose;
   - \(\hat{R}\) is the predicted local joint rotation;
   - \(r(\hat{P})\) is the local joint rotation extracted from the predicted pose;
   - \(\tilde{HM}\) is the heat map regressed from the latent space;
   - \(\hat{HM}\) is the heat map generated by the 2D pose estimation module.

3. **Regularization term for 3D pose**:

   \[ W(P,\hat{P})=\lambda_\theta\theta(P,\hat{P})+\lambda_LL(P,\hat{P}) \]

   where \(P_l\) denotes the \(l\)-th limb of the pose and:
   - \(\theta(P,\hat{P})=\sum_{l}\frac{P_l\cdot\hat{P}_l}{\|P_l\|\|\hat{P}_l\|}\) is the cosine similarity term;
   - \(L(P,\hat{P})=\sum_{l}\|P_l - \hat{P}_l\|\) is the limb length error.

Together, these terms train the model to predict 3D poses while retaining the uncertainty information carried by the 2D heat maps.
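The loss terms above can be sketched in NumPy as follows. This is a transcription of the formulas as written in this summary, not the authors' code. Several details are assumptions: limbs are taken to be child-minus-parent joint vectors given by a hypothetical `limbs` list of index pairs, the rotations \(r(\hat{P})\) are passed in precomputed because the extraction function is not specified here, and all weights default to 1.0 purely as placeholders. Note that \(\theta\) as written is a similarity, so it equals the number of limbs when the two poses coincide; the sketch reproduces the formula as given rather than guessing at a sign or offset.

```python
import numpy as np


def heatmap_mse(hm_true, hm_pred):
    """L_2D: mean squared error between ground-truth and predicted heat maps."""
    return float(np.mean((hm_true - hm_pred) ** 2))


def limb_vectors(pose, limbs):
    """Limb vectors P_l for a pose of shape (J, 3); limbs holds (parent, child) pairs."""
    parents = [p for p, _ in limbs]
    children = [c for _, c in limbs]
    return pose[children] - pose[parents]


def cosine_term(pose_true, pose_pred, limbs, eps=1e-8):
    """theta(P, P_hat): sum over limbs of the cosine similarity of limb vectors."""
    a = limb_vectors(pose_true, limbs)
    b = limb_vectors(pose_pred, limbs)
    dots = np.sum(a * b, axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + eps
    return float(np.sum(dots / norms))


def limb_length_term(pose_true, pose_pred, limbs):
    """L(P, P_hat): sum over limbs of ||P_l - P_hat_l||."""
    a = limb_vectors(pose_true, limbs)
    b = limb_vectors(pose_pred, limbs)
    return float(np.sum(np.linalg.norm(a - b, axis=1)))


def autoencoder_loss(pose_true, pose_pred, rot_pred, rot_from_pose,
                     hm_detected, hm_decoded, limbs,
                     lam_p=1.0, lam_r=1.0, lam_hm=1.0,
                     lam_theta=1.0, lam_len=1.0):
    """L_AE as written above; every lambda default is a placeholder value."""
    # W(P, P_hat) = lambda_theta * theta + lambda_L * L
    w = (lam_theta * cosine_term(pose_true, pose_pred, limbs)
         + lam_len * limb_length_term(pose_true, pose_pred, limbs))
    pose_err = np.linalg.norm((pose_true - pose_pred).ravel())
    rot_err = np.linalg.norm((rot_pred - rot_from_pose).ravel())
    hm_err = np.linalg.norm((hm_detected - hm_decoded).ravel())
    return float(lam_p * (pose_err + w) + lam_r * rot_err + lam_hm * hm_err)
```

A quick check on a three-joint chain with `limbs = [(0, 1), (1, 2)]`: passing the same pose as both ground truth and prediction gives a limb length error of 0 and a cosine term of about 2, one per limb.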

SelfPose: 3D Egocentric Pose Estimation from a Headset Mounted Camera

xR-EgoPose: Egocentric 3D Human Pose from an HMD Camera

Scene-aware Egocentric 3D Human Pose Estimation

SimpleEgo: Predicting Probabilistic Body Pose from Egocentric Cameras

3D Human Pose Perception from Egocentric Stereo Videos

Efficient Multi-person Hierarchical 3D Pose Estimation for Autonomous Driving

Ego3DPose: Capturing 3D Cues from Binocular Egocentric Views

EgoPoseFormer: A Simple Baseline for Stereo Egocentric 3D Human Pose Estimation

EgoPoser: Robust Real-Time Egocentric Pose Estimation from Sparse and Intermittent Observations Everywhere

Ego-Body Pose Estimation via Ego-Head Pose Estimation

EgoFish3D: Egocentric 3D Pose Estimation from a Fisheye Camera Via Self-Supervised Learning

Seeing Invisible Poses: Estimating 3D Body Pose from Egocentric Video

Mo2Cap2: Real-time Mobile 3D Motion Capture with a Cap-mounted Fisheye Camera

4D Human Body Capture from Egocentric Video via 3D Scene Grounding

Ego+X: an Egocentric Vision System for Global 3D Human Pose Estimation and Social Interaction Characterization

EgoHumans: An Egocentric 3D Multi-Human Benchmark

EgoCap: Egocentric Marker-less Motion Capture with Two Fisheye Cameras

Unsupervised Universal Hierarchical Multi-Person 3D Pose Estimation for Natural Scenes

EgoBody: Human Body Shape and Motion of Interacting People from Head-Mounted Devices

You2Me: Inferring Body Pose in Egocentric Video via First and Second Person Interactions

Estimating Ego-Body Pose from Doubly Sparse Egocentric Video Data