Denis Tome, Thiemo Alldieck, Patrick Peluse, Gerard Pons-Moll, Lourdes Agapito, Hernan Badino, Fernando De la Torre
Abstract: We present a solution to egocentric 3D body pose estimation from monocular images captured from downward-looking fish-eye cameras installed on the rim of a head-mounted VR device. This unusual viewpoint leads to images with unique visual appearance, with severe self-occlusions and perspective distortions that result in drastic differences in resolution between lower and upper body. We propose an encoder-decoder architecture with a novel multi-branch decoder designed to account for the varying uncertainty in 2D predictions. The quantitative evaluation, on synthetic and real-world datasets, shows that our strategy leads to substantial improvements in accuracy over state-of-the-art egocentric approaches. To tackle the lack of labelled data we also introduced a large photo-realistic synthetic dataset. xR-EgoPose offers high quality renderings of people with diverse skin tones, body shapes and clothing, performing a range of actions. Our experiments show that the high variability in our new synthetic training corpus leads to good generalization to real-world footage and to state-of-the-art results on real-world datasets with ground truth. Moreover, an evaluation on the Human3.6M benchmark shows that the performance of our method is on par with top performing approaches on the more classic problem of 3D human pose from a third-person viewpoint.
### What problem does this paper attempt to solve?
This paper addresses egocentric (first-person) 3D human pose estimation from fisheye cameras installed on head-mounted virtual reality (VR) devices. Specifically, it focuses on accurately estimating the positions and rotations of body joints from images captured by a downward-looking monocular fisheye camera.
#### Main challenges
1. **Strong perspective distortion**: Because of the fisheye lens and the very short distance between the camera and the face, the image has strong radial distortion, and the resolution of the upper body is far higher than that of the lower body.
2. **Severe self-occlusion**: In the first-person view, parts of the body, especially the lower body, are easily occluded, so the model must reason about body parts it cannot see.
3. **Data scarcity**: First-person 3D human pose estimation is a relatively new research field, and publicly available annotated datasets are very limited.
4. **Inherent ambiguity**: Lifting 2D joint positions to 3D is inherently ambiguous, because many 3D poses project to the same 2D positions.
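The ambiguity in the last item can be illustrated with a minimal sketch. It assumes an ideal pinhole camera purely for simplicity (the paper's camera is a fisheye, whose projection model differs), and the function name `project_pinhole` and the sample points are hypothetical:

```python
def project_pinhole(point, focal=1.0):
    """Project a 3D point (x, y, z) to 2D with an ideal pinhole model.

    Assumption: pinhole projection is used only to illustrate depth
    ambiguity; it is not the fisheye model of the paper's camera.
    """
    x, y, z = point
    return (focal * x / z, focal * y / z)


# Two different 3D points on the same viewing ray through the camera centre.
near_joint = (0.2, -0.1, 0.5)
far_joint = (0.4, -0.2, 1.0)

# Both land on the same 2D location, so the 2D position alone
# cannot tell the two depths apart.
print(project_pinhole(near_joint))
print(project_pinhole(far_joint))
```

Any point along the same ray produces the same 2D observation, which is why additional cues such as learned pose priors are needed to recover depth.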
#### Solutions
To address these challenges, the authors propose a new encoder-decoder architecture with a multi-branch decoder specifically designed to account for the varying uncertainty of the 2D joint predictions. In addition, to address the lack of training data, the authors introduce xR-EgoPose, a large-scale, photo-realistic synthetic dataset that contains 383,000 frames of high-quality rendered images covering a variety of skin tones, body shapes, clothing, backgrounds and lighting conditions.
#### Experimental results
Quantitative and qualitative evaluations on synthetic and real-world datasets show that the method is substantially more accurate than existing egocentric pose estimation methods. On the Human3.6M benchmark it is also on par with top-performing approaches for the more standard task of 3D human pose estimation from a third-person viewpoint.
### Formula representation
The loss functions and error terms used in the paper are as follows:
1. **Loss function for 2D pose detection**:
\[
L_{2D}=\text{mse}(HM,\hat{HM})
\]
where \(HM\) is the ground-truth heat map and \(\hat{HM}\) is the predicted heat map.
2. **Overall loss function of the auto-encoder**:
\[
L_{AE}=\lambda_p(\|P - \hat{P}\|_2+W(P,\hat{P}))+\lambda_r\|\hat{R}-r(\hat{P})\|_2+\lambda_{hm}\|\hat{HM}-\tilde{HM}\|_2
\]
where:
- \(P\) is the ground-truth 3D pose;
- \(\hat{P}\) is the predicted 3D pose;
- \(\hat{R}\) is the predicted local joint rotation;
- \(r(\hat{P})\) is the local joint rotation extracted from the predicted pose;
- \(\tilde{HM}\) is the heat map regressed from the latent space;
- \(\hat{HM}\) is the heat map generated by the 2D pose estimation module.
3. **Regularization term for 3D pose**:
\[
W(P,\hat{P})=\lambda_\theta\theta(P,\hat{P})+\lambda_LL(P,\hat{P})
\]
where:
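- \(\theta(P,\hat{P})=\sum_{l}\frac{P_l\cdot\hat{P}_l}{\|P_l\|\|\hat{P}_l\|}\) is the cosine-similarity term, summed over the limbs \(l\) of the pose;
- \(L(P,\hat{P})=\sum_{l}\|P_l - \hat{P}_l\|\) is the limb-length error, where \(P_l\) and \(\hat{P}_l\) denote the \(l\)-th limb of the ground-truth and predicted pose.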
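Together, these terms supervise the predicted 3D pose, the predicted joint rotations, and the heat maps regressed from the latent space. The heat-map reconstruction term is what encourages the latent representation to retain the uncertainty of the 2D predictions.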
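The three formulas above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: poses are lists of 3D joint coordinates, limbs are taken to be child-minus-parent joint vectors over a hypothetical `skeleton` list of `(parent, child)` index pairs, heat maps and rotations are flat lists of numbers, and all weights default to a placeholder value of 1.0 because the summary does not give them. The cosine term is implemented literally as the summed cosine similarity shown above; turning it into a quantity to minimise (for example by negating it) is not specified here and is left out.

```python
import math


def mse(a, b):
    """L_2D: mean squared error between two flat heat maps."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def l2(a, b):
    """Euclidean norm of the element-wise difference a - b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def norm(v):
    """Euclidean norm of a single vector."""
    return math.sqrt(sum(x * x for x in v))


def flatten(vectors):
    """Concatenate a list of 3D joints into one flat list."""
    return [c for v in vectors for c in v]


def limb_vectors(pose, skeleton):
    """Assumed limb definition: child joint minus parent joint."""
    return [[c - p for p, c in zip(pose[i], pose[j])] for i, j in skeleton]


def cosine_term(limbs, limbs_hat):
    """theta(P, P_hat): sum over limbs of the cosine similarity, as written."""
    total = 0.0
    for a, b in zip(limbs, limbs_hat):
        dot = sum(x * y for x, y in zip(a, b))
        total += dot / (norm(a) * norm(b))
    return total


def limb_term(limbs, limbs_hat):
    """L(P, P_hat): sum over limbs of ||P_l - P_hat_l||."""
    return sum(l2(a, b) for a, b in zip(limbs, limbs_hat))


def regularizer(P, P_hat, skeleton, lam_theta=1.0, lam_L=1.0):
    """W(P, P_hat) = lam_theta * theta + lam_L * L."""
    limbs = limb_vectors(P, skeleton)
    limbs_hat = limb_vectors(P_hat, skeleton)
    return (lam_theta * cosine_term(limbs, limbs_hat)
            + lam_L * limb_term(limbs, limbs_hat))


def ae_loss(P, P_hat, R_hat, r_P_hat, HM_hat, HM_tilde, skeleton,
            lam_p=1.0, lam_r=1.0, lam_hm=1.0, lam_theta=1.0, lam_L=1.0):
    """L_AE: weighted sum of the pose, rotation and heat-map terms.

    r_P_hat stands for r(P_hat), precomputed by the caller, since the
    rotation-extraction function r is not specified in this summary.
    """
    pose_term = l2(flatten(P), flatten(P_hat)) + regularizer(
        P, P_hat, skeleton, lam_theta, lam_L)
    rotation_term = l2(R_hat, r_P_hat)
    heatmap_term = l2(HM_hat, HM_tilde)
    return lam_p * pose_term + lam_r * rotation_term + lam_hm * heatmap_term
```

In a training setup these functions would normally be written with a tensor library so that gradients can flow through them; the list-based version is only meant to make each term of the formulas explicit.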