6DoF Head Pose Estimation through Explicit Bidirectional Interaction with Face Geometry

Sungho Chun, Ju Yong Chang
2024-07-19
Abstract: This study addresses the nuanced challenge of estimating head translations within the context of six-degrees-of-freedom (6DoF) head pose estimation, placing emphasis on this aspect over the more commonly studied head rotations. Identifying a gap in existing methodologies, we recognized the underutilized potential synergy between facial geometry and head translation. To bridge this gap, we propose a novel approach called the head Translation, Rotation, and face Geometry network (TRG), which stands out for its explicit bidirectional interaction structure. This structure has been carefully designed to leverage the complementary relationship between face geometry and head translation, marking a significant advancement in the field of head pose estimation. Our contributions also include the development of a strategy for estimating bounding box correction parameters and a technique for aligning landmarks to image. Both of these innovations demonstrate superior performance in 6DoF head pose estimation tasks. Extensive experiments conducted on ARKitFace and BIWI datasets confirm that the proposed method outperforms current state-of-the-art techniques. Codes are released at https://github.com/asw91666/TRG-Release.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses head translation estimation within six-degrees-of-freedom (6DoF) head pose estimation. The authors point out that most existing research concentrates on head rotation and pays less attention to head translation. Estimating head translation from a single image is also difficult because real-scale face geometry and head translation are interdependent and mutually ambiguous. To address these problems, the authors propose the head **Translation, Rotation, and face Geometry network (TRG)**. Through an explicit bidirectional interaction structure, TRG exploits the complementary relationship between face geometry and head translation to improve the accuracy of 6DoF head pose estimation. The main contributions are:

1. **Explicit bidirectional interaction structure**: TRG is the first to introduce an explicit bidirectional interaction structure between head translation and face geometry. This structure lets TRG reduce the ambiguity of head depth and of face size at the same time.
2. **Bounding box correction parameter estimation strategy**: TRG estimates bounding box correction parameters, a strategy that generalizes stably to out-of-distribution data.
3. **Landmark-to-image alignment strategy**: TRG aligns landmarks to the image, which improves the accuracy of both head translation and head rotation estimates.
4. **Depth-aware landmark prediction architecture**: TRG's depth-aware landmark prediction architecture remains precise on images with strong perspective distortion, such as selfies.
5. **Experimental results**: Extensive experiments on the ARKitFace and BIWI datasets show that TRG outperforms current state-of-the-art methods on 6DoF head pose estimation.

### Formula Representation

Some formulas from the paper are as follows:

- Head translation \(T_t\):

\[ T_{x_t} = 0.2 s_t \left( \frac{\tau_{x,\text{bbox}}}{b} + \tilde{\tau}_{x,\text{face}_t} \right) \]

\[ T_{y_t} = 0.2 s_t \left( \frac{\tau_{y,\text{bbox}}}{b} + \tilde{\tau}_{y,\text{face}_t} \right) \]

\[ T_{z_t} = 0.2 s_t \left( \frac{f}{b} \right) \]

- Image coordinates \(V^{\text{img}}_t\) of the dense landmarks:

\[ V^{\text{img}}_t = \Pi(V_t, R_t, T_t, K) \]

These formulas show how TRG uses bounding box information and correction parameters to estimate head translation, and how it maps dense landmarks to image space through perspective projection.
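The two formulas above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation (see the released repository for that). The summary does not define the symbols, so the following readings are assumptions: \(s_t\) is a predicted scale correction, \(\tau_{x,\text{bbox}}\) and \(\tau_{y,\text{bbox}}\) are the bounding box center offsets from the principal point in pixels, \(b\) is the bounding box size in pixels, \(f\) is the focal length in pixels, \(\tilde{\tau}_{\text{face}_t}\) is a predicted normalized offset, and \(\Pi\) is a standard pinhole projection with a skew-free intrinsic matrix \(K\). The function names `head_translation` and `project` are hypothetical.

```python
from typing import List, Sequence, Tuple


def head_translation(
    s: float,
    tau_bbox: Tuple[float, float],
    tau_face: Tuple[float, float],
    b: float,
    f: float,
) -> Tuple[float, float, float]:
    """Evaluate the summary's translation formulas for one iteration t.

    Assumed meanings (not stated in the summary): s is a scale correction,
    tau_bbox is the bbox center offset from the principal point (pixels),
    tau_face is a normalized offset, b is the bbox size, f the focal length.
    """
    tx = 0.2 * s * (tau_bbox[0] / b + tau_face[0])
    ty = 0.2 * s * (tau_bbox[1] / b + tau_face[1])
    tz = 0.2 * s * (f / b)
    return (tx, ty, tz)


def project(
    vertices: Sequence[Sequence[float]],
    R: Sequence[Sequence[float]],
    T: Sequence[float],
    K: Sequence[Sequence[float]],
) -> List[Tuple[float, float]]:
    """Pinhole projection: rotate and translate each vertex into camera
    space, then divide by depth and apply the intrinsics (no skew assumed)."""
    points = []
    for v in vertices:
        # Camera-space coordinates: R @ v + T
        cam = [sum(R[i][j] * v[j] for j in range(3)) + T[i] for i in range(3)]
        u = K[0][0] * cam[0] / cam[2] + K[0][2]
        w = K[1][1] * cam[1] / cam[2] + K[1][2]
        points.append((u, w))
    return points


# Hypothetical example values, chosen only for illustration.
focal = 500.0
K = [[focal, 0.0, 320.0], [0.0, focal, 240.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # identity rotation

T = head_translation(s=1.0, tau_bbox=(50.0, -20.0), tau_face=(0.0, 0.0),
                     b=100.0, f=focal)
uv = project([(0.0, 0.0, 0.0)], R, T, K)
print(T, uv)
```

Under these assumed readings, the ratio \(T_x / T_z\) equals \(\tau_{x,\text{bbox}} / f\) when the face offset is zero, so the head origin reprojects onto the bounding box center. The face offset term then shifts the head away from that center, and \(s_t\) rescales the depth implied by the bounding box size.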