6DGS: 6D Pose Estimation from a Single Image and a 3D Gaussian Splatting Model

Matteo Bortolon, Theodore Tsesmelis, Stuart James, Fabio Poiesi, Alessio Del Bue
2024-07-22
Abstract: We propose 6DGS to estimate the camera pose of a target RGB image given a 3D Gaussian Splatting (3DGS) model representing the scene. 6DGS avoids the iterative process typical of analysis-by-synthesis methods (e.g. iNeRF), which also require an initialization of the camera pose in order to converge. Instead, our method estimates a 6DoF pose by inverting the 3DGS rendering process. Starting from the object surface, we define a radiant Ellicell that uniformly generates rays departing from each of the ellipsoids that parameterize the 3DGS model. Each Ellicell ray is associated with the rendering parameters of its ellipsoid, which in turn are used to obtain the best bindings between the target image pixels and the cast rays. These pixel-ray bindings are then ranked to select the best-scoring bundle of rays, whose intersection provides the camera center and, in turn, the camera rotation. The proposed solution obviates the necessity of an "a priori" pose for initialization, and it solves 6DoF pose estimation in closed form, without the need for iterations. Moreover, compared to the existing Novel View Synthesis (NVS) baselines for pose estimation, 6DGS can improve the overall average rotational accuracy by 12% and translation accuracy by 22% on real scenes, despite not requiring any initialization pose. At the same time, our method operates near real-time, reaching 15fps on consumer hardware.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem this paper addresses is **6-degree-of-freedom (6DoF) camera pose estimation from a single image and a 3D Gaussian Splatting (3DGS) model**. Specifically, the paper proposes a method named 6DGS that estimates the camera pose of a target RGB image given a 3DGS model of the scene. Its main innovation is that it avoids both the iterative process of analysis-by-synthesis methods (such as iNeRF) and their need for an initial camera pose; instead, it estimates the 6DoF pose directly by inverting the 3DGS rendering process.

### Main Problems and Solutions

1. **Problem**: Existing 6D camera pose estimation methods based on Neural Radiance Fields (NeRF), such as iNeRF, usually need an initial camera pose close to the correct one in order to converge. They also rely on iterative optimization, which makes them computationally inefficient and prone to getting stuck in local minima.
2. **Solutions**:
   - **6DGS Method**: A new ray-generation mechanism, the Ellicell, generates rays uniformly from the surface of each ellipsoid in the 3DGS model. Each ray is associated with the rendering parameters of its ellipsoid and is used to find the best bindings between target-image pixels and the cast rays.
   - **Attention Mechanism**: A learned attention map selects the best-scoring bundle of rays. The intersection of these rays gives the camera center, from which the camera rotation is then estimated.
   - **Closed-Form Solution**: 6DGS solves for the ray intersection with weighted least squares (wLS), so the 6DoF camera pose is estimated in closed form without an initial pose.

### Technical Details

- **Ellicell Generation**: Divide the surface of each ellipsoid into equal-area cells and generate one ray per cell. Each ray points from the center of the ellipsoid to the center of its cell.
- **Ray-Pixel Binding**: Use the pre-trained DINOv2 visual feature extractor to generate image features, and generate ray features with a multi-layer perceptron (MLP) and positional encoding. An attention mechanism matches ray features to image features and selects the rays with the highest scores.
- **Pose Estimation**: Solve for the intersection of the selected rays with weighted least squares to obtain the translation and rotation parameters of the camera.

### Experimental Results

- **Dataset**: The paper reports experiments on two real-world datasets, Tanks & Temples and Mip-NeRF 360°.
- **Performance Comparison**: Compared with existing Novel View Synthesis (NVS) baselines for pose estimation, 6DGS improves average rotational accuracy by 12% and translation accuracy by 22% on real scenes, without requiring an initial pose.
- **Real-Time Performance**: 6DGS reaches 15 frames per second on consumer-grade hardware, which is near real-time.

### Summary

6DGS removes the need for an initial pose and for iterative optimization by combining Ellicell ray generation with attention-based ray selection, and it estimates the 6D camera pose in closed form. This is especially useful in practice when no initial-pose prior is available.
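The closed-form step described above, recovering the camera center as the weighted least-squares intersection of a bundle of rays, can be illustrated with a short NumPy sketch. This is a generic formulation and not the authors' implementation: the function name `intersect_rays_wls`, the argument layout, and the synthetic data are all assumptions made for illustration. In 6DGS the weights would presumably come from the attention scores of the selected rays; here they are random. The sketch minimizes the weighted sum of squared distances from a point to each ray, which reduces to a 3x3 linear system.

```python
import numpy as np


def intersect_rays_wls(origins, directions, weights):
    """Weighted least-squares point closest to a bundle of rays (hypothetical helper).

    origins:    (N, 3) ray starting points (e.g. ellipsoid centers)
    directions: (N, 3) ray directions (need not be normalized)
    weights:    (N,)   non-negative per-ray weights
    """
    origins = np.asarray(origins, dtype=float)
    directions = np.asarray(directions, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # Normalize directions so that I - d d^T is an orthogonal projector.
    d = directions / np.linalg.norm(directions, axis=1, keepdims=True)

    # P_i = I - d_i d_i^T projects onto the plane orthogonal to ray i,
    # so ||P_i (p - o_i)|| is the distance from point p to ray i.
    P = np.eye(3)[None, :, :] - d[:, :, None] * d[:, None, :]

    # Normal equations: (sum_i w_i P_i) p = sum_i w_i P_i o_i
    A = np.einsum("i,ijk->jk", weights, P)
    b = np.einsum("i,ijk,ik->j", weights, P, origins)

    # Raises numpy.linalg.LinAlgError if all rays are parallel (A is singular).
    return np.linalg.solve(A, b)


# Synthetic check: noise-free rays that all pass through a known point.
rng = np.random.default_rng(0)
camera_center = np.array([1.0, -2.0, 3.0])
origins = rng.normal(size=(50, 3))
directions = camera_center - origins       # every ray points at the center
weights = rng.uniform(0.1, 1.0, size=50)   # stand-in for attention scores

estimate = intersect_rays_wls(origins, directions, weights)
print(estimate)
```

Because the system is only 3x3, the solve itself is cheap regardless of how many rays are selected; with noisy rays, the weights control how much each ray pulls the estimate toward itself.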