Abstract: Robots rely on visual relocalization to estimate their pose from camera images when they lose track. One of the challenges in visual relocalization is repetitive structures in the operation environment of the robot. This calls for probabilistic methods that support multiple hypotheses for the robot's pose. We propose such a probabilistic method to predict the posterior distribution of camera poses given an observed image. Our proposed training strategy results in a generative model of camera poses given an image, which can be used to draw samples from the pose posterior distribution. Our method is streamlined and well-founded in theory and outperforms existing methods on localization in the presence of ambiguities.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper aims to solve a key challenge in robot visual relocalization, namely the visual ambiguity problem caused by repetitive structures in the environment. Specifically:

1. **Background of visual relocalization**: Robots rely on visual relocalization to estimate their position and pose in the environment, especially when tracking is lost. However, the presence of repetitive structures in the operating environment (such as stairs, similar chairs or ceiling panels) can cause different camera poses to record the same observations, making it difficult for traditional unimodal methods to accurately estimate the pose.
2. **Limitations of existing methods**:
   - Most existing visual relocalization methods focus on finding the best-matching pose, but these methods perform poorly when there are multiple possible poses.
   - Some probabilistic methods can handle multimodal distributions, but they usually need to know the number of modes in the target distribution in advance, which limits their generality and flexibility.
3. **The solution proposed in this paper**:
   - The authors propose a probabilistic method based on Conditional Variational Autoencoders (CVAE) to predict the posterior distribution \( p(y|x) \) of the camera pose given an image, where \( y\in SE(3) \) represents the camera pose and \( x\in\mathbb{R}^{H\times W\times3} \) represents the image.
   - By learning the mapping from the image to the pose space, this method can generate multiple reasonable pose hypotheses in the presence of ambiguity without the need to pre-specify the number of modes.
4. **Main contributions**:
   - A generative model is proposed that can sample from the posterior distribution to deal with visual ambiguity.
   - A training strategy is designed that uses conditional variational autoencoders to learn the ambiguity space of camera poses in the scene.
   - Through experimental verification, this method is superior to existing methods in dealing with visual ambiguity.

### Formula representation

- The posterior distribution of the camera pose \( y \) is \( p(y|x) \), where \( y\in SE(3) \).
- The input image \( x\in\mathbb{R}^{H\times W\times3} \).
- The latent variable \( z\sim\mathcal{N}(0, I) \) is used to generate pose samples \( y \).

### Summary

By introducing conditional variational autoencoders, this paper provides a novel and theoretically well-founded method that can effectively handle the ambiguity problem in visual relocalization without the need to know the number of modes in advance. This method not only improves robustness in complex environments but also shows superior performance in different scenarios.
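
As a rough illustration of the sampling step described above (draw \( z\sim\mathcal{N}(0, I) \), then decode it together with the image into a pose \( y \)), the following dependency-free Python sketch may help. It is not the authors' implementation: `toy_decoder`, `sample_pose_hypotheses`, the random untrained weights, the 8-dimensional image feature, and the choice to represent a pose in \( SE(3) \) as a translation plus a unit quaternion are all assumptions made for illustration only.

```python
import math
import random


def toy_decoder(inputs, weights):
    """Hypothetical stand-in for a trained decoder: one tanh layer, 7 outputs."""
    return [math.tanh(sum(x * w for x, w in zip(inputs, column)))
            for column in weights]


def sample_pose_hypotheses(image_feature, weights, num_samples, latent_dim, rng):
    """Draw pose hypotheses by decoding latent samples z ~ N(0, I).

    Each hypothesis is a (translation, quaternion) pair; this pose
    parameterization is an assumption, not taken from the paper.
    """
    hypotheses = []
    for _ in range(num_samples):
        # One fresh latent sample per hypothesis.
        z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        # Condition the decoder on the same image feature every time.
        raw = toy_decoder(list(image_feature) + z, weights)
        translation = raw[:3]
        # Normalize the last four outputs into a unit quaternion.
        norm = math.sqrt(sum(q * q for q in raw[3:]))
        quaternion = [q / max(norm, 1e-12) for q in raw[3:]]
        hypotheses.append((translation, quaternion))
    return hypotheses


rng = random.Random(0)
feature_dim, latent_dim = 8, 2
# Placeholder for the encoding of an observed image x.
image_feature = [rng.gauss(0.0, 1.0) for _ in range(feature_dim)]
# Untrained random weights: 7 output units, each over feature + latent inputs.
weights = [[rng.gauss(0.0, 1.0) for _ in range(feature_dim + latent_dim)]
           for _ in range(7)]

hypotheses = sample_pose_hypotheses(image_feature, weights, 5, latent_dim, rng)
for translation, quaternion in hypotheses:
    print([round(t, 3) for t in translation], [round(q, 3) for q in quaternion])
```

Because the number of samples is chosen at inference time, nothing in this interface fixes the number of modes in advance. With a trained decoder, samples for an ambiguous image would be expected to cluster around each plausible pose.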

Conditional Variational Autoencoders for Probabilistic Pose Regression

CVAM-Pose: Conditional Variational Autoencoder for Multi-Object Monocular Pose Estimation

Conditional Variational Autoencoder for Learned Image Reconstruction

A Variational Observation Model of 3D Object for Probabilistic Semantic SLAM

Variational Bayesian Approach to Condition-Invariant Feature Extraction for Visual Place Recognition

View-Invariant, Occlusion-Robust Probabilistic Embedding for Human Pose

Towards Visual Ego-motion Learning in Robots

Probabilistic Visual Place Recognition for Hierarchical Localization

PDQ-Net: Deep probabilistic dual quaternion network for absolute pose regression on SE(3).

Deep Directional Statistics: Pose Estimation with Uncertainty Quantification

Nothing But Geometric Constraints: A Model-Free Method for Articulated Object Pose Estimation

Learning Variational Motion Prior for Video-based Motion Capture

Visual Odometry with Deep Bidirectional Recurrent Neural Networks.

3D Neural Embedding Likelihood: Probabilistic Inverse Graphics for Robust 6D Pose Estimation

Lightweight, Uncertainty-Aware Conformalized Visual Odometry

V-VIPE: Variational View Invariant Pose Embedding

Reloc3r: Large-Scale Training of Relative Camera Pose Regression for Generalizable, Fast, and Accurate Visual Localization

Uncertainty in latent representations of variational autoencoders optimized for visual tasks

Probabilistic Uncertainty Quantification of Prediction Models with Application to Visual Localization

Uncertainty-Aware Visual-Inertial SLAM with Volumetric Occupancy Mapping