Abstract: Multi-scene absolute pose regression addresses the demand for fast and memory-efficient camera pose estimation across various real-world environments. Recently, a transformer-based model has been devised to regress the camera pose directly in multiple scenes. Despite its potential, the transformer encoder is underutilized because its self-attention map collapses, leaving it with low representation capacity. This work highlights the problem and investigates it from a new perspective: distortion of the query-key embedding space. Based on statistical analysis, we reveal that queries and keys are mapped to completely different spaces, while only a few keys are blended into the query region. This leads to the collapse of the self-attention map, as all queries are considered similar to those few keys. Therefore, we propose simple but effective solutions to activate self-attention. Concretely, we present an auxiliary loss that aligns queries and keys, preventing the distortion of the query-key space and encouraging the model to find global relations through self-attention. In addition, fixed sinusoidal positional encoding is adopted instead of the undertrained learnable one to supply appropriate positional clues to the inputs of self-attention. As a result, our approach resolves the aforementioned problem effectively, thus outperforming existing methods in both outdoor and indoor scenes.

What problem does this paper attempt to address?

The paper addresses the poor performance of the self-attention mechanism in Multi-Scene Absolute Pose Regression (MS-APR). Specifically, the authors point out that in existing Transformer-based models, the encoder's self-attention map collapses because the query and key embedding spaces are distorted and the learnable positional encoding is undertrained, which degrades model performance.

### Specific description of the problem

1. **Collapse of the self-attention map**:
   - In existing multi-scene absolute pose regression models (such as MSTransformer), the encoder's self-attention module fails to significantly improve performance and sometimes even degrades it.
   - Statistical analysis shows that queries and keys are mapped to completely different spaces, and only a few keys are blended into the query region. As a result, all queries are regarded as similar to those few keys, the self-attention map collapses, and the encoder's learning capacity is wasted.
2. **Insufficient training of the positional encoding**:
   - The existing learnable positional encoding fails to fully activate the self-attention mechanism during training, making it difficult for the model to capture global relationships between image features.
   - A fixed 2D sinusoidal positional encoding can provide more reliable positional cues, helping the model learn self-relationships stably from the very beginning of training.

### Solutions

To solve these problems, the authors propose the following methods:

1. **Query-Key Alignment Loss (QKA)**:
   - Introduce an auxiliary loss function \( L_{QKA} \) that forces the centroids of the queries and keys to be close, ensuring that they interact in the embedding space.
   - The loss is defined as follows:
     \[ L_{QKA} = \frac{1}{L} \sum_{l = 1}^{L} \frac{1}{H} \sum_{h = 1}^{H} \left\| \bar{q}_l^h - \bar{k}_l^h \right\|^2 \]
     where \( L \) is the number of encoder layers, \( H \) is the number of heads, and \( \bar{q}_l^h \) and \( \bar{k}_l^h \) are the mean vectors of the queries and keys of the \( h \)-th head in the \( l \)-th layer, respectively.
2. **Fixed positional encoding**:
   - Use a fixed 2D sinusoidal positional encoding instead of the undertrained learnable positional encoding, so that the input queries and keys carry reliable positional information and self-relationships can be learned stably.

### Experimental results

Experiments on multiple datasets show that the authors' method improves model performance:

- **Outdoor scenes (Cambridge Landmarks)**: Across four outdoor scenes, the method reduces the model's average position and orientation errors.
- **Indoor scenes (7Scenes)**: Across seven indoor scenes, a similar performance improvement is observed.
- **Quantitative analysis**: Measuring the entropy of the self-attention map (attention entropy) verifies that the method activates the encoder's self-attention mechanism and improves the representation quality of the self-attention map.

In conclusion, the paper analyzes the distortion of the query-key embedding space and the insufficient training of the positional encoding, and proposes a simple and effective method to activate the self-attention mechanism, thereby improving performance on the multi-scene absolute pose regression task.
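The Query-Key Alignment loss and the attention-entropy diagnostic described above can be illustrated with a small, dependency-free sketch. This is not the authors' implementation. The function names (`centroid`, `qka_loss`, `attention_entropy`), the nested-list layout `queries[l][h]`, and the use of Shannon entropy over a single attention row are all assumptions made for illustration. A real model would compute the same quantities on framework tensors.

```python
import math


def centroid(vectors):
    """Mean vector of a non-empty list of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / n for d in range(dim)]


def qka_loss(queries, keys):
    """Squared distance between query and key centroids, averaged over
    heads and then over layers, following the L_QKA formula.

    Assumed layout (hypothetical): queries[l][h] and keys[l][h] are lists
    of token vectors for head h of encoder layer l.
    """
    num_layers = len(queries)
    total = 0.0
    for q_layer, k_layer in zip(queries, keys):
        num_heads = len(q_layer)
        layer_sum = 0.0
        for q_head, k_head in zip(q_layer, k_layer):
            q_bar = centroid(q_head)
            k_bar = centroid(k_head)
            layer_sum += sum((a - b) ** 2 for a, b in zip(q_bar, k_bar))
        total += layer_sum / num_heads
    return total / num_layers


def attention_entropy(attn_row):
    """Shannon entropy (in nats) of one row of an attention map.

    Assumption: the row is already softmax-normalized and sums to 1.
    """
    return -sum(p * math.log(p) for p in attn_row if p > 0.0)


# Toy example: one layer, one head, two tokens, two dimensions.
toy_queries = [[[[1.0, 0.0], [3.0, 0.0]]]]  # query centroid is (2, 0)
toy_keys = [[[[0.0, 0.0], [0.0, 2.0]]]]     # key centroid is (0, 1)
misaligned = qka_loss(toy_queries, toy_keys)   # (2 - 0)^2 + (0 - 1)^2
aligned = qka_loss(toy_queries, toy_queries)   # identical centroids
```

In this toy case the loss is positive when the two centroids are apart and zero when they coincide. As for the diagnostic, a row that puts all its weight on one key has entropy 0, while a uniform row over `n` keys has entropy `log(n)`, so higher entropy means attention spread over more keys.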

Activating Self-Attention for Multi-Scene Absolute Pose Regression
