Activating Self-Attention for Multi-Scene Absolute Pose Regression

Miso Lee, Jihwan Kim, Jae-Pil Heo
2024-11-18
Abstract: Multi-scene absolute pose regression addresses the demand for fast and memory-efficient camera pose estimation across various real-world environments. Recently, a transformer-based model has been devised to regress the camera pose directly across multiple scenes. Despite its potential, the transformer encoders are underutilized because the self-attention map collapses, leaving it with low representation capacity. This work highlights the problem and investigates it from a new perspective: distortion of the query-key embedding space. Based on statistical analysis, we reveal that queries and keys are mapped into completely different spaces, while only a few keys are blended into the query region. This leads to the collapse of the self-attention map, as all queries are considered similar to those few keys. Therefore, we propose simple but effective solutions to activate self-attention. Concretely, we present an auxiliary loss that aligns queries and keys, preventing the distortion of the query-key space and encouraging the model to find global relations by self-attention. In addition, a fixed sinusoidal positional encoding is adopted instead of an undertrained learnable one to reflect appropriate positional clues in the inputs of self-attention. As a result, our approach resolves the aforementioned problem effectively, thus outperforming existing methods in both outdoor and indoor scenes.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the poor performance of the self-attention mechanism in multi-scene absolute pose regression (MS-APR). Specifically, the authors point out that in existing transformer-based models, the encoder's self-attention map collapses because the query and key embedding spaces are distorted and the learnable positional encoding is undertrained, which hurts model performance.

### Specific description of the problem

1. **Collapse of the self-attention map**:
   - In existing multi-scene absolute pose regression models (such as MSTransformer), the encoder's self-attention module fails to significantly improve performance and sometimes even degrades it.
   - Statistical analysis shows that queries and keys are mapped to completely different spaces, and only a few keys are blended into the query region, so the self-attention map collapses. All queries are regarded as similar to those few keys, which wastes the encoder's learning capacity.
2. **Insufficient training of positional encoding**:
   - The existing learnable positional encoding is undertrained and fails to fully activate the self-attention mechanism during training, making it difficult for the model to capture global relationships between image features.
   - A fixed 2D sinusoidal positional encoding can provide more reliable positional cues, helping the model learn self-relationships stably from the very beginning.

### Solutions

To solve these problems, the authors propose the following methods:

1. **Query-Key Alignment Loss (QKA)**:
   - Introduce an auxiliary loss function \( L_{QKA} \) that forces the centroids of the queries and keys to be close, ensuring that they interact in the embedding space.
   - The formula is defined as follows:
     \[ L_{QKA} = \frac{1}{L} \sum_{l=1}^{L} \frac{1}{H} \sum_{h=1}^{H} \| \bar{q}_l^h - \bar{k}_l^h \|^2 \]
     where \( L \) is the number of encoder layers, \( H \) is the number of heads, and \( \bar{q}_l^h \) and \( \bar{k}_l^h \) are the mean vectors of the queries and keys of the \( h \)-th head in the \( l \)-th layer, respectively.
2. **Fixed positional encoding**:
   - Use a fixed 2D sinusoidal positional encoding instead of the undertrained learnable positional encoding, so that the input queries and keys carry reliable positional information and self-relationships can be learned stably.

### Experimental results

The authors report that, across multiple datasets, the proposed method significantly improves model performance:

- **Outdoor scenes (Cambridge Landmarks)**: In four outdoor scenes, the average position and orientation errors of the model are significantly reduced with the proposed method.
- **Indoor scenes (7Scenes)**: In seven indoor scenes, a similar performance improvement is observed.
- **Quantitative analysis**: Measurements of the entropy of the self-attention map (attention entropy) verify that the method activates the encoder's self-attention mechanism and improves the representation quality of the self-attention map.

In conclusion, this paper analyzes the problems of query-key embedding space distortion and insufficient positional encoding training, and proposes a simple and effective method to activate the self-attention mechanism, thereby significantly improving performance on the multi-scene absolute pose regression task.
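To make the Query-Key Alignment Loss formula above concrete, the following is a minimal pure-Python sketch that evaluates it term by term. The nested-list layout (`queries[l][h]` is a list of token vectors for layer `l`, head `h`) and the function names `centroid` and `qka_loss` are assumptions made for illustration, not the authors' implementation; a real model would compute this on framework tensors so the loss can be backpropagated.

```python
def centroid(vectors):
    """Mean vector of a non-empty list of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / n for d in range(dim)]


def qka_loss(queries, keys):
    """Average over layers and heads of ||mean(q) - mean(k)||^2.

    queries[l][h] and keys[l][h] are lists of token vectors
    (hypothetical layout chosen for this sketch).
    """
    num_layers = len(queries)
    total = 0.0
    for q_layer, k_layer in zip(queries, keys):
        num_heads = len(q_layer)
        layer_sum = 0.0
        for q_head, k_head in zip(q_layer, k_layer):
            q_bar = centroid(q_head)
            k_bar = centroid(k_head)
            # Squared Euclidean distance between the two centroids.
            layer_sum += sum((a - b) ** 2 for a, b in zip(q_bar, k_bar))
        total += layer_sum / num_heads
    return total / num_layers


# Toy input: 1 layer, 2 heads, 2 tokens per head, 2 dimensions.
queries = [[[[1.0, 0.0], [3.0, 0.0]], [[0.0, 0.0], [0.0, 2.0]]]]
keys = [[[[0.0, 0.0], [0.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]]]
loss = qka_loss(queries, keys)
print(loss)  # 2.0: head 1 contributes 4, head 2 contributes 0, averaged over 2 heads
```

The loss only compares centroids, so it is zero whenever the query and key means coincide, regardless of how the individual tokens are spread around them.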
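The fixed 2D sinusoidal positional encoding mentioned above can be illustrated with one common formulation, in which half of the channels encode the row index and the other half encode the column index. This is a sketch under assumptions: the temperature of 10000, the sin/cos interleaving, and the names `sincos_1d` and `sincos_2d` are illustrative choices, and the summary does not state which exact variant the paper uses.

```python
import math


def sincos_1d(pos, dim, temperature=10000.0):
    """Standard 1D sinusoidal encoding of a scalar position into `dim` channels (dim even)."""
    enc = []
    for i in range(dim // 2):
        freq = 1.0 / (temperature ** (2 * i / dim))
        enc.append(math.sin(pos * freq))
        enc.append(math.cos(pos * freq))
    return enc


def sincos_2d(height, width, dim):
    """pe[y][x] is the concatenation of a row encoding and a column encoding.

    Each axis uses dim / 2 channels (an assumed split for this sketch).
    """
    if dim % 4 != 0:
        raise ValueError("dim must be divisible by 4")
    half = dim // 2
    return [
        [sincos_1d(y, half) + sincos_1d(x, half) for x in range(width)]
        for y in range(height)
    ]


pe = sincos_2d(height=3, width=4, dim=8)
print(pe[0][0])  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```

Because the encoding has no trainable parameters, every position carries a distinct, well-defined signal from the first training step, which is the property the summary attributes to the fixed encoding.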
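The attention entropy used in the quantitative analysis can be sketched as the mean Shannon entropy of the rows of an attention map. The function name `attention_entropy` and the two toy maps are assumptions for illustration, and the paper may aggregate over heads and layers differently. The sketch assumes each row is a probability distribution over keys, as produced by softmax.

```python
import math


def attention_entropy(attn):
    """Mean Shannon entropy (in nats) over the rows of a row-stochastic attention map."""
    row_entropies = []
    for row in attn:
        # Zero-probability entries contribute nothing to the entropy.
        row_entropies.append(-sum(p * math.log(p) for p in row if p > 0.0))
    return sum(row_entropies) / len(row_entropies)


# Every query attends equally to all 4 keys.
uniform = [[0.25] * 4 for _ in range(4)]
# Every query puts all of its attention on the same single key.
collapsed = [[1.0, 0.0, 0.0, 0.0] for _ in range(4)]

h_uniform = attention_entropy(uniform)
h_collapsed = attention_entropy(collapsed)
```

For a map with N keys the value ranges from 0, when each query attends to a single key, to log N, when attention is uniform. A map in which all queries concentrate on the same few keys therefore sits near the low end of this range.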