Abstract: Visual place recognition (VPR) requires robust image descriptors that can cope with differences in camera viewpoint and drastic changes in the external environment. Multiscale features improve the robustness of image descriptors; however, existing methods neither exploit the multiscale features generated during feature extraction nor address feature redundancy when fusing multiscale information to enhance the descriptors. We propose a novel encoding strategy for VPR: convolutional multilayer perceptron orthogonal fusion of multiscale features (ConvMLP-OFMS). A ConvMLP is used to obtain robust and generalized global image descriptors, and the multiscale features generated during feature extraction are used to enhance these global descriptors against changes in environment and viewpoint. Additionally, an attention mechanism is used to eliminate noise and redundant information. Unlike traditional methods that fuse features by tensor concatenation, we introduce matrix orthogonal decomposition to eliminate redundant information. Experiments demonstrate that the proposed architecture outperforms NetVLAD, CosPlace, ConvAP, and other methods. On the Pittsburgh and MSLS datasets, which contain significant viewpoint and illumination variations, our method achieves 92.5% and 86.5% Recall@1, respectively. It also performs well (80.6% and 43.2%, respectively) on the SPED and NordLand datasets, which have more extreme illumination and appearance variations.
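The Recall@1 figures quoted in the abstract can be read as the fraction of query images whose top-ranked database image is a correct match. The following is a minimal sketch of that metric, not the paper's evaluation code. The function name `recall_at_k`, the cosine-similarity ranking, and the `positives` format (for each query, a list of database indices counted as the same place, typically chosen by a geographic distance threshold) are assumptions made for illustration.

```python
import numpy as np

def recall_at_k(query_desc, db_desc, positives, k=1):
    """Fraction of queries with at least one true match among the top-k results.

    query_desc: (Q, D) array of query descriptors.
    db_desc:    (N, D) array of database descriptors.
    positives:  length-Q list; positives[i] holds the database indices that
                count as correct for query i (hypothetical ground-truth format).
    """
    q = np.asarray(query_desc, dtype=np.float64)
    d = np.asarray(db_desc, dtype=np.float64)
    # L2-normalise so that the dot product equals cosine similarity.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    sims = q @ d.T                              # (Q, N) similarity matrix
    topk = np.argsort(-sims, axis=1)[:, :k]     # indices of the k best matches
    hits = [bool(set(row.tolist()) & set(pos)) for row, pos in zip(topk, positives)]
    return float(np.mean(hits))

# Toy example: two queries against three database descriptors.
db = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.9, 0.1], [0.1, 0.9]]
truth = [[0], [2]]
print(recall_at_k(queries, db, truth, k=1))
print(recall_at_k(queries, db, truth, k=2))
```

In the toy example the second query's best match is database item 1 rather than its true match 2, so it only counts as a hit once `k` is raised to 2.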

What problem does this paper attempt to address?

The paper addresses a key challenge in Visual Place Recognition (VPR): how to obtain robust and generalized image descriptors under different camera viewpoints and significant changes in the external environment. The research focuses on the following aspects:

1. **Enhancing the robustness of image descriptors using multi-scale features**: Existing methods do not fully utilize the multi-scale features generated during feature extraction, and they do not address feature redundancy when fusing multi-scale information to enhance image descriptors.
2. **Proposing a novel encoding strategy**: Convolutional Multi-Layer Perceptron Orthogonal Fusion of Multi-Scale features (ConvMLP-OFMS) targets both issues. It obtains robust and generalized global image descriptors through a Convolutional Multi-Layer Perceptron (ConvMLP) and enhances them with the multi-scale features generated during feature extraction, so that they cope better with environmental and viewpoint changes.
3. **Introducing an attention mechanism to eliminate noise and redundant information**: An attention mechanism removes noise and redundant information from the multi-scale features, further improving descriptor quality.
4. **Adopting matrix orthogonal decomposition to eliminate redundant information**: Instead of the traditional approach of fusing features by tensor concatenation, the paper uses matrix orthogonal decomposition to remove redundant information during fusion.
5. **Experimental validation**: Experiments were conducted on the Pittsburgh, MSLS, SPED, and NordLand benchmark datasets. The proposed architecture outperforms existing methods such as NetVLAD, CosPlace, and ConvAP, reaching 92.5% and 86.5% Recall@1 on Pittsburgh and MSLS, which contain significant viewpoint and illumination changes.

In summary, this paper proposes a novel VPR architecture that aims to improve the robustness and generalization ability of image descriptors by efficiently utilizing multi-scale features and combining an attention mechanism with matrix orthogonal decomposition.
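The contrast between plain concatenation and orthogonal fusion can be illustrated with a small sketch. The summary does not give the paper's exact formulation, so the code below assumes one common form of orthogonal fusion: each local feature is split into a component parallel to the global descriptor and a component orthogonal to it, and only the orthogonal part is kept before fusing. The function name `orthogonal_fusion`, the mean pooling, and the output layout are hypothetical choices, not the authors' implementation.

```python
import numpy as np

def orthogonal_fusion(local_feats, global_desc):
    """Fuse local features with a global descriptor, keeping only the part of
    each local feature that is orthogonal to the global descriptor.

    local_feats: (N, C) array, one C-dimensional feature per spatial location.
    global_desc: (C,) array.
    Returns an L2-normalised (2*C,) vector (illustrative layout, an assumption).
    """
    f = np.asarray(local_feats, dtype=np.float64)
    g = np.asarray(global_desc, dtype=np.float64)
    g_norm_sq = float(g @ g)
    if g_norm_sq == 0.0:
        raise ValueError("global descriptor must be non-zero")
    # Projection of every local feature onto g: (f . g / |g|^2) * g
    coeffs = (f @ g) / g_norm_sq        # (N,)
    proj = np.outer(coeffs, g)          # (N, C) component already explained by g
    orth = f - proj                     # (N, C) component that adds new information
    pooled = orth.mean(axis=0)          # (C,) assumed pooling over locations
    fused = np.concatenate([g, pooled])
    return fused / np.linalg.norm(fused)

# Toy example with a 2-dimensional global descriptor and two local features.
fused = orthogonal_fusion([[1.0, 2.0], [3.0, 4.0]], [1.0, 0.0])
print(fused)
```

Because the parallel component is subtracted, the second half of the fused vector has zero dot product with the global descriptor, which is the sense in which this style of fusion avoids duplicating information that simple concatenation would carry twice.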

Convolutional MLP orthogonal fusion of multiscale features for visual place recognition
