Abstract: Visual place recognition is a challenging problem in robotics and autonomous systems because scenes undergo appearance and viewpoint changes over time. Existing state-of-the-art methods rely heavily on architectures based on convolutional neural networks (CNNs). However, CNNs cannot effectively model the spatial structure of an image because of their inherent locality. To address this issue, this paper proposes a novel Transformer-based place recognition method that combines local details, spatial context, and semantic information for image feature embedding. First, to overcome the inherent locality of CNNs, a hybrid CNN-Transformer feature extraction network is introduced. The network uses a CNN-based feature pyramid to capture fine visual detail, and a vision Transformer to model image contextual information and dynamically aggregate task-related features. Specifically, the multi-level output tokens from the Transformer are fed into a single Transformer encoder block to fuse multi-scale spatial information. Second, to acquire multi-scale semantic information, a global semantic NetVLAD aggregation strategy is constructed. This strategy employs a semantic-enhanced NetVLAD, which imposes prior knowledge on the terms of the Vector of Locally Aggregated Descriptors (VLAD), to aggregate the multi-level token maps, and then concatenates the multi-level semantic features globally. Finally, because the fixed margin of the triplet loss leads to suboptimal convergence, an adaptive triplet loss with a dynamic margin is proposed. Extensive experiments on public datasets show that the learned features are robust to appearance and viewpoint changes and achieve promising performance compared with state-of-the-art methods.
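The abstract does not give the form of the adaptive triplet loss, so the following is only an illustrative sketch of the general idea. The function `triplet_loss` is the standard fixed-margin formulation, max(0, d(a, p) - d(a, n) + m). The function `dynamic_margin`, its parameters `m_min` and `m_max`, and the logistic rule it uses are hypothetical choices made for this example, not the paper's formula.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin):
    """Standard triplet loss with a fixed margin:
    max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    d_ap = euclidean(anchor, positive)
    d_an = euclidean(anchor, negative)
    return max(0.0, d_ap - d_an + margin)


def dynamic_margin(anchor, positive, negative, m_min=0.1, m_max=0.5):
    """Hypothetical per-triplet margin (an assumption for illustration,
    NOT the rule used in the paper).

    The gap d(a, n) - d(a, p) is squashed into (0, 1) with a logistic
    function, so the margin always stays between m_min and m_max and
    grows as the negative moves further away than the positive.
    """
    gap = euclidean(anchor, negative) - euclidean(anchor, positive)
    weight = 1.0 / (1.0 + math.exp(-gap))
    return m_min + (m_max - m_min) * weight


def adaptive_triplet_loss(anchor, positive, negative):
    """Triplet loss whose margin is recomputed for every triplet."""
    margin = dynamic_margin(anchor, positive, negative)
    return triplet_loss(anchor, positive, negative, margin)


if __name__ == "__main__":
    a, p = [0.0, 0.0], [1.0, 0.0]
    hard_negative = [1.0, 0.0]   # as close to the anchor as the positive
    easy_negative = [3.0, 0.0]   # much further away than the positive
    print(adaptive_triplet_loss(a, p, hard_negative))
    print(adaptive_triplet_loss(a, p, easy_negative))
```

With a fixed margin, every triplet is pushed toward the same separation regardless of difficulty. A per-triplet margin lets that target vary, which is the property the abstract attributes to the proposed loss. The direction and shape of the variation here are arbitrary.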

Regressing Transformers for Data-efficient Visual Place Recognition

Explicit Feature Disentanglement for Visual Place Recognition Across Appearance Changes

Leveraging Local Planar Motion Property for Robust Visual Matching and Localization

Data-efficient Large Scale Place Recognition with Graded Similarity Supervision

ETR: An Efficient Transformer for Re-ranking in Visual Place Recognition

Hybrid CNN-Transformer Features for Visual Place Recognition

PlaceFormer: Transformer-based Visual Place Recognition using Multi-Scale Patch Selection and Fusion

$R^{2}$Former: Unified $R$etrieval and $R$eranking Transformer for Place Recognition

EigenPlaces: Training Viewpoint Robust Models for Visual Place Recognition

OverlapTransformer: An Efficient and Rotation-Invariant Transformer Network for LiDAR-Based Place Recognition

Place recognition in gardens by learning visual representations: data set and benchmark analysis

OverlapTransformer: An Efficient and Yaw-Angle-Invariant Transformer Network for LiDAR-Based Place Recognition

Localizing Discriminative Visual Landmarks for Place Recognition

LF-ViT: Reducing Spatial Redundancy in Vision Transformer for Efficient Image Recognition

CVTNet: A Cross-View Transformer Network for Place Recognition Using LiDAR Data

TransVPR: Transformer-based Place Recognition with Multi-Level Attention Aggregation

TReR: A Lightweight Transformer Re-Ranking Approach for 3D LiDAR Place Recognition

RegFormer: An Efficient Projection-Aware Transformer Network for Large-Scale Point Cloud Registration

SeqOT: A Spatial-Temporal Transformer Network for Place Recognition Using Sequential LiDAR Data

Learning robust representation and sequence constraint for retrieval-based long-term visual place recognition