Abstract: Visual place recognition (VPR) aims to determine the general geographical location of a query image by retrieving visually similar images from a large geo-tagged database. To obtain a global representation for each place image, most approaches aggregate deep features extracted from a backbone using currently prominent architectures (e.g., CNNs, MLPs, pooling layers, and transformer encoders), and give little attention to the transformer decoder. However, we argue that the decoder's strong capability in capturing contextual dependencies and generating accurate features holds considerable potential for the VPR task. To this end, we propose an Efficient Decoder Transformer (EDTformer) for feature aggregation, which consists of several stacked simplified decoder blocks followed by two linear layers and directly generates robust and discriminative global representations for VPR. Specifically, we formulate the deep features as the keys and values, and a set of independent learnable parameters as the queries. EDTformer can thus fully utilize the contextual information within the deep features, then gradually decode and aggregate the effective features into the learnable queries to form the final global representation. Moreover, to provide powerful deep features for EDTformer and further improve robustness, we use the foundation model DINOv2 as the backbone and propose a Low-Rank Parallel Adaptation (LoPA) method to enhance it, which progressively refines the intermediate features of the backbone in a memory- and parameter-efficient way. As a result, our method outperforms not only single-stage VPR methods on multiple benchmark datasets, but also two-stage VPR methods that add a re-ranking step at considerable cost. Code will be available at https://github.com/Tong-Jin01/EDTformer.
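The aggregation idea described in the abstract (learnable queries attending to backbone features that serve as keys and values, followed by two linear layers) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the class name `DecoderAggregator`, the single-head attention, the block layout (cross-attention plus a small MLP, with no self-attention), all dimensions, and the way the two linear layers are applied are assumptions chosen only to keep the example short. The weights are random, so the output is meaningful only in its shape and flow.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

class DecoderAggregator:
    """Hypothetical sketch: learnable queries cross-attend to deep features."""

    def __init__(self, dim=32, num_queries=4, num_blocks=2,
                 reduced_dim=8, out_dim=16, seed=0):
        rng = np.random.default_rng(seed)
        # Scaled Gaussian init; shape[0] is the fan-in of each matrix.
        init = lambda *shape: rng.standard_normal(shape) / np.sqrt(shape[0])
        self.dim = dim
        # Independent learnable parameters used as queries (random here).
        self.queries = rng.standard_normal((num_queries, dim))
        # Assumed block layout: cross-attention followed by a 2-layer MLP.
        self.blocks = [
            {"wq": init(dim, dim), "wk": init(dim, dim), "wv": init(dim, dim),
             "w1": init(dim, 2 * dim), "w2": init(2 * dim, dim)}
            for _ in range(num_blocks)
        ]
        # Two linear layers (assumed usage): reduce channels, then map the
        # flattened queries to the final descriptor.
        self.w_reduce = init(dim, reduced_dim)
        self.w_out = init(num_queries * reduced_dim, out_dim)

    def __call__(self, features):
        """features: (num_tokens, dim) array of backbone patch features."""
        q = self.queries
        for b in self.blocks:
            # Cross-attention: queries from q, keys/values from features.
            scores = (q @ b["wq"]) @ (features @ b["wk"]).T / np.sqrt(self.dim)
            attended = softmax(scores, axis=-1) @ (features @ b["wv"])
            q = layer_norm(q + attended)                 # residual + norm
            hidden = np.maximum(0.0, q @ b["w1"])        # ReLU MLP
            q = layer_norm(q + hidden @ b["w2"])
        flat = (q @ self.w_reduce).reshape(-1)           # first linear layer
        desc = flat @ self.w_out                         # second linear layer
        return desc / (np.linalg.norm(desc) + 1e-12)     # L2-normalize

# Usage: 50 patch tokens of dimension 32 become one 16-d global descriptor.
aggregator = DecoderAggregator()
patch_features = np.random.default_rng(1).standard_normal((50, 32))
descriptor = aggregator(patch_features)
print(descriptor.shape)
```

Because the number of queries is fixed, the descriptor length does not depend on how many patch tokens the backbone produces, which is one reason query-based aggregation suits retrieval over a database of global descriptors.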

LSDNet: A Lightweight Self-Attentional Distillation Network for Visual Place Recognition

Ghost-dil-NetVLAD: A Lightweight Neural Network for Visual Place Recognition

Spatial Likelihood Voting with Self-Knowledge Distillation for Weakly Supervised Object Detection

RINet: Efficient 3D Lidar-Based Place Recognition Using Rotation Invariant Neural Network

LWRN: Light-Weight Residual Network for Edge Detection

MultiRes-NetVLAD: Augmenting Place Recognition Training with Low-Resolution Imagery

Design Space Exploration of Low-Bit Quantized Neural Networks for Visual Place Recognition

MS-NetVLAD: Multi-Scale NetVLAD for Visual Place Recognition

LWR-Net: Robust and Lightweight Place Recognition Network for Noisy and Low-Density Point Clouds

DistilVPR: Cross-Modal Knowledge Distillation for Visual Place Recognition

Simple and Effective Visual Place Recognition Via Spiking Neural Networks and Deep Information

A Training-Free, Lightweight Global Image Descriptor for Long-Term Visual Place Recognition Toward Autonomous Vehicles

DMPCANet: A Low Dimensional Aggregation Network for Visual Place Recognition

LoCS-Net: Localizing Convolutional Spiking Neural Network for Fast Visual Place Recognition

Spatial Pyramid-Enhanced NetVLAD With Weighted Triplet Loss for Place Recognition

LVLM-empowered Multi-modal Representation Learning for Visual Place Recognition

LRSDet: Lightweight Remote Sensing Target Detection Network with Local-Global Information Awareness and Rapid Sample Assignment

Towards Seamless Adaptation of Pre-trained Models for Visual Place Recognition

LLR-MVSNet: a lightweight network for low-texture scene reconstruction

Event-VPR: End-to-End Weakly Supervised Deep Network Architecture for Visual Place Recognition using Event-based Vision Sensor

EDTformer: An Efficient Decoder Transformer for Visual Place Recognition