Abstract: Visual place recognition (VPR) remains challenging due to significant viewpoint changes and appearance variations. Mainstream works tackle these challenges by developing various feature aggregation methods that transform deep features into robust and compact global representations. Unfortunately, these methods still fail to achieve satisfactory results under challenging conditions. We start from a new perspective and attempt to build discriminative global representations by fusing image data with text descriptions of the visual scene. The motivation is twofold: (1) current Large Vision-Language Models (LVLMs) demonstrate extraordinary emergent capability in visual instruction following, and thus provide an efficient and flexible way to generate text descriptions of images; (2) the text descriptions, which provide high-level scene understanding, show strong robustness against environment variations. Although promising, leveraging LVLMs to build multi-modal VPR solutions remains challenging because the two modalities must be fused efficiently. Furthermore, LVLMs will inevitably produce some inaccurate descriptions, which makes fusion even harder. To tackle these challenges, we propose a novel multi-modal VPR solution. It first adapts pre-trained visual and language foundation models to VPR for extracting image and text features, which are then fed into a feature combiner so that they enhance each other. As the main component, the feature combiner first applies a token-wise attention block to adaptively recalibrate text tokens according to their relevance to the image data, and then uses an efficient cross-attention fusion module to propagate information across the modalities. The enhanced multi-modal features are compressed into a feature descriptor for retrieval. Experimental results show that our method outperforms state-of-the-art methods by a large margin with a significantly smaller image descriptor dimension.
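The two-stage feature combiner described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract gives no equations, so the gating rule (a sigmoid of each text token's scaled dot product with the mean image feature), the unparameterized cross-attention, the mean pooling, and all function names (`recalibrate_text_tokens`, `cross_attend`, `fuse`) are assumptions chosen only to show the data flow. A real model would use learned projections and a learned compression step.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def recalibrate_text_tokens(text_tokens, image_tokens):
    # ASSUMPTION: relevance is a sigmoid of the scaled dot product between
    # each text token and the mean image feature. The paper's token-wise
    # attention block may be defined differently.
    d = text_tokens.shape[-1]
    image_summary = image_tokens.mean(axis=0)            # (d,)
    scores = text_tokens @ image_summary / np.sqrt(d)    # (n_text,)
    gates = 1.0 / (1.0 + np.exp(-scores))                # each in (0, 1)
    return text_tokens * gates[:, None], gates

def cross_attend(queries, context):
    # ASSUMPTION: plain scaled dot-product attention with a residual
    # connection and no learned query/key/value projections.
    d = queries.shape[-1]
    attn = softmax(queries @ context.T / np.sqrt(d), axis=-1)
    return queries + attn @ context

def fuse(image_tokens, text_tokens):
    # Stage 1: down-weight text tokens that look irrelevant to the image.
    text_recal, gates = recalibrate_text_tokens(text_tokens, image_tokens)
    # Stage 2: propagate information in both directions across modalities.
    image_enh = cross_attend(image_tokens, text_recal)
    text_enh = cross_attend(text_recal, image_tokens)
    # ASSUMPTION: mean pooling and concatenation stand in for the paper's
    # compression into a compact descriptor.
    pooled = np.concatenate([image_enh.mean(axis=0), text_enh.mean(axis=0)])
    return pooled / np.linalg.norm(pooled), gates

# Toy inputs: 6 image tokens and 4 text tokens, each of dimension 8.
rng = np.random.default_rng(0)
image_tokens = rng.normal(size=(6, 8))
text_tokens = rng.normal(size=(4, 8))
descriptor, gates = fuse(image_tokens, text_tokens)
print(descriptor.shape, gates.shape)
```

In this sketch, retrieval would compare the L2-normalized descriptors of a query and the database images by dot product. The gating step is what lets an inaccurate LVLM description contribute less to the fused descriptor.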

MS-NetVLAD: Multi-Scale NetVLAD for Visual Place Recognition

MultiRes-NetVLAD: Augmenting Place Recognition Training with Low-Resolution Imagery

Ghost-dil-NetVLAD: A Lightweight Neural Network for Visual Place Recognition

NetVLAD: CNN Architecture for Weakly Supervised Place Recognition

AVFP-MVX: Multimodal VoxelNet with Attention Mechanism and Voxel Feature Pyramid

Convolutional MLP orthogonal fusion of multiscale features for visual place recognition

Contextual Patch-NetVLAD: Context-Aware Patch Feature Descriptor and Patch Matching Mechanism for Visual Place Recognition

CVTNet: A Cross-View Transformer Network for Place Recognition Using LiDAR Data

SeqNetVLAD vs PointNetVLAD: Image Sequence vs 3D Point Clouds for Day-Night Place Recognition

VLAD-BuFF: Burst-aware Fast Feature Aggregation for Visual Place Recognition

Spatial Pyramid-Enhanced NetVLAD With Weighted Triplet Loss for Place Recognition

LLR-MVSNet: a lightweight network for low-texture scene reconstruction

PointNetVLAD: Deep Point Cloud Based Retrieval for Large-Scale Place Recognition

Image Retrieval via Gated Multiscale NetVLAD for Social Media Applications

Towards Seamless Adaptation of Pre-trained Models for Visual Place Recognition

Variational Structured Attention Networks for Deep Visual Representation Learning

LoCS-Net: Localizing Convolutional Spiking Neural Network for Fast Visual Place Recognition

SVT-Net: Super Light-Weight Sparse Voxel Transformer for Large Scale Place Recognition

N2MVSNet: Non-Local Neighbors Aware Multi-View Stereo Network

LVP-net: A deep network of learning visual pathway for edge detection

LVLM-empowered Multi-modal Representation Learning for Visual Place Recognition