Abstract: Visual place recognition (VPR) remains challenging due to significant viewpoint changes and appearance variations. Mainstream works tackle these challenges by developing various feature aggregation methods that transform deep features into robust and compact global representations. Unfortunately, these methods still fall short of satisfactory results under challenging conditions. We start from a new perspective and attempt to build discriminative global representations by fusing image data with text descriptions of the visual scene. The motivation is twofold: (1) current Large Vision-Language Models (LVLMs) demonstrate extraordinary emergent capability in visual instruction following, and thus provide an efficient and flexible way to generate text descriptions of images; (2) the text descriptions, which provide high-level scene understanding, show strong robustness against environment variations. Although promising, leveraging LVLMs to build multi-modal VPR solutions remains challenging because the modalities must be fused efficiently. Furthermore, LVLMs inevitably produce some inaccurate descriptions, which makes fusion even harder. To tackle these challenges, we propose a novel multi-modal VPR solution. It first adapts pre-trained visual and language foundation models to VPR for extracting image and text features, which are then fed into a feature combiner so that they enhance each other. As the main component, the feature combiner first applies a token-wise attention block to adaptively recalibrate text tokens according to their relevance to the image data, and then uses an efficient cross-attention fusion module to propagate information across the modalities. The enhanced multi-modal features are compressed into a feature descriptor for retrieval. Experimental results show that our method outperforms state-of-the-art methods by a large margin with a significantly smaller image descriptor dimension.
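
The feature combiner described in the abstract can be illustrated with a rough NumPy sketch. This is not the authors' implementation: the abstract does not specify the gating function, the attention layout, or the compression step, so everything below is an assumption made for illustration. In particular, the sigmoid gate over cosine similarity, the single-head attention without learned query/key/value projections, the mean pooling, and the names `recalibrate_text_tokens`, `cross_attention`, `fuse_to_descriptor`, and `proj` are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def recalibrate_text_tokens(img_tokens, txt_tokens):
    # Assumed form of the token-wise attention block: gate each text token
    # by its cosine similarity to the mean-pooled image feature.
    img_global = img_tokens.mean(axis=0)                       # (d,)
    denom = np.linalg.norm(txt_tokens, axis=1) * np.linalg.norm(img_global) + 1e-8
    sim = (txt_tokens @ img_global) / denom                    # (N_t,)
    gate = 1.0 / (1.0 + np.exp(-sim))                          # sigmoid, in (0, 1)
    return txt_tokens * gate[:, None], gate

def cross_attention(queries, context):
    # Single-head scaled dot-product attention with a residual connection.
    # A trained model would use learned Q/K/V projections; omitted here.
    d = queries.shape[-1]
    attn = softmax(queries @ context.T / np.sqrt(d), axis=-1)  # (N_q, N_c)
    return queries + attn @ context

def fuse_to_descriptor(img_tokens, txt_tokens, proj):
    # Recalibrate text, exchange information in both directions,
    # then pool and project to a compact L2-normalized descriptor.
    txt_tokens, _ = recalibrate_text_tokens(img_tokens, txt_tokens)
    img_enh = cross_attention(img_tokens, txt_tokens)
    txt_enh = cross_attention(txt_tokens, img_tokens)
    pooled = np.concatenate([img_enh.mean(axis=0), txt_enh.mean(axis=0)])
    desc = pooled @ proj
    return desc / (np.linalg.norm(desc) + 1e-8)

# Random stand-ins for backbone features and for a learned projection matrix.
rng = np.random.default_rng(0)
d, out_dim = 16, 8
img_tokens = rng.normal(size=(10, d))
txt_tokens = rng.normal(size=(5, d))
proj = rng.normal(size=(2 * d, out_dim))

descriptor = fuse_to_descriptor(img_tokens, txt_tokens, proj)
_, gate = recalibrate_text_tokens(img_tokens, txt_tokens)
```

The gate lets text tokens that disagree with the image content (for example, an inaccurate LVLM description) contribute less to the fused representation, and the final projection is where the descriptor dimension would be reduced for retrieval.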

Multi-modal Intent Detection with LVAMoE: the Language-Visual-Audio Mixture of Experts

MoVA: Adapting Mixture of Vision Experts to Multimodal Context

An Effective Multimodal Representation and Fusion Method for Multimodal Intent Recognition

MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models

EVLM: An Efficient Vision-Language Model for Visual Understanding

MMoE: Enhancing Multimodal Models with Mixtures of Multimodal Interaction Experts

VLP2MSA: Expanding Vision-Language Pre-Training to Multimodal Sentiment Analysis

Octavius: Mitigating Task Interference in MLLMs via LoRA-MoE

VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts

Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts

Multimodal Language Analysis with Recurrent Multistage Fusion

Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models

LVLM-empowered Multi-modal Representation Learning for Visual Place Recognition

MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

Video Emotion Open-vocabulary Recognition Based on Multimodal Large Language Model

E5-V: Universal Embeddings with Multimodal Large Language Models

A Multimodal Sentiment Analysis Approach Based on a Joint Chained Interactive Attention Mechanism

A Multi-Modal ELMo Model for Image Sentiment Recognition of Consumer Data

MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval

AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning

MoExtend: Tuning New Experts for Modality and Task Extension