Abstract: Text-based person retrieval aims to find images of a person of interest based on textual descriptions. The primary challenge in this task stems from the semantic gap caused by the difference in feature granularity between text (characterized by coarse-grained features) and images (characterized by fine-grained features). Previous works have used attention mechanisms to align the modalities or to learn a uniform representation, aiming to bridge this semantic gap. However, these methods suffer from two limitations: 1) attention-based methods overlook subtle yet valuable information; 2) the significant granularity gap between modalities makes learning a uniform representation time-consuming. To address these issues, we propose a Modal Complementarity framework based on a Multimodal Large Language Model (MLLM-MC), which designs prompts according to task characteristics and uses the multimodal abilities of the MLLM to produce elaborate textual descriptions for images. The textual descriptions generated by the MLLM complement the visual modality, thereby expanding the text-to-image retrieval task into text-to-composite-image retrieval. To extract more comprehensive feature information, MLLM-MC employs a dual-stream model structure with separate feature extractors for the visual and textual modalities. These extractors are further divided into basic and detailed extractors, enabling the capture of information at different levels of granularity. Furthermore, to address the modal gap, we propose an uncertainty modeling technique within the visual branch, aiming to extend the model's matching pattern from one-to-one to one-to-many. The features from modal fusion are aligned using a transformer-based fusion module and low-order multimodal alignment.
We conducted extensive experiments on three public datasets to evaluate the proposed MLLM-MC, achieving competitive Rank-1 accuracy of 68.58%, 62.66%, and 52.50% on CUHK-PEDES, ICFG-PEDES, and RSTPReid, respectively.
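
For readers unfamiliar with the metric, Rank-1 accuracy in text-to-image retrieval is commonly computed as the fraction of text queries whose most similar gallery image shares the query's person identity. The following is a minimal sketch of that computation, not the evaluation code behind the reported numbers. It assumes features have already been extracted, uses cosine similarity as the ranking score, and assumes all feature vectors are non-zero. The function name `rank1_accuracy` and the toy data are hypothetical.

```python
import math


def _normalize(vec):
    # Scale a non-zero vector to unit length so that the dot product
    # of two normalized vectors equals their cosine similarity.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def rank1_accuracy(text_feats, image_feats, text_ids, image_ids):
    """Fraction of text queries whose top-ranked image has the same identity.

    text_feats / image_feats: lists of equal-length feature vectors.
    text_ids / image_ids: person identity labels aligned with the features.
    """
    texts = [_normalize(t) for t in text_feats]
    images = [_normalize(v) for v in image_feats]
    hits = 0
    for query, person_id in zip(texts, text_ids):
        # Cosine similarity between this query and every gallery image.
        sims = [sum(a * b for a, b in zip(query, img)) for img in images]
        # Index of the most similar gallery image.
        best = max(range(len(sims)), key=sims.__getitem__)
        hits += image_ids[best] == person_id
    return hits / len(texts)


# Toy example with 2-D features (hypothetical data, for illustration only).
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gallery = [[0.9, 0.1], [0.1, 0.9], [-1.0, 0.0]]
acc = rank1_accuracy(queries, gallery, [0, 1, 2], [0, 1, 2])
print(f"Rank-1: {acc:.2%}")
```

In this toy example the first two queries retrieve the correct identity and the third does not, so two of three queries count as hits.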

Modal Complementarity Based on Multimodal Large Language Model for Text-Based Person Retrieval
