Abstract: Text-based person retrieval aims to find images of a person of interest based on a textual description. The primary challenge in this task stems from the semantic gap caused by the difference in feature granularity between text (characterized by coarse-grained features) and images (characterized by fine-grained features). Previous works have used attention mechanisms to align the modalities or to learn a uniform representation in order to bridge this gap. However, these methods suffer from two limitations: 1) attention-based methods overlook subtle yet valuable information; 2) the significant granularity gap between modalities makes learning a uniform representation time-consuming. To address these issues, we propose a Modal Complementarity framework based on a Multimodal Large Language Model (MLLM-MC), which designs prompts according to the characteristics of the task and uses the multimodal abilities of the MLLM to produce elaborate textual descriptions of images. The generated descriptions complement the visual modality, thereby extending the text-to-image retrieval task to text-to-composite-image retrieval. To extract more comprehensive feature information, MLLM-MC employs a dual-stream structure with separate feature extractors for the visual and textual modalities. These extractors are further divided into basic and detailed extractors, enabling the capture of information at different levels of granularity. Furthermore, to address the modal gap, we propose an uncertainty modeling technique within the visual branch that changes the model's matching pattern from one-to-one to one-to-many. The features from modal fusion are aligned using a transformer-based fusion module and low-order multimodal alignment.
We conducted extensive experiments on three public datasets to evaluate the proposed MLLM-MC, achieving competitive Rank-1 accuracy of 68.58%, 62.66%, and 52.50% on CUHK-PEDES, ICFG-PEDES, and RSTPReid, respectively.
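
For readers unfamiliar with the Rank-1 metric reported above, the following is a minimal sketch of how it is commonly computed for text-to-image retrieval: a text query counts as a hit when the most similar gallery image shares its identity label. The function names (`cosine`, `rank1_accuracy`), the use of cosine similarity, and the toy features are assumptions made for illustration; this is not the MLLM-MC evaluation code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank1_accuracy(query_feats, query_ids, gallery_feats, gallery_ids):
    """Fraction of text queries whose top-ranked gallery image has the same identity."""
    hits = 0
    for q_feat, q_id in zip(query_feats, query_ids):
        sims = [cosine(q_feat, g_feat) for g_feat in gallery_feats]
        best = max(range(len(sims)), key=sims.__getitem__)  # index of the top-1 image
        hits += gallery_ids[best] == q_id
    return hits / len(query_feats)


# Toy data (hypothetical): three text queries and three gallery images in 2-D.
text_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_ids = [0, 1, 2]
image_feats = [[0.9, 0.1], [0.2, 0.8], [1.0, -1.0]]
image_ids = [0, 1, 2]

score = rank1_accuracy(text_feats, text_ids, image_feats, image_ids)
print(f"Rank-1 accuracy: {score:.2%}")
```

In this toy example the first two queries retrieve the correct identity and the third does not, so two of three queries are hits.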

Multi-path Exploration and Feedback Adjustment for Text-to-Image Person Retrieval

Text-based person search via cross-modal alignment learning

Adaptive Uncertainty-Based Learning for Text-Based Person Retrieval

Modal Complementarity Based on Multimodal Large Language Model for Text-Based Person Retrieval

Adaptive and Collaborative Multi-scale Alignment for Text-Based Person Search

Enhancing Visual Representation for Text-based Person Searching

Multi-granularity Separation Network for Text-Based Person Retrieval with Bidirectional Refinement Regularization

See Finer, See More: Implicit Modality Alignment for Text-based Person Retrieval

DCEL: Deep Cross-modal Evidential Learning for Text-Based Person Retrieval

Hierarchical Gumbel Attention Network for Text-based Person Search

Cross-Modal Knowledge Adaptation for Language-Based Person Search

Multi‐level Cross‐modality Learning Framework for Text‐based Person Re‐identification

Knowing Where to Focus: Attention-Guided Alignment for Text-based Person Search

Text-Guided Visual Feature Refinement for Text-Based Person Search

Fine-Granularity Alignment for Text-Based Person Retrieval Via Semantics-Centric Visual Division

MGRL: Mutual-Guidance Representation Learning for Text-to-Image Person Retrieval

Improving Text-based Person Search via Part-level Cross-modal Correspondence

Joint Token and Feature Alignment Framework for Text-Based Person Search

Beat: Bi-directional One-to-Many Embedding Alignment for Text-based Person Retrieval

Image-Specific Information Suppression and Implicit Local Alignment for Text-based Person Search

Multi-level Part-aware Feature Disentangling for Text-based Person Search