Modal Complementarity Based on Multimodal Large Language Model for Text-Based Person Retrieval

Tong Bao, Tong Xu, Derong Xu, Zhi Zheng
DOI: https://doi.org/10.1007/978-981-97-7232-2_18
2024-01-01
Abstract: Text-based person retrieval aims to find images of a person of interest based on textual descriptions. The primary challenge in this task stems from the semantic gap caused by the difference in feature granularity between text (which is characterized by coarse-grained features) and images (which are known for their fine-grained features). Previous works have used attention mechanisms to align the modalities or to learn a uniform representation, aiming to bridge this semantic gap. However, these methods suffer from two limitations: 1) Attention-based methods overlook subtle yet valuable information. 2) The granularity gap between modalities is significant, which makes learning a uniform representation time-consuming. To address these issues, we propose a Modal Complementarity framework based on a Multimodal Large Language Model (MLLM-MC), which designs prompts according to task characteristics and uses the multimodal abilities of a Multimodal Large Language Model (MLLM) to produce elaborate textual descriptions for images. The textual descriptions generated by the MLLM serve as a complement to the visual modality, thereby expanding the text-to-image retrieval task into text-to-composite-image retrieval. To extract more comprehensive feature information, MLLM-MC employs a dual-stream model structure, which incorporates separate feature extractors for the visual and textual modalities. These extractors are further divided into basic and detailed extractors, enabling the capture of information at different levels of granularity. Furthermore, to address the modal gap, we propose an uncertainty modeling technique within the visual branch, aiming to shift the model's matching pattern from one-to-one to one-to-many. The features from modal fusion are aligned using a transformer-based fusion module and low-order multimodal alignment.
We conducted extensive experiments on three public datasets to evaluate the proposed MLLM-MC, achieving competitive Rank-1 accuracy of 68.58%, 62.66%, and 52.50% on CUHK-PEDES, ICFG-PEDES, and RSTPReid, respectively.
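For readers unfamiliar with the metric, the following is a minimal sketch of how Rank-k accuracy is commonly computed in text-to-image person retrieval. It is not the authors' evaluation code: the function names `cosine` and `rank_k_accuracy`, the use of cosine similarity, and the toy features are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_k_accuracy(text_feats, text_ids, image_feats, image_ids, k=1):
    """Fraction of text queries whose top-k ranked gallery images
    contain at least one image with the same person identity."""
    hits = 0
    for query, query_id in zip(text_feats, text_ids):
        # Rank gallery indices by descending similarity to the text query.
        ranked = sorted(
            range(len(image_feats)),
            key=lambda j: cosine(query, image_feats[j]),
            reverse=True,
        )
        if any(image_ids[j] == query_id for j in ranked[:k]):
            hits += 1
    return hits / len(text_feats)


# Hypothetical toy data: three text queries and three gallery images,
# each labeled with a person identity.
text_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_ids = [0, 1, 2]
image_feats = [[0.9, 0.1], [0.2, 0.8], [1.0, -1.0]]
image_ids = [0, 1, 2]

rank1 = rank_k_accuracy(text_feats, text_ids, image_feats, image_ids, k=1)
rank3 = rank_k_accuracy(text_feats, text_ids, image_feats, image_ids, k=3)
print(f"Rank-1: {rank1:.4f}, Rank-3: {rank3:.4f}")
```

In this toy setup, two of the three queries retrieve the correct identity at the top position, so Rank-1 is 2/3. A reported figure such as 68.58% on CUHK-PEDES would correspond to the same ratio computed over that dataset's test queries and image gallery.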