BCRA: bidirectional cross-modal implicit relation reasoning and aligning for text-to-image person retrieval

Zhaoqi Li, Yongping Xie
DOI: https://doi.org/10.1007/s00530-024-01372-2
IF: 3.9
2024-06-16
Multimedia Systems
Abstract: Text-to-image person retrieval aims to retrieve target individuals that match a given textual description. The main challenge of this task is how to better combine and align the features of the text and image modalities. Previous efforts have introduced masked language modeling (MLM) to implicitly enhance multimodal representation, with some progress. However, masked image modeling (MIM) appears to be underestimated in this task. Therefore, we propose BCRA, a bidirectional cross-modal implicit relationship inference and alignment framework that introduces MIM as a supplement to the MLM task. First, we integrate the MIM and MLM tasks. Building on this foundation, and to enhance multimodal interaction, we further investigate the impact of global and local visual features on the MLM task and construct a new cross-attention module. Additionally, we observe that image masks and language masks are themselves powerful means of data augmentation, so we directly reuse the masked data from other modules during model training for cross-modal multi-view learning. Combining the bidirectional masking strategy with the other modules improves the accuracy and robustness of the model. The proposed approach achieves state-of-the-art results on all three public datasets and, compared with existing methods, offers faster speed, fewer parameters, and no dependence on additional datasets.
computer science, information systems, theory & methods
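The abstract notes that image masks and language masks can double as data augmentation. The following is a minimal, illustrative sketch of that idea only, not the authors' implementation: the function names, the `[MASK]` placeholder, and the masking ratios are all assumptions chosen for the example, and the paper's actual masking scheme is not specified in the abstract.

```python
import random

MASK_TOKEN = "[MASK]"  # hypothetical placeholder; the real tokenizer's mask token may differ


def mask_tokens(tokens, ratio=0.15, rng=None):
    """Replace a random subset of caption tokens with MASK_TOKEN.

    Returns the masked copy and the sorted indices that were masked.
    The ratio is an assumed value, not taken from the paper.
    """
    if rng is None:
        rng = random.Random()
    count = max(1, round(len(tokens) * ratio)) if tokens else 0
    indices = sorted(rng.sample(range(len(tokens)), count))
    masked = list(tokens)
    for i in indices:
        masked[i] = MASK_TOKEN
    return masked, indices


def mask_patches(num_patches, ratio=0.5, rng=None):
    """Return a boolean list where True marks an image patch that is hidden.

    The ratio is again an assumed value for illustration.
    """
    if rng is None:
        rng = random.Random()
    hidden = set(rng.sample(range(num_patches), round(num_patches * ratio)))
    return [i in hidden for i in range(num_patches)]


rng = random.Random(0)  # fixed seed so the example is repeatable
caption = "a woman wearing a red coat and black boots".split()
masked_caption, masked_idx = mask_tokens(caption, ratio=0.3, rng=rng)
patch_mask = mask_patches(num_patches=16, ratio=0.5, rng=rng)
```

In a setup like the one the abstract describes, the masked caption and the masked patch grid could serve both as prediction targets (MLM and MIM) and as additional views of the same image-text pair.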