Scene-text aware cross-modal retrieval based on semantic matching (ChinaMM2024)

Suyan Cheng, Feifei Zhang, Xi Zhang, Zhuo Sun
DOI: https://doi.org/10.1007/s00530-024-01481-y
IF: 3.9
2024-09-23
Multimedia Systems
Abstract: In the real world, scene text, as an essential information medium, contains rich and intuitive information about natural scenes. Current cross-modal retrieval studies focus on establishing effective semantic links between images and texts. However, these studies often ignore the additional modality of scene text. Directly integrating scene text information may interfere with the model's understanding of the image–text relationship, resulting in degraded retrieval performance when dealing with scenes containing critical text information. To overcome this challenge, we propose a novel scene-text-based image–text retrieval network. This network enhances the understanding of visual semantics and improves retrieval performance by adding scene text as an additional modality. Specifically, we measure the similarity between scene text and caption at both the word and sentence levels. In addition, we apply a stacked cross-attention mechanism to help the model recognize the most important words or sentences in the current scene, thus enhancing the overall understanding and retrieval of the scene. We also acquire global features of images and texts through multi-level pooling operations to strengthen the ability to capture hierarchical global information. Finally, we combine the similarity of the two aspects to derive the final retrieval results. Extensive experiments on the TextCap and CTC benchmark datasets show that our approach exhibits excellent performance compared to existing state-of-the-art methods.
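The word-level and sentence-level matching between scene text and caption described in the abstract can be illustrated with a minimal sketch. This is not the authors' network: the function names, the max-then-mean word matching, the mean pooling, and the fusion weight `alpha` are all assumptions chosen for illustration, and the abstract does not specify how the scores are actually computed or combined.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def mean_pool(vectors):
    """Average a list of word embeddings into one sentence-level vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def word_level_similarity(scene_words, caption_words):
    """Assumed word-level score: for each caption word, take its best
    cosine match among the scene-text words, then average those maxima."""
    best = [max(cosine(c, s) for s in scene_words) for c in caption_words]
    return sum(best) / len(best)

def sentence_level_similarity(scene_words, caption_words):
    """Assumed sentence-level score: cosine between mean-pooled embeddings."""
    return cosine(mean_pool(scene_words), mean_pool(caption_words))

def fused_similarity(scene_words, caption_words, alpha=0.5):
    """Hypothetical fusion: weighted sum of the two scores (alpha is an assumption)."""
    word_score = word_level_similarity(scene_words, caption_words)
    sentence_score = sentence_level_similarity(scene_words, caption_words)
    return alpha * word_score + (1.0 - alpha) * sentence_score

# Toy 2-d embeddings: two scene-text words, one caption word.
scene = [[1.0, 0.0], [0.0, 1.0]]
caption = [[1.0, 0.0]]
score = fused_similarity(scene, caption)
```

In this toy case the caption word matches one scene-text word exactly, so the word-level score is high, while the pooled sentence vectors only partly align. That gap is the kind of signal a two-level comparison is meant to capture.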
computer science, information systems, theory & methods