Cross-modal Semantic Interference Suppression for image-text matching
Tao Yao, Shouyong Peng, Yujuan Sun, Guorui Sheng, Haiyan Fu, Xiangwei Kong
DOI: https://doi.org/10.1016/j.engappai.2024.108005
IF: 8
2024-07-01
Engineering Applications of Artificial Intelligence
Abstract: Image-text matching, which aims to precisely measure the visual-semantic similarity between images and texts, is a fundamental research topic in the multimedia analysis domain. Current methods have achieved impressive performance by taking advantage of the Transformer architecture. However, most of them consider only inter-modal relationships when mining image-text semantic correspondences, which makes it hard for them to measure similarity accurately when facing similar images and texts, owing to cross-modal semantic interference. To tackle this issue, we propose a Cross-Modal Semantic Interference Suppression (CMSIS) method, which incorporates intra-modal fine-grained semantics and unmatched segments to suppress the semantic influence caused by similar heterogeneous data points. The intra-modal fine-grained semantics are used to push similar images or texts apart in the learned latent embedding space for better matching results. To further suppress cross-modal semantic interference among similar data points, unmatched segments, which can provide explicit clues for distinguishing similar images or texts, are also adopted. Experimental results on two popular multimodal datasets demonstrate that the proposed CMSIS outperforms a range of baselines.
automation & control systems; computer science, artificial intelligence; engineering, electrical & electronic; engineering, multidisciplinary