DSGSR - Dynamic Semantic Generation and Similarity Reasoning for Image-Text Matching.

Xiaojing Li, Bin Wang, Xiaohong Zhang, Xiaochun Yang
DOI: https://doi.org/10.1007/978-3-030-93046-2_15
2021-01-01
Abstract: Cross-modal image-text matching is vital for building relationships between vision and language. The biggest challenge is to eliminate the heterogeneity between images and text. Existing fine-grained image-text matching methods have made great progress in exploring fine-grained correspondence. However, they rely solely on cross-attention, which overlooks the semantics of image regions and the dynamic nature of image-text matching. In this paper, we propose a novel Dynamic Semantic Generation and Similarity Reasoning (DSGSR) network model for image-text matching. Specifically, we use intra-modal relations to enrich the regional features of the image. Then, to support dynamic cross-modal matching, we dynamically generate the query text or image representation according to the retrieved image or text representation. We also introduce a Graph Convolutional Network (GCN) to account for the influence of neighboring nodes on matching accuracy when measuring image-text similarity. Extensive experiments and analyses show that the DSGSR model surpasses state-of-the-art methods on the Flickr30K and MSCOCO datasets.
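The abstract does not give the exact GCN formulation used in DSGSR, so the following is only a minimal sketch of a generic graph-convolution step in the standard form ReLU(Â H W), where Â is the adjacency matrix with self-loops added and symmetrically normalized. It illustrates how each node's representation can absorb information from its neighbors. The function names (`normalize_adjacency`, `matmul`, `gcn_layer`) and the toy inputs are assumptions made for illustration and are not taken from the paper.

```python
import math


def normalize_adjacency(adj):
    """Add self-loops, then apply symmetric normalization D^-1/2 (A + I) D^-1/2."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain dense matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def gcn_layer(adj, features, weight):
    """One generic graph-convolution step: ReLU(A_norm @ H @ W).

    Hypothetical illustration only; the paper's layer may differ.
    """
    propagated = matmul(normalize_adjacency(adj), features)  # mix in neighbors
    transformed = matmul(propagated, weight)                 # linear projection
    return [[max(0.0, v) for v in row] for row in transformed]


# Toy graph: two connected nodes, each carrying a one-dimensional feature.
adjacency = [[0.0, 1.0], [1.0, 0.0]]
node_features = [[1.0], [3.0]]
projection = [[1.0]]
print(gcn_layer(adjacency, node_features, projection))
```

In this toy case each node ends up with the average of its own feature and its neighbor's, which is the kind of neighbor influence the abstract refers to when measuring similarity.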