Dual Relation-Aware Synergistic Attention Network for Image-Text Matching

Shanshan Qi, Luxi Yang, Chunguo Li, Yongming Huang
DOI: https://doi.org/10.1109/icccas55266.2022.9824715
2022-01-01
Abstract: Image-text matching plays an essential role in bridging the semantic gap between language and vision. It remains challenging because previous methods lack a detailed comprehension of the contextual interplay reflected in the diverse visual relationships between objects. In this work, we present a novel dual relation-aware synergistic attention (DRSA) network that produces visual representations incorporating the crucial semantic concepts and salient objects of image scenes. First, we represent each image as two subgraphs and model multi-type inter-object interactions using a sentence-guided graph attention mechanism. Specifically, two types of visual relations are exploited: implicit relations, which extract the latent dynamics between objects, and explicit relations, which capture semantic dependencies and relative geometric positions. Second, a synergistic fusion module, which works like multi-head attention, adaptively merges the implicit, explicit, and all-mixed relation features based on the sentence context. Third, adversarial learning is conducted to reinforce the interaction between the implicit and explicit relation encoding modules and to explore more effective multi-view associations between images and sentences. Experiments show that DRSA outperforms existing state-of-the-art approaches on two widely used datasets (MSCOCO and Flickr30K), demonstrating the efficacy of the proposed matching technique.
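The abstract does not give the equations of the synergistic fusion module, so the following is only a minimal sketch of the general idea of sentence-conditioned fusion, not the authors' implementation. It assumes each relation view (implicit, explicit, all-mixed) has already been pooled into a single feature vector, and that the weights come from scaled dot-product scores against a sentence embedding. The function name `fuse_relation_features` and all variable names are hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def fuse_relation_features(implicit, explicit, mixed, sentence):
    """Merge three relation-aware image features with sentence-dependent weights.

    Assumption (not from the paper): each input is a 1-D vector of the same
    dimension d, and a view's weight is the softmax of its scaled dot product
    with the sentence embedding.
    """
    views = np.stack([implicit, explicit, mixed])            # shape (3, d)
    scores = views @ sentence / np.sqrt(sentence.shape[0])   # shape (3,)
    weights = softmax(scores)                                # sums to 1
    fused = weights @ views                                  # shape (d,)
    return fused, weights


# Toy usage with random vectors standing in for real features.
rng = np.random.default_rng(0)
d = 8
implicit, explicit, mixed, sentence = rng.normal(size=(4, d))
fused, weights = fuse_relation_features(implicit, explicit, mixed, sentence)
```

In this sketch, the view whose feature aligns best with the sentence embedding receives the largest weight, which is one simple way to make the merge "adaptive" to the sentence context.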