Deep Attentional Fine-Grained Similarity Network with Adversarial Learning for Cross-Modal Retrieval

Cheng Qingrong, Gu Xiaodong
DOI: https://doi.org/10.1007/s11042-020-09450-z
IF: 2.577
2020-01-01
Multimedia Tools and Applications
Abstract: Multimedia devices and technologies have developed rapidly in recent years. Finding interesting and highly relevant information in massive multimedia data has become an urgent and challenging problem. To obtain more accurate retrieval results, researchers naturally turn to more fine-grained features to evaluate the similarity among multimedia samples. In this paper, we propose a Deep Attentional Fine-grained Similarity Network (DAFSN) for cross-modal retrieval, which is optimized in an adversarial learning manner. The DAFSN model consists of two subnetworks: an attentional fine-grained similarity network for aligned representation learning, and a modal discriminative network. The former subnetwork adopts a bidirectional Long Short-Term Memory (LSTM) network and a pre-trained Inception-v3 model to extract text features and image features, respectively. In aligned representation learning, we consider not only the sentence-level pair-matching constraint but also the fine-grained similarity between the word-level features of a text description and the sub-regional features of an image. The modal discriminative network aims to minimize the “heterogeneity gap” between text features and image features in an adversarial manner. We conduct experiments on several widely used datasets to verify the performance of the proposed DAFSN. The experimental results show that DAFSN achieves better retrieval results as measured by the MAP metric. In addition, result analyses and visual comparisons are presented in the experimental section.
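The abstract does not give the formula used for the fine-grained similarity, so the following is only a minimal sketch of one common way to score word-level text features against sub-regional image features with attention. It is not the DAFSN formulation. The function names, the `temperature` parameter, the softmax attention over regions, and the averaging of per-word cosine scores are all assumptions made for illustration.

```python
import numpy as np

EPS = 1e-8  # guards against division by zero (illustrative choice)


def cosine_matrix(words, regions):
    """Pairwise cosine similarity between word and region features.

    words:   (T, D) array, one row per word feature
    regions: (R, D) array, one row per image sub-region feature
    returns: (T, R) array
    """
    w = words / (np.linalg.norm(words, axis=1, keepdims=True) + EPS)
    r = regions / (np.linalg.norm(regions, axis=1, keepdims=True) + EPS)
    return w @ r.T


def fine_grained_similarity(words, regions, temperature=5.0):
    """Hypothetical attention-based word-to-region matching score.

    Each word attends over all regions, and the score is the mean cosine
    similarity between each word and its attended region context.
    This is an assumed scheme, not the one defined in the paper.
    """
    sim = cosine_matrix(words, regions)                # (T, R)
    logits = temperature * sim
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=1, keepdims=True)      # softmax over regions
    context = attn @ regions                           # (T, D)
    num = (words * context).sum(axis=1)
    den = np.linalg.norm(words, axis=1) * np.linalg.norm(context, axis=1) + EPS
    return float((num / den).mean())


# Toy features: two "words" and two "regions" in a 3-dimensional space.
words = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
matched_regions = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
unrelated_regions = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])

matched_score = fine_grained_similarity(words, matched_regions)
unrelated_score = fine_grained_similarity(words, unrelated_regions)
```

In this toy setup the matched pair scores higher than the unrelated pair. In a real system the word features would come from a text encoder and the region features from an image encoder, which are not modelled here.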