Generative Adversarial and Self-Attention Based Fine-Grained Cross-Media Retrieval

Jin Hong, Haonan Luo, Yazhou Yao, Zhenmin Tang
DOI: https://doi.org/10.1145/3448823.3448825
2020-01-01
Abstract: Deep convolutional neural networks have recently shown an impressive ability to perform fine-grained cross-media retrieval. However, existing fine-grained cross-media retrieval algorithms offer comparatively low retrieval accuracy and are hard to apply in practice, for three reasons. First, videos contain many noise frames, which can degrade feature extraction. Second, existing algorithms treat all modalities indiscriminately, ignoring the characteristics of each modality, for example the sequential nature of text. Third, the lack of joint semantic space learning limits retrieval accuracy. To overcome these drawbacks, we propose a novel fine-grained cross-media retrieval algorithm based on a generative adversarial network and a self-attention mechanism. Our approach first removes noise frames from the videos with a spatial cluster filtering algorithm to obtain cleaner video data. We then extract features for each modality; notably, text features are extracted by a self-attention based LSTM structure. Finally, a generative adversarial network is used to learn a common semantic space for the features of all modalities. Experimental evaluations on the new benchmark FGCrossNet demonstrate improved results compared with counterpart methods. The source code, models, and data have been made anonymously available at https://github.com/gasanet/GASA.
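The abstract mentions that text features come from a self-attention based LSTM but does not give the formulation. As a rough illustration of the general idea only, the sketch below pools a sequence of per-token hidden states into a single text vector using plain dot-product attention. The function name `attention_pool`, the fixed `query` vector, and the toy hidden states are all assumptions made for illustration; they are not taken from the paper or its released code, and the authors' actual attention layer may differ.

```python
import math


def attention_pool(hidden_states, query):
    """Pool per-token hidden states into one vector with dot-product attention.

    hidden_states: list of equal-length float lists (hypothetical LSTM outputs).
    query: float list of the same length (a learned vector in a real model;
           fixed here, as an assumption, to keep the sketch self-contained).
    Returns (pooled_vector, attention_weights).
    """
    # Score each time step by its dot product with the query.
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in hidden_states]

    # Softmax over time steps, subtracting the max for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the hidden states gives the pooled text feature.
    dim = len(hidden_states[0])
    pooled = [
        sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)
    ]
    return pooled, weights


# Toy stand-in for three LSTM hidden states of dimension 2.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 1.0]
pooled, weights = attention_pool(states, query)
```

In this toy input the third state aligns best with the query, so it receives the largest weight; in a trained model the query (or a small scoring network) would be learned jointly with the LSTM.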