Two-Stream Convolutional Neural Network For Multimodal Matching

Youcai Zhang,Yiwei Gu,Xiaodong Gu
DOI: https://doi.org/10.1007/978-3-030-01418-6_2
2018-01-01
Abstract:Mulitimudal matching aims to establish relationship across different modalities such as image and text. Existing works mainly focus on maximizing the correlation between feature vectors extracted from the off-the-shelf models. The feature extraction and the matching are two-stage learning process. This paper presents a novel two-stream convolutional neural network that integrates the feature extraction and the matching under an end-to-end manner. Visual and textual stream are designed for feature extraction and then are concatenated with multiple shared layers for multimodal matching. The network is trained using an extreme multiclass classification loss by viewing each multimodal data as a class. Then a finetuning step is performed by a ranking constraint. Experimental results on Flickr30k datasets demonstrate the effectiveness of the proposed network for multimodal matching.
What problem does this paper attempt to address?