Learning Joint Multimodal Representation Based On Multi-Fusion Deep Neural Networks

Zepeng Gu, Bo Lang, Tongyu Yue, Lei Huang
DOI: https://doi.org/10.1007/978-3-319-70096-0_29
2017-01-01
Abstract: Learning joint representations of multimodal data has recently received increasing attention. Multimodal features are concept-level composite features that are more effective than single-modality features. Most existing methods mine the interactions between modalities only once, at the top of their networks, to learn a multimodal representation. In this paper, we propose a multi-fusion deep learning framework that learns semantically richer multimodal features. The framework sets multiple fusing points at different levels of the feature spaces, then integrates the fusing information and passes it step by step from the low level to higher levels. Moreover, we propose a multi-channel decoding network with an alternate fine-tuning strategy to fully mine modality-specific information and cross-modality correlations. We are also the first to introduce deep learning features into multimodal deep learning, which alleviates the semantic and statistical differences between modalities and leads to better features. Extensive experiments on real-world datasets demonstrate that our proposed method achieves superior performance compared with state-of-the-art methods.
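The idea of fusing at several levels, rather than once at the top, can be illustrated with a minimal forward-pass sketch. This is not the authors' architecture: the two-modality setup, the layer sizes, the tanh activation, the concatenation-based fusion, and all names (`dense`, `build_params`, `multi_fusion_forward`) are assumptions made for illustration only. The decoding network and the fine-tuning strategy are not covered.

```python
import math
import random


def dense(x, w):
    """Bias-free linear layer followed by tanh; w has shape [len(x)][out_dim]."""
    out_dim = len(w[0])
    return [
        math.tanh(sum(x[i] * w[i][j] for i in range(len(x))))
        for j in range(out_dim)
    ]


def init(rng, n_in, n_out):
    """Small random weight matrix (hypothetical initialisation)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_out)] for _ in range(n_in)]


def build_params(rng, dim_a, dim_b, hidden, fuse_dim, n_levels):
    """One modality-specific layer per modality plus one fusion layer per level."""
    params, prev_fused = [], 0
    for _ in range(n_levels):
        params.append({
            "a": init(rng, dim_a, hidden),
            "b": init(rng, dim_b, hidden),
            # The fusion layer sees both modalities and the previous fused code.
            "fuse": init(rng, 2 * hidden + prev_fused, fuse_dim),
        })
        dim_a = dim_b = hidden
        prev_fused = fuse_dim
    return params


def multi_fusion_forward(xa, xb, params):
    """Fuse at every level, passing the fused code upward step by step."""
    fused = []  # no fused information exists below the first level
    for level in params:
        xa = dense(xa, level["a"])
        xb = dense(xb, level["b"])
        fused = dense(xa + xb + fused, level["fuse"])  # list concatenation
    return fused


rng = random.Random(0)
params = build_params(rng, dim_a=8, dim_b=5, hidden=4, fuse_dim=3, n_levels=3)
xa = [rng.random() for _ in range(8)]  # toy features for modality A
xb = [rng.random() for _ in range(5)]  # toy features for modality B
joint = multi_fusion_forward(xa, xb, params)
```

In this sketch, a network that fuses only once at the top would correspond to computing `fused` only after the final level; here each level's fusion layer also receives the fused code from the level below.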