Dynamic Interaction Networks for Image-Text Multimodal Learning

Wenshan Wang, Pengfei Liu, Su Yang, Weishan Zhang
DOI: https://doi.org/10.1016/j.neucom.2019.10.103
IF: 6
2019-01-01
Neurocomputing
Abstract: Recently, there has been a surge of interest in image-text multimodal representation learning, and many neural-network-based models have been proposed to capture the interaction between the two modalities with different forms of functions. Despite their success, a potential limitation of these methods is that a set of static parameters is insufficient to model all kinds of interactions. To alleviate this problem, we present a dynamic interaction network, in which the parameters of the interaction function are dynamically generated by a meta network. Additionally, to provide the multimodal features that the meta network needs, we propose a new neural module called Multimodal Transformer. Experimentally, we not only conduct a comprehensive quantitative evaluation on four image-text tasks, but also present interpretable analyses of our models, revealing the internal working mechanism of the dynamic parameter learning.
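The core idea in the abstract, an interaction function whose parameters are generated per input by a meta network instead of being fixed, can be illustrated with a small sketch. This is not the authors' architecture: the class `MetaNetwork`, the function `dynamic_interaction`, the single linear layer, the `tanh` squashing, and the plain concatenation standing in for the Multimodal Transformer features are all assumptions made for illustration only.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class MetaNetwork:
    """Hypothetical meta network: maps a joint image-text feature to the
    weight matrix of the interaction function (an assumed design)."""

    def __init__(self, joint_dim, out_dim, seed=0):
        rng = random.Random(seed)
        self.joint_dim = joint_dim
        self.out_dim = out_dim
        # One linear layer emitting out_dim * joint_dim values, i.e. one
        # value per entry of the generated interaction matrix.
        self.weights = [
            [rng.uniform(-0.1, 0.1) for _ in range(joint_dim)]
            for _ in range(out_dim * joint_dim)
        ]

    def generate(self, joint_feature):
        """Return an out_dim x joint_dim matrix conditioned on the input."""
        flat = [math.tanh(v) for v in matvec(self.weights, joint_feature)]
        return [
            flat[r * self.joint_dim:(r + 1) * self.joint_dim]
            for r in range(self.out_dim)
        ]


def dynamic_interaction(image_feature, text_feature, meta_network):
    """Apply an interaction whose parameters depend on the current pair."""
    # Assumption: concatenation stands in for the multimodal features that
    # the paper obtains from its Multimodal Transformer module.
    joint = image_feature + text_feature
    generated = meta_network.generate(joint)
    # A static model would reuse one fixed matrix here; the dynamic variant
    # uses the matrix just generated for this specific input pair.
    return matvec(generated, joint)


meta = MetaNetwork(joint_dim=4, out_dim=3, seed=0)
fused = dynamic_interaction([1.0, 0.5], [0.2, -0.3], meta)
print(len(fused))
```

Under these assumptions, two different image-text pairs passed through the same `meta` object are combined with different interaction matrices, which is the contrast with static parameters that the abstract draws.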