Gated multimodal networks

John Arevalo, Thamar Solorio, Manuel Montes-y-Gómez, Fabio A. González
DOI: https://doi.org/10.1007/s00521-019-04559-1
2020-01-15
Neural Computing and Applications
Abstract: This paper considers the problem of leveraging multiple sources of information or data modalities (e.g., images and text) in neural networks. We define a novel model called the gated multimodal unit (GMU), designed as an internal unit in a neural network architecture whose purpose is to find an intermediate representation based on a combination of data from different modalities. The GMU uses multiplicative gates to learn how each modality influences the activation of the unit. It can be used as a building block for different kinds of neural networks and can be seen as a form of intermediate fusion. The model was evaluated on two multimodal learning tasks in conjunction with fully connected and convolutional neural networks. Compared with other early- and late-fusion methods, the GMU achieves higher classification scores on two benchmark datasets: MM-IMDb and DeepScene.
computer science, artificial intelligence
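
The multiplicative gating described in the abstract can be illustrated with a small sketch. The abstract gives no equations, so the form below is an assumption: two modalities (labelled v and t here), tanh hidden activations, and a sigmoid gate computed from the concatenated inputs. The names gmu_forward and random_matrix, and all weight shapes, are hypothetical and chosen only for illustration.

```python
import math
import random


def _matvec(W, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def _sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def gmu_forward(x_v, x_t, W_v, W_t, W_z):
    """Assumed bimodal gated fusion (not taken from the abstract itself).

    h_v = tanh(W_v x_v)            hidden features of modality v
    h_t = tanh(W_t x_t)            hidden features of modality t
    z   = sigmoid(W_z [x_v; x_t])  gate computed from both raw inputs
    h   = z * h_v + (1 - z) * h_t  element-wise multiplicative gating
    """
    h_v = [math.tanh(a) for a in _matvec(W_v, x_v)]
    h_t = [math.tanh(a) for a in _matvec(W_t, x_t)]
    z = [_sigmoid(a) for a in _matvec(W_z, x_v + x_t)]  # list concat = [x_v; x_t]
    h = [zi * hv + (1.0 - zi) * ht for zi, hv, ht in zip(z, h_v, h_t)]
    return h, z


def random_matrix(rows, cols, rng):
    """Small random weights, standing in for learned parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    x_v = [rng.uniform(-1, 1) for _ in range(4)]  # e.g. image features
    x_t = [rng.uniform(-1, 1) for _ in range(3)]  # e.g. text features
    hidden = 2
    W_v = random_matrix(hidden, 4, rng)
    W_t = random_matrix(hidden, 3, rng)
    W_z = random_matrix(hidden, 4 + 3, rng)
    h, z = gmu_forward(x_v, x_t, W_v, W_t, W_z)
    print("gate:", z)
    print("fused:", h)
```

In this sketch each gate value lies between 0 and 1, so every fused component is a convex combination of the two modality features: a gate near 1 lets modality v dominate, and a gate near 0 lets modality t dominate. In a trained network the weights would be learned jointly with the rest of the model rather than sampled at random.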