Hierarchical Attention-Based Multimodal Fusion for Video Captioning

Chunlei Wu, Yiwei Wei, Xiaoliang Chu, Sun Weichen, Fei Su, Leiquan Wang
DOI: https://doi.org/10.1016/j.neucom.2018.07.029
IF: 6
2018-01-01
Neurocomputing
Abstract: Attention-based encoder-decoder models have shown great success in video captioning. Recent multimodal video captioning work has mainly focused on applying the attention mechanism to all modalities and fusing them at the same level. However, the connections among specific modalities have not been investigated in the fusion process. In this paper, the expressivity of each individual modality is first investigated. Given the characteristics of the attention mechanism, instance-level visual content is exploited to refine the temporal features. Then, a semantic detection architecture based on CNN+RNN is applied to the spatiotemporal content to exploit the correlations between semantic labels and obtain a better semantic representation of the video. Finally, a hierarchical attention-based multimodal fusion model for video captioning is proposed that jointly considers the intrinsic properties of the multimodal features. Experimental results on the MSVD and MSR-VTT datasets show that the proposed method achieves competitive performance compared with related video captioning methods.
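To make the contrast between same-level fusion and hierarchical fusion concrete, the following is a minimal sketch of a two-level attention scheme: attention is first applied within each modality, and a second attention step then weights the resulting per-modality contexts. This is an illustration only and not the architecture from the paper, whose details the abstract does not give. The dot-product scoring, the function names (attend, hierarchical_fusion), the modality names, and the toy feature values are all assumptions made for this example.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, features):
    # Soft attention (assumed dot-product scoring): weight each feature
    # vector by its similarity to the query and return the weighted sum.
    weights = softmax([dot(query, f) for f in features])
    dim = len(features[0])
    context = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return context, weights


def hierarchical_fusion(query, modalities):
    # modalities: dict mapping a (hypothetical) modality name to a list of
    # feature vectors, all with the same dimension as the query.
    names = list(modalities)
    # Level 1: attend within each modality to get one context per modality.
    contexts = [attend(query, modalities[name])[0] for name in names]
    # Level 2: attend across the per-modality contexts.
    fused, modality_weights = attend(query, contexts)
    return fused, dict(zip(names, modality_weights))


# Toy example: the query stands in for a decoder state.
query = [1.0, 0.0]
modalities = {
    "temporal": [[1.0, 0.0], [0.0, 1.0]],
    "semantic": [[0.5, 0.5], [0.5, 0.5]],
}
fused, weights = hierarchical_fusion(query, modalities)
```

In this toy setting the second-level weights form a distribution over modalities, so the fused vector can lean on whichever modality best matches the current query rather than treating all modalities equally.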