Describing Videos Using Multi-modal Fusion.

Qin Jin, Jia Chen, Shizhe Chen, Yifan Xiong, Alexander G. Hauptmann
DOI: https://doi.org/10.1145/2964284.2984065
2016-01-01
Abstract: Describing videos with natural language is one of the ultimate goals of video understanding. Videos record multi-modal information, including image, motion, aural, and speech content. The MSR Video to Language Challenge provides a good opportunity to study multi-modality fusion in the captioning task. In this paper, we propose a multi-modal fusion encoder and integrate it with a text sequence decoder into an end-to-end video captioning framework. Features from the visual, aural, speech, and meta modalities are fused together to represent the video contents. Long Short-Term Memory Recurrent Neural Networks (LSTM-RNNs) are then used as the decoder to generate natural language sentences. Experimental results show the effectiveness of the multi-modal fusion encoder trained in the end-to-end framework, which achieved top performance in both the common-metric evaluation and the human evaluation.