Deep Hierarchical Attention Network for Video Description

Shuohao Li, Min Tang, Jun Zhang
DOI: https://doi.org/10.1117/1.jei.27.2.023027
IF: 0.829
2018-01-01
Journal of Electronic Imaging
Abstract: Pairing video with natural language description remains a challenge in computer vision and machine translation. Inspired by image description, which uses an encoder-decoder model to reduce a visual scene to a single sentence, we propose a deep hierarchical attention network for video description. The proposed model uses a convolutional neural network (CNN) and a bidirectional LSTM network as encoders, while a hierarchical attention network is used as the decoder. Compared with the encoder-decoder models previously used in video description, the bidirectional LSTM network can capture the temporal structure among video frames. Moreover, the hierarchical attention network has an advantage over a single-layer attention network in global context modeling. To make a fair comparison with other methods, we evaluate the proposed architecture with different types of CNN structures and decoders. Experimental results on the standard datasets show that our model outperforms state-of-the-art techniques. © 2018 SPIE and IS&T
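The abstract does not spell out how the hierarchical attention decoder is built, so the following is only a minimal sketch of one possible reading, not the authors' model. It assumes that per-frame feature vectors (for example, encoder outputs) are already available, that "hierarchical" means attending first over frames within fixed-length segments and then over the resulting segment summaries, and that scores are plain dot products. The names attend, hierarchical_attend, and segment_len are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, features):
    """Single-layer soft attention (assumed dot-product scoring).

    Returns the weighted average of `features` and the attention weights.
    """
    weights = softmax([dot(query, f) for f in features])
    dim = len(features[0])
    context = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return context, weights


def hierarchical_attend(query, frame_features, segment_len):
    """Two-level attention (an assumption, not taken from the paper).

    Level 1 attends over the frames inside each segment; level 2 attends
    over the per-segment context vectors produced by level 1.
    """
    segments = [frame_features[i:i + segment_len]
                for i in range(0, len(frame_features), segment_len)]
    segment_contexts = [attend(query, seg)[0] for seg in segments]
    return attend(query, segment_contexts)
```

In a decoder, query would typically be the current hidden state, and the returned context vector would be fed into the next word prediction. Calling hierarchical_attend(query, frames, segment_len=2) on four frames yields one context vector and two segment-level weights that sum to 1.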