Residual Attention-Based LSTM for Video Captioning

Xiangpeng Li, Zhilong Zhou, Lijiang Chen, Lianli Gao
DOI: https://doi.org/10.1007/s11280-018-0531-z
2018-01-01
World Wide Web
Abstract: Recent work on video captioning has achieved great success with hierarchical LSTM frameworks such as stacked LSTM networks. However, once deeper LSTM stacks are able to start converging, a degradation problem is exposed: as the number of LSTM layers increases, accuracy saturates and then degrades rapidly, as it does in standard deep convolutional networks such as VGG. In this paper, we propose a novel attention-based framework, the Residual Attention-based LSTM (Res-ATT), which not only takes advantage of the existing attention mechanism but also accounts for sentence-internal information that is usually lost in the transmission process. Our key novelty is showing how to integrate residual mapping into a hierarchical LSTM network to solve the degradation problem. More specifically, our hierarchical architecture builds on two LSTM layers, and residual mapping is introduced to avoid losing information about previously generated words (i.e., both content information and relationship information). Experimental results on the mainstream datasets MSVD and MSR-VTT show that our framework outperforms state-of-the-art approaches. Furthermore, our automatically generated sentences provide more detailed information to precisely describe a video.
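The idea of residual mapping in a two-layer LSTM can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify where the shortcut is placed, so the code below assumes the word embedding is added to the output of the second LSTM layer, omits the attention mechanism entirely, and assumes the embedding and hidden sizes are equal so the addition is well defined. The names `TinyLSTMCell` and `residual_two_layer_step` are hypothetical and exist only for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyLSTMCell:
    """Minimal LSTM cell on plain Python lists (illustrative only)."""

    def __init__(self, input_size, hidden_size, rng):
        n = input_size + hidden_size
        # One weight matrix and bias per gate: input, forget, output, candidate.
        self.W = {
            gate: [[rng.uniform(-0.1, 0.1) for _ in range(n)]
                   for _ in range(hidden_size)]
            for gate in "ifog"
        }
        self.b = {gate: [0.0] * hidden_size for gate in "ifog"}

    def _affine(self, gate, z):
        return [sum(w * v for w, v in zip(row, z)) + bias
                for row, bias in zip(self.W[gate], self.b[gate])]

    def step(self, x, h, c):
        z = x + h  # concatenate input and previous hidden state
        i = [sigmoid(v) for v in self._affine("i", z)]
        f = [sigmoid(v) for v in self._affine("f", z)]
        o = [sigmoid(v) for v in self._affine("o", z)]
        g = [math.tanh(v) for v in self._affine("g", z)]
        c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
        return h_new, c_new


def residual_two_layer_step(cell1, cell2, x, state):
    """One time step of a two-layer LSTM with a shortcut around both layers.

    Assumption: the shortcut carries the word embedding x to the output,
    so information about the previous word is not lost in the stack.
    """
    (h1, c1), (h2, c2) = state
    h1, c1 = cell1.step(x, h1, c1)
    h2, c2 = cell2.step(h1, h2, c2)
    out = [a + b for a, b in zip(h2, x)]  # residual mapping: F(x) + x
    return out, ((h1, c1), (h2, c2))


rng = random.Random(0)
size = 4  # embedding size == hidden size (assumed, so the sum is valid)
cell1 = TinyLSTMCell(size, size, rng)
cell2 = TinyLSTMCell(size, size, rng)
zeros = [0.0] * size
state = ((zeros, zeros), (zeros, zeros))
x = [0.5, -0.2, 0.1, 0.3]  # stand-in for a previous-word embedding
out, state = residual_two_layer_step(cell1, cell2, x, state)
```

Because the shortcut is an identity addition, the output always contains the input embedding unchanged, whatever the two LSTM layers learn; this is the sense in which a residual path can preserve information about previously generated words.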