Saliency-Based Spatiotemporal Attention for Video Captioning

Yangyu Chen, Weigang Zhang, Shuhui Wang, Liang Li, Qingming Huang
DOI: https://doi.org/10.1109/BigMM.2018.8499257
2018-01-01
Abstract: Most existing video captioning methods ignore the visual saliency information in videos. We hypothesize that saliency information can help generate more accurate video captions. We therefore propose a saliency-based spatiotemporal attention mechanism and integrate it into the encoder-decoder framework of the classical video captioning model. In particular, we design a residual block that uses the saliency information to extract the visual features of video frames. We evaluate our method on the MSVD dataset, and the results show that exploiting visual saliency information improves video captioning performance. Specifically, compared with the traditional temporal attention method, our saliency-based temporal attention model improves the METEOR and CIDEr metrics by 3.4% and 22.5%, respectively. Using the full saliency-based spatiotemporal attention mechanism, we can further improve METEOR and CIDEr by 4.5% and 23.1%, respectively.