Context Gating with Short Temporal Information for Video Captioning.

Jinlei Xu, Ting Xu, Xin Tian, Chunping Liu, Yi Ji
DOI: https://doi.org/10.1109/ijcnn.2019.8851897
2019-01-01
Abstract: Video captioning is a newly emerging task that automatically translates the content of a video into a textual description. As in image captioning, most existing methods simply use extracted visual features to generate sentences. In video captioning, however, temporal information is much more important for description, yet short temporal information (STI) is always ignored. Meanwhile, the context of the generated sentence does not seem to have been sufficiently mined. In this paper, we build a context gating mechanism with STI based on an encoder-decoder neural framework (CG-ED) for video captioning. In our approach, based on the 2D feature space, we cut and recombine the whole 3D features to extract STI by temporal distribution. Context gating is designed to balance the contributions of the different contexts of sentences. Our proposed model is evaluated on two large-scale datasets: Microsoft Research-Video to Text (MSR-VTT) and Microsoft Research Video Description Corpus (MSVD). Experimental results demonstrate that its captions are more precise than those of most state-of-the-art approaches.