TVT: Two-View Transformer Network for Video Captioning.

Ming Chen,Yingming Li,Zhongfei Zhang,Siyu Huang
2018-01-01
Abstract:Video captioning is a task of automatically generating the natural text description of a given video. There are two main challenges in video captioning under the context of an encoder-decoder framework: 1) How to model the sequential information; 2) How to combine the modalities including video and text. For challenge 1), the recurrent neural networks (RNNs) based methods are cur-rently the most common approaches for learning temporal representations of videos, while they suffer from a high computational cost. For challenge 2), the features of different modalities are often roughly concatenated together without insightful discussion. In this paper, we introduce a novel video captioning framework, i.e., Two-View Transformer (TVT). TVT comprises of a backbone of Transformer network for sequential representation and two types of fusion blocks in decoder layers for combining different modalities effectively. Empirical study shows that our TVT model outper-forms the state-of-the-art methods on the MSVD dataset and achieves a competitive performance on the MSR-VTT dataset under four common metrics.
What problem does this paper attempt to address?