VideoTRM: Pre-training for Video Captioning Challenge 2020

Jingwen Chen, Hongyang Chao
DOI: https://doi.org/10.1145/3394171.3416291
2020-01-01
Abstract: The Pre-training for Video Captioning Challenge 2020 focuses on developing video captioning systems by pre-training on the newly released large-scale Auto-captions on GIF dataset and then transferring the pre-trained model to the MSR-VTT benchmark. As part of our submission to this challenge, we propose a Transformer-based framework named VideoTRM, which consists of four modules: a textual encoder that encodes the linguistic relationships among words in the input sentence, a visual encoder that captures the temporal dynamics of the input video, a cross-modal encoder that models the interactions between the two modalities (i.e., textual and visual), and a decoder that generates the sentence conditioned on the input video and the previously generated words. Additionally, during fine-tuning we extend the decoder of VideoTRM with mesh-like connections and a gate fusion mechanism in multi-head attention, which respectively take advantage of multi-level visual features and bypass less informative attention results. In the evaluation on the test server, VideoTRM achieves superior performance and ranks second on the final leaderboard.
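The abstract does not give the formula for the gate fusion mechanism, so the following is only a minimal sketch of one common way such a gate can be written, not the authors' formulation. It assumes a sigmoid gate computed from the query and the attention result, which blends the two so that a gate value near zero passes the query through and bypasses the attention output. The names `gated_fusion`, `W_q`, `W_a`, and `bias` are hypothetical and chosen for illustration.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def gated_fusion(query, attended, W_q, W_a, bias):
    """Blend an attention result with its query through a learned gate.

    ASSUMPTION: this gate form is illustrative only; the source abstract
    does not specify how VideoTRM computes or applies its gate.
    """
    zq = matvec(W_q, query)
    za = matvec(W_a, attended)
    # One gate value per feature dimension, each in (0, 1).
    gate = [sigmoid(a + b + c) for a, b, c in zip(zq, za, bias)]
    # A gate near 1 keeps the attention result; a gate near 0 bypasses it
    # and lets the query pass through unchanged.
    fused = [g * att + (1.0 - g) * q for g, att, q in zip(gate, attended, query)]
    return fused, gate


# Toy usage with hand-picked (not learned) parameters.
zeros = [[0.0, 0.0], [0.0, 0.0]]
fused, gate = gated_fusion([1.0, 2.0], [3.0, 6.0], zeros, zeros, [0.0, 0.0])
print(gate, fused)  # gate is [0.5, 0.5], so fused is the midpoint [2.0, 4.0]
```

In a trained model the weights and bias would be learned, so the gate could suppress attention results per feature dimension rather than uniformly as in this toy call.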