XlanV Model with Adaptively Multi-Modality Feature Fusing for Video Captioning

Yiqing Huang, Qiuyu Cai, Siyu Xu, Jiansheng Chen
DOI: https://doi.org/10.1145/3394171.3416290
2020-01-01
Abstract: The dynamic features extracted by 3D convolutional networks and the static features extracted by CNNs have both been shown to benefit video captioning. We adaptively fuse these two kinds of features in the X-Linear Attention Network Video and propose the XlanV model for video captioning. However, we notice that the dynamic feature is not compatible with vision-language pre-training techniques when the frame length distribution and the average pixel difference differ between the training and test videos. Consequently, in this challenge we train the XlanV model directly on the MSR-VTT dataset, without pre-training on the GIF dataset. The proposed XlanV model takes 1st place in the pre-training for video captioning challenge, which shows that substantially exploiting the dynamic feature is more effective than vision-language pre-training in this challenge.
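The abstract does not say how the two feature types are fused. As a rough illustration only, one common way to fuse features adaptively is a learned sigmoid gate that weights the static and dynamic vectors per dimension. The sketch below assumes this gating scheme; the function name `gated_fusion` and the parameters `gate_weights` and `gate_bias` are hypothetical and are not taken from the paper.

```python
import math


def sigmoid(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(static_feat, dynamic_feat, gate_weights, gate_bias):
    """Fuse two equal-length feature vectors with a per-dimension gate.

    Assumed scheme (not from the paper): for each dimension i, a gate
    g_i = sigmoid(w_i . [static; dynamic] + b_i) is computed from the
    concatenated features, and the output is
    g_i * static_i + (1 - g_i) * dynamic_i.
    """
    if len(static_feat) != len(dynamic_feat):
        raise ValueError("features must have the same dimension")
    concat = list(static_feat) + list(dynamic_feat)
    fused = []
    for i, (s, d) in enumerate(zip(static_feat, dynamic_feat)):
        logit = sum(w * x for w, x in zip(gate_weights[i], concat)) + gate_bias[i]
        g = sigmoid(logit)
        fused.append(g * s + (1.0 - g) * d)
    return fused


if __name__ == "__main__":
    static = [1.0, 2.0]    # e.g. a frame-level CNN feature (toy values)
    dynamic = [3.0, 4.0]   # e.g. a 3D-convolution clip feature (toy values)
    weights = [[0.0] * 4 for _ in range(2)]  # zero weights: gate depends on bias only
    print(gated_fusion(static, dynamic, weights, [0.0, 0.0]))
```

With zero weights and zero bias every gate equals 0.5, so the output is the plain average of the two inputs; in a trained model the gate parameters would be learned, letting the network lean on whichever feature is more informative for each dimension.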