Self Attention Re-encoding and Linguistic Ability Preserving for Context-Aware Video Captioning
Mingrui Xiao,Yonglin Xue,Yue Zheng,Shu Yang,Yali Li,Shengjin Wang
DOI: https://doi.org/10.1109/cvidl62147.2024.10603832
2024-01-01
Abstract: Video captioning requires a model to have both strong visual comprehension and strong text generation ability. The feature extractors in existing model encoders are usually pretrained on recognition and detection tasks, whose data do not match video captioning data. Moreover, these methods often employ multiple 2D/3D feature extractors, which slows down inference. In addition, because captioning datasets are small, it is often difficult for the model to learn sufficient linguistic ability. In this paper, we propose a new video captioning method. Unlike previous approaches, we use only one visual transformer as the feature extractor and build the temporal relationships among the 2D frame-level features by re-encoding them. This simple encoder design greatly improves inference speed. At the same time, we use the generative pretrained language model GPT as our decoder and introduce the gra-former method, which effectively avoids the forgetting of pretrained knowledge caused by inserting cross-attention layers, so that rich language knowledge can be integrated into the captioning model. With this method, newly generated words are better able to draw cues from the preceding context. Experimental evaluation on two benchmarks, MSVD and MSR-VTT, shows that the proposed method achieves state-of-the-art performance. In addition, ablation studies and visualizations demonstrate the effectiveness of our approach.
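
The re-encoding idea mentioned in the abstract (relating 2D frame-level features along the time dimension through self-attention) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `self_attention_reencode`, the single attention head, the identity query/key/value projections, the residual connection, and the toy feature values are all assumptions made for illustration only.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_reencode(frames):
    """Re-encode per-frame features so each frame mixes in the other frames.

    frames: list of T feature vectors, each a list of D floats.
    Assumptions (not from the paper): one head, identity Q/K/V
    projections, and a residual connection around the attention output.
    """
    d = len(frames[0])
    scale = 1.0 / math.sqrt(d)
    reencoded = []
    for query in frames:
        # Scaled dot-product score between this frame and every frame.
        scores = [scale * sum(a * b for a, b in zip(query, key)) for key in frames]
        weights = softmax(scores)
        # Attention-weighted average of all frame features (temporal context).
        context = [sum(w * v[j] for w, v in zip(weights, frames)) for j in range(d)]
        # Residual: keep the original frame feature and add temporal context.
        reencoded.append([x + c for x, c in zip(query, context)])
    return reencoded


# Toy stand-in for features that a single frame-level extractor might produce.
frame_features = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]
reencoded_features = self_attention_reencode(frame_features)
```

In this sketch each output vector keeps its own frame's feature and adds a weighted mix of every frame, so information from other time steps reaches each position. A practical encoder would use learned projections, multiple heads, and positional information.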