Memory-Based Augmentation Network for Video Captioning
Shuaiqi Jing, Haonan Zhang, Pengpeng Zeng, Lianli Gao, Jingkuan Song, Heng Tao Shen
DOI: https://doi.org/10.1109/tmm.2023.3295098
IF: 7.3
2024-01-01
IEEE Transactions on Multimedia
Abstract: Video captioning aims to generate natural language descriptions of video content. Existing works mainly approach this multimodal learning task using paired source videos and their corresponding sentences, and have achieved competitive performance. Nonetheless, learning from video-description pairs alone cannot capture implicit external knowledge, i.e., the multiple visual contexts and linguistic clues present across the video-language dataset, which may limit the model's cognitive capability to generate diverse descriptions. To this end, we propose a Memory-based Augmentation Network (MAN), in which a neural memory structure augments the standard encoder-decoder framework by incorporating implicit external knowledge. Specifically, we first propose a visual memory for the encoder that stores multiple visual contexts across videos in the dataset and is used to obtain memory-augmented contextual features for the source video. In addition, a textual memory is introduced for the decoder to capture external language clues across sentences in the dataset; it is used to capture memory-augmented language features at each time step. Compared with the basic encoder-decoder framework, the proposed approach captures a more comprehensive contextual understanding, which is more compatible with the human cognitive process. Extensive experiments on three video captioning datasets, MSVD, MSR-VTT, and VATEX, demonstrate the effectiveness of the proposed method. The source code is available at https://github.com/zchoi/MAN.
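
The abstract does not specify how the memories are read, so the following is only a minimal sketch of one common way a neural memory can augment per-frame features: a scaled dot-product attention read over a bank of memory slots, with the read vectors concatenated to the original features. The function names (`memory_read`, `augment`), the slot count, the feature size, and the concatenation step are all illustrative assumptions and are not taken from the MAN paper or its repository.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def memory_read(queries, memory):
    """Attention read over memory slots (hypothetical helper).

    queries: (T, d) per-frame or per-step features
    memory:  (M, d) memory slots shared across the dataset
    returns: (T, d) read vectors and (T, M) attention weights
    """
    d = queries.shape[-1]
    scores = queries @ memory.T / np.sqrt(d)  # similarity of each query to each slot
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ memory, weights


def augment(features, memory):
    """Concatenate features with their memory reads (an assumed fusion choice)."""
    read, weights = memory_read(features, memory)
    return np.concatenate([features, read], axis=-1), weights


# Toy usage with random data: 8 frames, 16-dim features, 32 memory slots.
rng = np.random.default_rng(0)
frame_features = rng.standard_normal((8, 16))
visual_memory = rng.standard_normal((32, 16))
augmented, weights = augment(frame_features, visual_memory)
print(augmented.shape, weights.shape)
```

In a trained model the memory slots would be learned parameters rather than random values, and the same read could in principle be applied to decoder states at each time step for a textual memory; both points are assumptions about the general technique, not statements about MAN's implementation.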