Abstract: Video captioning aims to automatically generate a textual description of a video; it is a challenging task that has drawn attention recently. Although existing methods have achieved impressive performance, two problems remain to be solved. (1) In the feature encoding stage, existing methods focus on either local features or global features to improve the accuracy or readability of the generated sentences, which leaves useful information in the given video underutilized. (2) In the decoding stage, a vanilla Transformer is usually used to reason about visual relations and generate the captions; it does not make good use of inter-frame temporal information, which leads to relation ambiguity and poor readability in the generated captions. To solve these problems, we propose a video captioning method based on multi-scale features with temporal information guidance. First, the pre-trained model CLIP is employed to extract video features. Second, the global and local features are encoded separately to learn the overall and detailed information of the video and to construct multi-scale features. Finally, a gating unit is used to alleviate the problem that existing Transformer-based decoder modules cannot make good use of contextual temporal information. Extensive experiments on two publicly available datasets show that, compared with the best of the compared methods, the proposed model improves BLEU, METEOR, ROUGE-L, and CIDEr by 4.7%, 2.2%, 0.6%, and 2.0% on the MSR-VTT dataset and by 5.1%, 9.0%, 5.8%, and 6.7% on the MSVD dataset, respectively, which demonstrates that our method achieves competitive performance.
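The two ideas named in the abstract, multi-scale features and a gating unit, can be illustrated with a small sketch. The abstract does not give the actual architecture, so everything here is an assumption made for illustration: the global feature is taken as a mean over frame vectors, the gate is a plain element-wise sigmoid blend, and the function names (`mean_pool`, `multi_scale`, `gated_blend`) are hypothetical. In a trained model the gate logits would come from learned layers; here they are passed in directly.

```python
import math


def sigmoid(x):
    """Logistic function, maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def mean_pool(frame_feats):
    """Assumed global feature: element-wise mean over all frame vectors."""
    n = len(frame_feats)
    return [sum(col) / n for col in zip(*frame_feats)]


def multi_scale(frame_feats):
    """Prepend the pooled global vector to the per-frame (local) vectors."""
    return [mean_pool(frame_feats)] + [list(f) for f in frame_feats]


def gated_blend(state, context, gate_logits):
    """Element-wise gate g in (0, 1): g * state + (1 - g) * context.

    gate_logits are hypothetical inputs standing in for a learned projection.
    """
    out = []
    for s, c, z in zip(state, context, gate_logits):
        g = sigmoid(z)
        out.append(g * s + (1.0 - g) * c)
    return out


frames = [[0.0, 2.0], [2.0, 4.0]]          # two frames, two feature dimensions
features = multi_scale(frames)             # one global row followed by the local rows
blended = gated_blend([1.0, 1.0], [0.0, 0.0], [0.0, 0.0])  # zero logits give g = 0.5
```

With zero gate logits the blend weights the current state and the temporal context equally; large positive logits keep the current state, and large negative logits pass the context through instead.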

Global-Shared Text Representation Based Multi-Stage Fusion Transformer Network for Multi-Modal Dense Video Captioning

Fusion of Multi-Modal Features to Enhance Dense Video Caption

SBAT: Video Captioning with Sparse Boundary-Aware Transformer

Multi-scale features with temporal information guidance for video captioning

Event-centric multi-modal fusion method for dense video captioning

EVC-MF: End-to-end Video Captioning Network with Multi-scale Features

Video Captioning with Aggregated Features Based on Dual Graphs and Gated Fusion

Attention-based Visual-Audio Fusion for Video Caption Generation

Stacked Convolutional Deep Encoding Network for Video-Text Retrieval

FTAN: Exploring Frame-Text Attention for Lightweight Video Captioning

CMGNet: Collaborative multi-modal graph network for video captioning

Multimodal-enhanced hierarchical attention network for video captioning

MFVC: Urban Traffic Scene Video Caption Based on Multimodal Fusion

Video Captioning Using Global-Local Representation

Multimodal feature fusion based on object relation for video captioning

Multimodal Semantic Attention Network for Video Captioning

Dual-Stream Recurrent Neural Network for Video Captioning

Divided Caption Model with Global Attention

Video Captioning With Temporal And Region Graph Convolution Network

Learning Multimodal Attention LSTM Networks for Video Captioning

Fused GRU with Semantic-Temporal Attention for Video Captioning