Abstract: Video captioning aims to automatically generate a textual description of a video. It is a challenging task that has drawn considerable attention recently. Although existing methods have achieved impressive performance, two challenging problems remain. (1) In the feature encoding stage, existing methods focus on either local features or global features to improve the accuracy or readability of the generated sentences, so useful information in the given video is underutilized. (2) In the decoding stage, a vanilla Transformer is usually used to reason about visual relations when generating captions. It makes poor use of inter-frame temporal information, which leads to relation ambiguity and poor readability in the generated captions. To solve these problems, we propose a video captioning method based on multi-scale features with temporal information guidance. First, the pre-trained CLIP model is employed to extract video features. Second, the global and local features are encoded separately to learn the overall and detailed information of the video and to construct multi-scale features. Finally, a gating unit is used to alleviate the poor use of contextual temporal information in existing Transformer-based decoder modules. Extensive experiments on two publicly available datasets show that, compared with the best of the compared models, the proposed model improves the BLEU, METEOR, ROUGE-L, and CIDEr metrics by 4.7%, 2.2%, 0.6%, and 2.0%, respectively, on the MSR-VTT dataset, and by 5.1%, 9.0%, 5.8%, and 6.7%, respectively, on the MSVD dataset, which demonstrates that our method achieves more competitive performance.
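The abstract does not give the form of the multi-scale features or of the gating unit, so the following is only a minimal sketch of the general idea under stated assumptions. A global feature is taken as the mean of the per-frame features, and a sigmoid gate mixes a local (per-frame) feature with that global context. The function names `mean_pool` and `gated_fusion`, the element-wise (diagonal) gate weights, and the toy two-dimensional vectors standing in for CLIP embeddings are all hypothetical and are not taken from the paper.

```python
import math


def mean_pool(frames):
    """Average per-frame feature vectors into one global vector (an assumed pooling choice)."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def sigmoid(x):
    """Logistic function, used here as the gate activation."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(local, context, w_local, w_context, bias):
    """Element-wise gate: g = sigmoid(w_local*local + w_context*context + bias).

    Returns g*local + (1 - g)*context. Diagonal weights are a simplification;
    a learned model would normally use full weight matrices.
    """
    fused = []
    for x, c, wl, wc, b in zip(local, context, w_local, w_context, bias):
        g = sigmoid(wl * x + wc * c + b)
        fused.append(g * x + (1.0 - g) * c)
    return fused


# Toy frame features standing in for CLIP embeddings (hypothetical values).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
global_feat = mean_pool(frames)   # overall information about the clip
local_feat = frames[-1]           # detailed information from a single frame

# With all-zero gate parameters the gate is 0.5, so the result is the plain average.
zeros = [0.0, 0.0]
fused = gated_fusion(local_feat, global_feat, zeros, zeros, zeros)
```

In a trained model the gate parameters would be learned, letting the decoder decide per dimension how much contextual information to pass through at each step.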

Vatex Video Captioning Challenge 2020: Multi-View Features and Hybrid Reward Strategies for Video Captioning

Multi-modal Feature Fusion with Feature Attention for VATEX Captioning Challenge 2020

CLIP4Caption ++: Multi-CLIP for Video Caption

QAVidCap: Enhancing Video Captioning Through Question Answering Techniques

VaTeX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research

XlanV Model with Adaptively Multi-Modality Feature Fusing for Video Captioning

Enhanced Video Caption Generation Based on Multimodal Features

Video captioning based on vision transformer and reinforcement learning

Learning Video-Text Aligned Representations for Video Captioning

Delving Deeper into the Decoder for Video Captioning

Multi-scale features with temporal information guidance for video captioning

Non-Autoregressive Coarse-to-Fine Video Captioning

Multi-Modal interpretable automatic video captioning

Edit As You Wish: Video Caption Editing with Multi-grained User Control

The Solution for the CVPR2024 NICE Image Captioning Challenge

Distilling Vision-Language Models on Millions of Videos

Self Attention Re-encoding and Linguistic Ability Preserving for Context-Aware Video Captioning

Center-enhanced video captioning model with multimodal semantic alignment

Subject-Oriented Video Captioning

Pre-training for Video Captioning Challenge 2020 Summary

Integrating both Visual and Audio Cues for Enhanced Video Caption