Abstract: Generating consecutive descriptions for videos, that is, video captioning, requires taking full advantage of the visual representation throughout the generation process. Existing video captioning methods focus on exploring spatial-temporal representations and their relationships to produce inferences. However, such methods exploit only the superficial associations contained in a single video, without considering the intrinsic visual commonsense knowledge that exists across a video dataset, which may limit their ability to reason over that knowledge and produce accurate descriptions. To address this problem, we propose a simple yet effective method for video captioning, called the visual commonsense-aware representation network (VCRN). Specifically, we construct a Video Dictionary, a plug-and-play component obtained by clustering the video features of the entire dataset into multiple cluster centers without additional annotation. Each center implicitly represents a visual commonsense concept in the video domain, and is used by our proposed visual concept selection (VCS) component to obtain a video-related concept feature. Next, a concept-integrated generation (CIG) component is proposed to enhance caption generation. Extensive experiments on three public video captioning benchmarks, MSVD, MSR-VTT, and VATEX, demonstrate that our method achieves state-of-the-art performance, indicating its effectiveness. In addition, integrating our method into an existing video question answering (VideoQA) method improves that method's performance, which further demonstrates the generalization capability of our approach. The source code has been released at https://github.com/zchoi/VCRN.
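The Video Dictionary and concept selection steps described above can be illustrated with a minimal sketch. It is not the released VCRN implementation: the plain k-means loop, the cosine-similarity top-k selection, the softmax weighting, and all names (`build_video_dictionary`, `select_concepts`, `num_centers`, `top_k`) are assumptions made for illustration, since the abstract does not specify the clustering algorithm or the selection rule.

```python
import numpy as np


def build_video_dictionary(features, num_centers, num_iters=20, seed=0):
    """Cluster (N, D) video features into (num_centers, D) centers.

    Assumption: plain k-means (Lloyd's algorithm) stands in for whatever
    clustering the authors actually use.
    """
    rng = np.random.default_rng(seed)
    # Initialise centers from randomly chosen, distinct feature vectors.
    init = rng.choice(len(features), size=num_centers, replace=False)
    centers = features[init].copy()
    for _ in range(num_iters):
        # Squared Euclidean distance from every feature to every center: (N, K).
        diffs = features[:, None, :] - centers[None, :, :]
        dists = (diffs ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for k in range(num_centers):
            members = features[assign == k]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[k] = members.mean(axis=0)
    return centers


def select_concepts(video_feature, dictionary, top_k=3):
    """Return a concept feature and the indices of the selected centers.

    Assumption: the top_k centers by cosine similarity are mixed with
    softmax weights; the paper's VCS component may differ.
    """
    v = video_feature / np.linalg.norm(video_feature)
    d = dictionary / np.linalg.norm(dictionary, axis=1, keepdims=True)
    sims = d @ v  # cosine similarity to each center, shape (K,)
    top = np.argsort(sims)[-top_k:]  # indices of the top_k most similar centers
    weights = np.exp(sims[top] - sims[top].max())
    weights /= weights.sum()
    return weights @ dictionary[top], top


# Toy usage with random stand-in features (200 videos, 16-dim features).
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 16))
dictionary = build_video_dictionary(features, num_centers=8)
concept_feature, selected = select_concepts(features[0], dictionary, top_k=3)
```

Because the dictionary is built once from unlabeled features and queried by similarity, a component like this can be attached to an existing captioning or VideoQA model without extra annotation, which is the plug-and-play property the abstract claims.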
