Abstract:In the video captioning methods based on an encoder-decoder, limited visual features are extracted by an encoder, and a natural sentence of the video content is generated using a decoder. However, this kind of method is dependent on a single video input source and few visual labels, and there is a problem with semantic alignment between video contents and generated natural sentences, which are not suitable for accurately comprehending and describing the video contents. To address this issue, this paper proposes a video captioning method by semantic topic-guided generation. First, a 3D convolutional neural network is utilized to extract the spatiotemporal features of videos during the encoding. Then, the semantic topics of video data are extracted using the visual labels retrieved from similar video data. In the decoding, a decoder is constructed by combining a novel Enhance-TopK sampling algorithm with a Generative Pre-trained Transformer-2 deep neural network, which decreases the influence of “deviation” in the semantic mapping process between videos and texts by jointly decoding a baseline and semantic topics of video contents. During this process, the designed Enhance-TopK sampling algorithm can alleviate a long-tail problem by dynamically adjusting the probability distribution of the predicted words. Finally, the experiments are conducted on two publicly used Microsoft Research Video Description and Microsoft Research-Video to Text datasets. The experimental results demonstrate that the proposed method outperforms several state-of-art approaches. Specifically, the performance indicators Bilingual Evaluation Understudy, Metric for Evaluation of Translation with Explicit Ordering, Recall Oriented Understudy for Gisting Evaluation-longest common subsequence, and Consensus-based Image Description Evaluation of the proposed method are improved by 1.2%, 0.1%, 0.3%, and 2.4% on the Microsoft Research Video Description dataset, and 0.1%, 1.0%, 0.1%, and 2.8% on the Microsoft Research-Video to Text dataset, respectively, compared with the existing video captioning methods. As a result, the proposed method can generate video captioning that is more closely aligned with human natural language expression habits.

Extraction method of semantic information of video images

Semantic Video Segmentation Based on Region Boundary Motion Information

Video content description method based on semantic information guidance

Video Captioning with Transferred Semantic Attributes.

Jointly Modeling Embedding and Translation to Bridge Video and Language

A Video Captioning Method by Semantic Topic-Guided Generation

Bidirectional Long-Short Term Memory for Video Description

Video Captioning With Attention-Based LSTM and Semantic Consistency

A Statistics-Based Method For Video Semantic Analysis

Semantic Communication Approach for Multi-Task Image Transmission

Video structural description technology for the new generation video surveillance systems

Semantic Content Mining Approach in Video Based on Vector Space Model

A New Ontology Convolutional Neural Network for Extorting Essential Elements in Video Mining

Exploiting long-term temporal dynamics for video captioning

Semantic Feature Mining for Video Event Understanding.

CC-LSTM: Cross and Conditional Long-Short Time Memory for Video Captioning

VMSG: a video caption network based on multimodal semantic grouping and semantic attention

Research on Semantic Extraction of Online Mathematics Guidance Videos in Colleges and Universities Based on Intelligent Information Technology

Motion Guided Region Message Passing for Video Captioning

Multi-level video captioning method based on semantic space

Long Short Term Memory Recurrent Neural Network Based Encoding Method for Emotion Recognition in Video.