Abstract: Video captioning aims to identify multiple objects and their behaviours in a video event and to generate captions for the current scene. The task is to produce a detailed natural-language description of the current video in real time, which requires deep learning to analyze and determine the relationships between objects of interest in the frame sequence. In practice, existing methods typically detect objects in the frame sequence and then generate captions from features extracted at the locations covered by those objects. The quality of the generated captions therefore depends heavily on the performance of object detection and identification. This work proposes an advanced video captioning approach that works adaptively and effectively addresses the interdependence between event proposals and captions. Additionally, an attention‐based multimodal framework is introduced to capture the main context from the frames and sound in the video scene. An intermediate model is also presented to collect the hidden states captured from the input sequence; it extracts the main features and implicitly produces multiple event proposals. For caption prediction, the proposed method employs the CARU layer with attention as the primary RNN layer for decoding. Experimental results show that the proposed work improves on the baseline method and achieves better performance than other state‐of‐the‐art models on the ActivityNet dataset, presenting competitive results in video captioning.
