Abstract: Video captioning aims to identify multiple objects and their behaviours in a video event and to generate a detailed natural‐language description of the current scene in real time. This requires deep learning to analyze and determine the relationships between objects of interest in the frame sequence. In practice, existing methods typically detect objects in the frame sequence and then generate captions from features extracted at the detected object locations, so the quality of the generated captions depends heavily on the performance of object detection and identification. This work proposes an advanced video captioning approach that works adaptively and effectively addresses the interdependence between event proposals and captions. Additionally, an attention‐based multimodal framework is introduced to capture the main context from the frames and sound of the video scene. An intermediate model is also presented that collects the hidden states captured from the input sequence, extracts the main features, and implicitly produces multiple event proposals. For caption prediction, the proposed method employs a CARU layer with attention as the primary RNN layer for decoding. Experimental results on the ActivityNet dataset show that the proposed method improves on the baseline and outperforms other state‐of‐the‐art models, presenting competitive results in video captioning.
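
To make the attention‐based multimodal idea more concrete, the following is a minimal sketch, not the authors' implementation. It assumes plain dot‐product attention over hidden states collected from a frame encoder and an audio encoder, and it fuses the two attended contexts by concatenation. The function names (`attend`, `fuse_modalities`), the toy vectors, and the choice of concatenation are all hypothetical. The CARU decoder itself is not modelled here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, states):
    """Dot-product attention: weight each hidden state by its similarity to the query."""
    scores = [sum(q * h for q, h in zip(query, state)) for state in states]
    weights = softmax(scores)
    dim = len(states[0])
    context = [
        sum(w * state[i] for w, state in zip(weights, states))
        for i in range(dim)
    ]
    return context, weights


def fuse_modalities(query, frame_states, audio_states):
    """Attend over each modality separately, then concatenate the two contexts.

    Concatenation is an assumption made for illustration only.
    """
    frame_ctx, frame_w = attend(query, frame_states)
    audio_ctx, audio_w = attend(query, audio_states)
    return frame_ctx + audio_ctx, frame_w, audio_w


# Toy hidden states (hypothetical values): three frame steps, two audio steps.
frame_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio_states = [[0.5, 0.5], [0.0, 2.0]]
decoder_query = [1.0, 0.0]  # stands in for the decoder's current hidden state

fused, frame_w, audio_w = fuse_modalities(decoder_query, frame_states, audio_states)
print(len(fused), frame_w, audio_w)
```

In this sketch the decoder state acts as the query, so the frame and audio steps most similar to it receive the largest weights. A decoder layer would then consume `fused` at each step when predicting the next caption token.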

ACTUAL: Audio Captioning with Caption Feature Space Regularization.

Caption Feature Space Regularization for Audio Captioning

Audio Difference Learning for Audio Captioning

Adaptive Curriculum Learning for Video Captioning.

QAVidCap: Enhancing Video Captioning Through Question Answering Techniques

Zero-Shot Audio Captioning Using Soft and Hard Prompts

Exploring the Role of Audio in Video Captioning

Diverse Audio Captioning via Adversarial Training

Distractors-Immune Representation Learning with Cross-modal Contrastive Regularization for Change Captioning

Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention

Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation

Exploiting Auxiliary Caption for Video Grounding

Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding

Learning Video-Text Aligned Representations for Video Captioning

ControlCap: Controllable Region-level Captioning

Local feature‐based video captioning with multiple classifier and CARU‐attention

Synth-AC: Enhancing Audio Captioning with Synthetic Supervision

Seeing and Hearing Too: Audio Representation for Video Captioning.

Improving the Performance of Automated Audio Captioning via Integrating the Acoustic and Semantic Information

Beyond the Status Quo: A Contemporary Survey of Advances and Challenges in Audio Captioning

ALCAP: Alignment-Augmented Music Captioner