Abstract: Video description generates natural language sentences that describe the subjects, verbs, and objects in a target video. It has been used to help visually impaired people understand video content, and it plays an essential role in developing human-robot interaction. Dense video description is more difficult than simple video captioning because of object interactions and overlapping events. Deep learning is reshaping computer vision (CV) and natural language processing (NLP), and hundreds of deep learning models, datasets, and evaluation methods are available that could help close gaps in current research. This article addresses that gap by evaluating state-of-the-art approaches, focusing on deep learning and machine learning for video captioning in dense environments. It reviews classic machine learning techniques, presents deep learning models, and details benchmark datasets with their respective domains. It also reviews evaluation metrics, including Bilingual Evaluation Understudy (BLEU), Metric for Evaluation of Translation with Explicit Ordering (METEOR), Word Mover’s Distance (WMD), and Recall-Oriented Understudy for Gisting Evaluation (ROUGE), with their pros and cons. Finally, the article lists future directions and proposes work on context enhancement using key scene extraction with object detection in particular frames, in particular how to improve the context of a video description by detecting key frames through morphological image analysis. Additionally, the paper discusses a novel approach involving sentence reconstruction and context improvement through key frame object detection, which incorporates the fusion of large language models for refining results.
The final results come from enhancing the text generated by the proposed model: the predicted text is improved, and objects are isolated using various key frames. These key frames identify dense events occurring in the video sequence.

Trends in Event Understanding and Caption Generation/Reconstruction in Dense Video: A Review
