Abstract: Video description refers to understanding visual content and transforming that understanding into automatic textual narration. It bridges two key AI fields, computer vision and natural language processing, and has real-time, practical applications. Deep learning-based approaches to video description have demonstrated better results than conventional approaches. The current literature lacks a thorough interpretation of the recently developed sequence-to-sequence techniques for video description. This paper fills that gap by focusing mainly on deep learning-enabled approaches to automatic caption generation. Sequence-to-sequence models follow an Encoder–Decoder architecture that employs a specific composition of CNN, RNN, or the RNN variants LSTM and GRU as the encoder and decoder blocks. This standard architecture can be fused with an attention mechanism that focuses on specific distinctive features, achieving high-quality results. Reinforcement learning employed within the Encoder–Decoder structure can progressively improve captions toward the state of the art by following exploration and exploitation strategies. The transformer is a modern and efficient transductive architecture for robust output. Free from recurrence and based solely on self-attention, it allows parallelization along with training on massive amounts of data, and it can make full use of the available GPUs for most NLP tasks. With the recent emergence of several transformer variants, long-term dependency handling is no longer a major issue. Researchers engaged in video processing for summarization and description, or for autonomous-vehicle, surveillance, and instructional purposes, can find promising directions in this research.
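
The recurrence-free self-attention mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not taken from the paper: the function name `self_attention`, the random projection matrices, and the toy dimensions (6 frames, 16-dimensional features) are assumptions chosen only for illustration. It shows why the computation parallelizes: every frame attends to every other frame through matrix products, with no step-by-step recurrence.

```python
import numpy as np

def self_attention(frames, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence of frame features.

    frames: (num_frames, feat_dim) array, one feature vector per video frame.
    w_q, w_k, w_v: (feat_dim, attn_dim) projection matrices (hypothetical,
    randomly initialised below; a trained model would learn them).
    """
    q = frames @ w_q  # queries, one per frame
    k = frames @ w_k  # keys, one per frame
    v = frames @ w_v  # values, one per frame

    # Pairwise similarity between all frames, computed in one matrix product.
    scores = q @ k.T / np.sqrt(k.shape[-1])

    # Row-wise softmax (shifted by the row maximum for numerical stability).
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)

    # Each output row is a weighted mix of the value vectors of all frames.
    return weights @ v, weights

# Toy input: 6 frames with 16-dimensional features (assumed sizes).
rng = np.random.default_rng(0)
num_frames, feat_dim, attn_dim = 6, 16, 8
frames = rng.normal(size=(num_frames, feat_dim))
w_q, w_k, w_v = (rng.normal(size=(feat_dim, attn_dim)) for _ in range(3))

context, weights = self_attention(frames, w_q, w_k, w_v)
```

Here `context` holds one attended vector per frame and `weights` holds the frame-to-frame attention distribution. A full transformer would add multiple heads, positional encodings, and feed-forward layers, none of which are shown in this sketch.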

Deep Hierarchical Attention Network for Video Description

A Video Description Model with Improved Attention Mechanism

Fully Convolutional Video Captioning with Coarse-to-Fine and Inherited Attention

Multimodal-enhanced hierarchical attention network for video captioning

Hierarchical LSTM with Adjusted Temporal Attention for Video Captioning

Bidirectional Long-Short Term Memory for Video Description

Syntax-Guided Hierarchical Attention Network for Video Captioning

Hierarchical LSTMs with Adaptive Attention for Visual Captioning

A novel Multi-Layer Attention Framework for visual description prediction using bidirectional LSTM

Hierarchical Global-Local Temporal Modeling for Video Captioning

Stacked Convolutional Deep Encoding Network for Video-Text Retrieval

Video description: A comprehensive survey of deep learning approaches

Hierarchical Attention Based Spatial-Temporal Graph-to-Sequence Learning for Grounded Video Description

Hierarchical Boundary-Aware Neural Encoder for Video Captioning

Collaborative Detection and Caption Network

Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning

Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks

Multi-attention mechanism for Chinese description of videos

Spatio-Temporal Ranked-Attention Networks for Video Captioning

Attention-Based Multimodal Fusion for Video Description

Learning Multiscale Hierarchical Attention for Video Summarization