Abstract: Video captioning is the automated generation of natural language phrases that describe the contents of video frames. Owing to the strong performance of deep learning in computer vision and natural language processing in recent years, research in this field has grown rapidly. Numerous approaches, datasets, and evaluation metrics have been introduced in the literature, calling for a systematic survey to guide research efforts in this direction. Through statistical analysis, this survey focuses on state-of-the-art approaches with an emphasis on deep learning models, assesses benchmark datasets along several parameters, and classifies the pros and cons of the various evaluation metrics based on previous work in the deep learning field. It identifies the most used neural network variants for visual and spatio-temporal feature extraction as well as for language generation. The results show that ResNet and VGG are the most used visual feature extractors and that 3D convolutional neural networks are the most used spatio-temporal feature extractors. Long Short-Term Memory (LSTM) has been the main language model, although the Gated Recurrent Unit (GRU) and the Transformer are slowly replacing it. Regarding dataset usage, MSVD and MSR-VTT remain dominant so far, since many captioning models report their outstanding results on them. From 2015 to 2020, across all major datasets, models such as Inception-ResNet-v2 + C3D + LSTM, ResNet-101 + I3D + Transformer, and ResNet-152 + ResNeXt-101 (R3D) + (LSTM, GAN) have achieved by far the best results in video captioning. Despite rapid advancement, our survey reveals that video captioning research still has considerable room to develop in harnessing the full potential of deep learning for classifying and captioning a large number of activities, as well as in creating large datasets that cover diverse training video samples.
