Abstract: Video captioning is the automated generation of natural language phrases that describe the contents of video frames. Owing to the strong performance of deep learning in computer vision and natural language processing in recent years, research in this field has grown exponentially over the past decades. Numerous approaches, datasets, and evaluation metrics have been introduced in the literature, calling for a systematic survey to guide research efforts in this exciting new direction. Through statistical analysis of previous deep learning work, this survey focuses on state-of-the-art approaches, with emphasis on deep learning models; compares benchmark datasets along several parameters; and weighs the pros and cons of the various evaluation metrics. It identifies the most frequently used neural network variants for visual feature extraction, spatio-temporal feature extraction, and language generation. The results show that ResNet and VGG are the most used visual feature extractors and that 3D convolutional neural networks are the most used spatio-temporal feature extractors. Long Short-Term Memory (LSTM) has been the main language model, although the Gated Recurrent Unit (GRU) and the Transformer are gradually replacing it. In terms of dataset usage, MSVD and MSR-VTT remain dominant, as they underlie many of the strongest results reported across captioning models. From 2015 to 2020, across all major datasets, models such as Inception-ResNet-v2 + C3D + LSTM, ResNet-101 + I3D + Transformer, and ResNet-152 + ResNeXt-101 (R3D) + (LSTM, GAN) achieved by far the best results in video captioning. Despite rapid progress, our survey reveals that video captioning research still has considerable room to develop, both in exploiting the full potential of deep learning for classifying and captioning a large number of activities and in creating large datasets that cover diverse training video samples.
