Video captioning – a survey
J. Vaishnavi, V. Narmatha
DOI: https://doi.org/10.1007/s11042-024-18886-6
IF: 2.577
2024-04-10
Multimedia Tools and Applications
Abstract: The rapid growth of work combining computer vision (CV) and natural language processing (NLP) is playing a vital role in making the everyday world more technology-driven. NLP and CV are at the vanguard of artificial intelligence, with enormous potential across a growing number of fields. Active research areas for the combination of NLP and CV include cancer screening, surgical simulation, visual property description, visual retrieval, and visual description. Video captioning is the task of localizing an event, extracting various features from that event, and then describing the entire event in natural language. Techniques for video captioning have evolved from traditional methods to deep learning, which currently dominates the field. Several models have been proposed to improve video captioning, most built on the default encoder-decoder framework and some using variants of the attention mechanism, transfer learning, and similar techniques. The performance of each technique and the accuracy of the generated results depend heavily on the nature of the problem, the diversity and performance of the chosen dataset, the construction of the technique with various layers, and the amount of data split for training, validation, and testing. This survey discusses captioning types, real-time applications of video captioning, the evolution of methods in video captioning, advanced techniques, different datasets, the metrics used to evaluate results, and an analysis of the results of existing models for video captioning.
computer science, information systems, theory & methods, engineering, electrical & electronic, software engineering