A comprehensive review of image caption generation
Dadure, Pankaj
DOI: https://doi.org/10.1007/s11042-024-20095-0
IF: 2.577
2024-10-13
Multimedia Tools and Applications
Abstract: Image caption generation is the process of producing textual descriptions of images by combining natural language processing and computer vision. This review covers the field's core components, recent advancements, and prospects, offering a complete examination of its present state. It highlights the deep learning techniques used for image caption generation, including attention mechanisms, long short-term memory (LSTM) networks, recurrent neural networks, convolutional neural networks, and encoder-decoder architectures. It also walks through the complete working process of image caption generation and discusses training strategies such as masked language modeling, reinforcement learning, and cross-entropy loss. The paper emphasizes the importance of datasets, evaluation metrics, and optimization techniques for training image captioning models. Highlighting practical applications in healthcare, autonomous vehicles, and entertainment, the review underlines the broad-ranging implications of image caption generation. It explores future directions such as multimodal data integration and advances in unsupervised learning, and addresses challenges such as captioning complex scenes and handling diverse languages. This overview serves as a resource for researchers, practitioners, and enthusiasts in the field.
computer science, information systems, theory & methods, engineering, electrical & electronic, software engineering