Domain-specific image captioning: a comprehensive review

Himanshu Sharma,Devanand Padha
DOI: https://doi.org/10.1007/s13735-024-00328-6
2024-04-20
International Journal of Multimedia Information Retrieval
Abstract:An image caption is a sentence summarizing the semantic details of an image. It is a blended application of computer vision and natural language processing. The earlier research addressed this domain using machine learning approaches by modeling image captioning frameworks using hand-engineered feature extraction techniques. With the resurgence of deep-learning approaches, the development of improved and efficient image captioning frameworks is on the rise. Image captioning is witnessing tremendous growth in various domains as medical, remote sensing, security, visual assistance, and multimodal search engines. In this survey, we comprehensively study the image captioning frameworks based on our proposed domain-specific taxonomy. We explore the benchmark datasets and metrics leveraged for training and evaluating image captioning models in various application domains. In addition, we also perform a comparative analysis of the reviewed models. Natural image captioning, medical image captioning, and remote sensing image captioning are currently among the most prominent application domains of image captioning. The efficacy of real-time image captioning is a challenging obstacle limiting its implementation in sensitive areas such as visual aid, remote security, and healthcare. Further challenges include the scarcity of rich domain-specific datasets, training complexity, evaluation difficulty, and a deficiency of cross-domain knowledge transfer techniques. Despite the significant contributions made, there is a need for additional efforts to develop steadfast and influential image captioning models.
computer science, artificial intelligence, software engineering
What problem does this paper attempt to address?