Image Captioning Based on Scene Graphs: A Survey
Junhua Jia, Xiangqian Ding, Shunpeng Pang, Xiaoyan Gao, Xiaowei Xin, Ruotong Hu, Jie Nie
DOI: https://doi.org/10.1016/j.eswa.2023.120698
2022-01-01
SSRN Electronic Journal
Abstract: Although recent developments in deep learning have brought several tasks closer to human performance, a significant gap remains between human and machine performance in certain image captioning tasks. Image captioning is the process of generating a textual description of an image. It requires recognizing the main regions of an image, their attributes, and their relationships, and it aims to produce descriptions that are syntactically and semantically correct. For simple image descriptions, deep learning-based techniques perform well within their constraints. However, constructing sentences for complicated scenes with many entities and relationships remains challenging, in particular addressing diversity, anchoring, and controllability concurrently, which is a seemingly simple ability for humans. Scene graphs can significantly alleviate this problem by fully mining spatial and semantic information. Despite promising results, the existing findings are fragmented and do not form a systematic comparative overview. In this survey, we provide a comprehensive overview of the available scene-graph-based image captioning methods. We discuss the foundations of these techniques and examine their performance, strengths, and constraints. Furthermore, we compare the state-of-the-art methods and review the datasets and commonly used evaluation measures. Finally, we conclude the survey with an in-depth discussion of present and future research challenges. This study will help readers understand how scene graphs can be applied to image captioning.