Mining core information by evaluating semantic importance for unpaired image captioning

Jiahui Wei, Zhixin Li, Canlong Zhang, Huifang Ma
DOI: https://doi.org/10.1016/j.neunet.2024.106519
2024-07-09
Abstract: Recently, supervised image captioning has made exciting progress. However, manually annotated image-sentence pairs are difficult and expensive to obtain, so unpaired image captioning has become an emerging challenge. This paper proposes Mining Core Information by Evaluating Semantic Importance (MCIESI), a method for image captioning that uses unpaired images and sentences. Unlike existing methods, MCIESI focuses on mining the information in the image that should be described and embodying it in generated natural language that conforms to human thinking. To achieve this goal, we use scene graphs to represent the semantics of images and evaluate the importance of objects and interaction relationships to mine the core information in images, which a semantic constraint then encourages the generated sentences to embody. Combined with a grammatical constraint, imposed through adversarial training against a corpus of real sentences, and a relative constraint, imposed through a triplet loss, the generator is trained to generate semantically plausible and grammatically correct sentences. Extensive experiments verify the effectiveness of MCIESI.
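The abstract names a triplet loss for the relative constraint but does not specify how it is formulated. As a rough illustration only, the generic margin-based form can be sketched as follows. The choice of anchor, positive, and negative (an image embedding, a matching sentence embedding, and a mismatched one), the Euclidean distance, the margin value, and all function and variable names are assumptions made for this sketch and are not taken from the paper.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Generic margin-based triplet loss (hypothetical helper).

    The loss is zero once the anchor is closer to the positive than to
    the negative by at least `margin`; otherwise it grows linearly.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Toy embeddings, assumed purely for illustration.
image = [1.0, 0.0]
matching_sentence = [1.0, 0.0]
mismatched_sentence = [0.0, 1.0]

# Well-separated triplet: the margin is already satisfied.
loss_easy = triplet_loss(image, matching_sentence, mismatched_sentence)

# Swapped roles: the anchor sits on the negative, so the loss is positive.
loss_hard = triplet_loss(image, mismatched_sentence, matching_sentence)
```

Minimizing a loss of this shape pulls the anchor toward the positive and pushes it away from the negative, which is one common way to express a relative preference between candidates.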
What problem does this paper attempt to address?