Object Semantic Analysis for Image Captioning

Sen Du, Hong Zhu, Guangfeng Lin, Dong Wang, Jing Shi, Jing Wang
DOI: https://doi.org/10.1007/s11042-023-14596-7
IF: 2.577
2023-01-01
Multimedia Tools and Applications
Abstract: Although existing image captioning models can produce sentences through attention mechanisms and recurrent neural networks, they struggle to generate multiple sentences that describe different important objects. Most image captioning models lack description diversity, whereas diversity-oriented models often describe unimportant objects, resulting in low accuracy. In this paper, we propose a novel approach to balancing accuracy and diversity. To achieve this, we design a model that combines saliency information and objects’ relative position information to assess the semantic importance of all detected objects. By maintaining the features of important objects and operating on the features of unimportant objects so that the network describes the important ones, our model can generate sentences with more diversity or more accuracy. Experiments on the MSCOCO and Flickr 30K datasets demonstrate the characteristics of our model. On these datasets, our model can provide a set of accurate or diverse descriptions. Compared with state-of-the-art models on standard captioning metrics and human evaluation metrics, our model generates more diverse or more accurate sentences.
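The importance-scoring idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's actual formula, so everything below is an assumption made for illustration: the names `DetectedObject`, `mean_saliency`, `centrality`, and `rank_objects` are hypothetical, the position cue is approximated by closeness to the image centre, and the weight `alpha` is an arbitrary mixing parameter.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class DetectedObject:
    label: str
    box: tuple  # (x0, y0, x1, y1) in pixels; x1 and y1 are exclusive


def mean_saliency(saliency, box):
    """Average saliency value over the pixels inside the box."""
    x0, y0, x1, y1 = box
    values = [saliency[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    return sum(values) / len(values) if values else 0.0


def centrality(box, width, height):
    """Assumed position cue: 1.0 at the image centre, 0.0 at a corner."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    dist = hypot(cx - width / 2, cy - height / 2)
    max_dist = hypot(width / 2, height / 2)
    return 1.0 - dist / max_dist


def rank_objects(objects, saliency, alpha=0.5):
    """Return (score, label) pairs sorted from most to least important.

    The score is a hypothetical weighted sum of the two cues, not the
    paper's formulation.
    """
    height, width = len(saliency), len(saliency[0])
    scored = [
        (
            alpha * mean_saliency(saliency, obj.box)
            + (1 - alpha) * centrality(obj.box, width, height),
            obj.label,
        )
        for obj in objects
    ]
    return sorted(scored, reverse=True)
```

Under these assumptions, objects at the top of the ranking would be the ones whose features are kept intact, while lower-ranked objects are candidates for the feature operations the abstract mentions.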