Zero-Shot Image Caption Inference System Based on Pretrained Models
Xiaochen Zhang, Jiayi Shen, Yuyan Wang, Jiacong Xiao, Jin Li
DOI: https://doi.org/10.3390/electronics13193854
IF: 2.9
2024-09-29
Electronics
Abstract: Zero-shot image captioning (ZSIC) has recently gained significant attention because of its potential to describe unseen objects in images. This is important for real-world applications such as human–computer interaction, intelligent education, and service robots. However, zero-shot image captioning methods based on large-scale pretrained models may generate descriptions containing objects that are not present in the image, a phenomenon termed "object hallucination". This occurs because large-scale models tend to predict words or phrases that appeared with high frequency during training. Additionally, such methods set a limit on the description length, which often leads to an improper ending. In this paper, a novel approach is proposed to reduce the object hallucination and improper ending problems in the ZSIC task. We introduce additional emotion signals as guidance for sentence generation, and we find that a proper emotion signal filters out words that do not appear in the image. Moreover, we propose a novel strategy that gradually extends the number of words in a sentence to ensure that the generated sentence is properly completed. Experimental results show that the proposed method achieves leading performance on unsupervised metrics. More importantly, the subjective examples illustrate the effect of our method in reducing hallucination and generating sentences that end properly.
engineering, electrical & electronic; computer science, information systems; physics, applied