Pseudo Content Hallucination for Unpaired Image Captioning

Huixia Ben, Shuo Wang, Meng Wang, Richang Hong
DOI: https://doi.org/10.1145/3652583.3658080
2024-01-01
Abstract: Unpaired Image Captioning (UIC) aims to describe an image without relying on matched vision-language training data. It is a challenging task because (1) the implicit, unpaired nature of the vision-language training data limits the captioning model's ability to represent diverse scenes, and (2) it is difficult for the captioning model to discern the intrinsic relationships among objects, potentially leading to misinterpretation of the image content. To address these issues, we propose pseudo content hallucination (PCH) to help the captioning model broaden its perception of objects and capture the relations between them. Specifically, we select similar objects from different images as pseudo content and then hallucinate new visual content for training. This hallucinated content depicts a similar scene but with a different representation, thus enriching the diversity of the training samples. Meanwhile, we use the relationships among these objects to improve the generated captions as a textual content hallucination and construct pseudo image-sentence pairs to refine the captioning model. These hallucinated sentences benefit the captioning model by enabling it to capture additional semantics from the image, ultimately enhancing its sentence generation ability. Extensive experiments on two benchmarks, MSCOCO and Flickr30k, show the effectiveness of our method. The results show a significant improvement over the baseline on the MSCOCO dataset, with a 1.5 increase in the CIDEr score.