Divcon: Learning Concept Sequences for Semantically Diverse Image Captioning

Yue Zheng, Ya-Li Li, Shengjin Wang
DOI: https://doi.org/10.1109/ICASSP49357.2023.10094565
2023-01-01
Abstract: Human-generated image captions contain diverse semantic concepts, but producing such diversity remains difficult for machines. The frequency distribution of semantic concepts in datasets is usually extremely imbalanced, which leads models to repeatedly describe frequently occurring concepts and reduces semantic diversity. In this paper, we propose DivCon, a novel two-step method for diverse image captioning that generates descriptions with more diverse semantic concepts. First, we develop a concept sequence generator that auto-regressively generates concept sequences. This benefits the model by decoding sequences in a small search space. Then, a sentence generator takes the concept sequences as input and generates a description for each sequence. Experiments show that DivCon generates captions containing diverse semantic concepts and pays more attention to less frequent concepts. On the diverse image captioning task, DivCon achieves state-of-the-art results on the MSCOCO dataset, with oracle CIDEr and SPICE scores of 1.684 and 0.302.
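The two-step data flow described in the abstract can be sketched as follows. This is only an illustration of the pipeline shape (concept sequences first, then one caption per sequence), not the authors' implementation: the function names `two_step_captions`, `toy_concept_generator`, and `toy_sentence_generator` are hypothetical, and the toy stand-ins replace the learned models with a fixed list and a string template.

```python
from typing import Callable, List, Sequence

ConceptSequence = Sequence[str]


def two_step_captions(
    image: object,
    concept_generator: Callable[[object, int], List[ConceptSequence]],
    sentence_generator: Callable[[object, ConceptSequence], str],
    num_sequences: int,
) -> List[str]:
    """Step 1: propose concept sequences. Step 2: one caption per sequence."""
    sequences = concept_generator(image, num_sequences)
    return [sentence_generator(image, seq) for seq in sequences]


# Toy stand-ins (assumptions for illustration, NOT the paper's models).
def toy_concept_generator(image: object, k: int) -> List[ConceptSequence]:
    # A real generator would decode concept sequences auto-regressively
    # from image features; here we return the first k of a fixed pool.
    pool = [("dog", "frisbee", "grass"), ("dog", "park"), ("frisbee", "air")]
    return pool[:k]


def toy_sentence_generator(image: object, concepts: ConceptSequence) -> str:
    # A real generator would condition on the image and the concepts;
    # here we fill a fixed template.
    return "a photo with " + ", ".join(concepts)


captions = two_step_captions(
    None, toy_concept_generator, toy_sentence_generator, num_sequences=2
)
```

Because each caption is conditioned on a different concept sequence, the diversity of the caption set in this sketch is determined entirely by the diversity of the sequences produced in the first step.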