Variational Joint Self‐attention for Image Captioning

Xiangjun Shao, Zhenglong Xiang, Yuanxiang Li, Mingjie Zhang
DOI: https://doi.org/10.1049/ipr2.12470
IF: 2.3
2022-01-01
IET Image Processing
Abstract: The image captioning task has attracted considerable attention from researchers, and significant progress has been made in the past few years. Existing image captioning models, which mainly apply an attention-based encoder-decoder architecture, have driven much of this progress. These attention-based models, however, are limited in caption generation by potential errors arising from inaccurate object detection and incorrect attention to objects. To alleviate this limitation, a Variational Joint Self-Attention model (VJSA) is proposed to learn a latent semantic alignment between a given image and its label description to guide better image captioning. Unlike existing image captioning models, VJSA first uses a self-attention module to encode intra-sequence and inter-sequence relationship information. A variational neural inference module then learns a distribution over the latent semantic alignment between the image and its corresponding description. During decoding, the learned semantic alignment guides the decoder to generate higher-quality image captions. Experimental results show that VJSA outperforms the compared models across various metrics, indicating that the proposed model is effective and feasible for image caption generation.
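
The two stages named in the abstract (joint self-attention over image and description features, then variational inference of a latent alignment) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the mean pooling, the single-head attention without learned projections, the diagonal Gaussian posterior, and the standard normal prior are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_self_attention(image_feats, word_feats):
    # Hypothetical helper: concatenating both sequences lets one attention
    # map cover intra-sequence and inter-sequence pairs at once.
    x = np.concatenate([image_feats, word_feats], axis=0)  # (n + m, d)
    scores = x @ x.T / np.sqrt(x.shape[-1])                # scaled dot product
    return softmax(scores, axis=-1) @ x                    # (n + m, d)

def sample_latent(h, w_mu, w_logvar, rng):
    # Hypothetical helper: assumed diagonal Gaussian posterior over the
    # latent alignment z, sampled with the reparameterization trick.
    pooled = h.mean(axis=0)                                # (d,)
    mu = pooled @ w_mu
    logvar = pooled @ w_logvar
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps
    # KL divergence from N(mu, diag(exp(logvar))) to the assumed N(0, I) prior.
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
    return z, kl

rng = np.random.default_rng(0)
d, k = 8, 4                                   # feature and latent sizes (arbitrary)
image_feats = rng.standard_normal((5, d))     # 5 stand-in region features
word_feats = rng.standard_normal((7, d))      # 7 stand-in word embeddings
h = joint_self_attention(image_feats, word_feats)
w_mu = 0.1 * rng.standard_normal((d, k))      # random stand-ins for learned weights
w_logvar = 0.1 * rng.standard_normal((d, k))
z, kl = sample_latent(h, w_mu, w_logvar, rng)
```

In a full model, `z` would condition the caption decoder and `kl` would be added to the reconstruction loss as the regularization term of a variational objective; both steps are omitted here.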