Context-Aware Transformer for image captioning

Xin Yang, Ying Wang, Haishun Chen, Jie Li, Tingting Huang
DOI: https://doi.org/10.1016/j.neucom.2023.126440
IF: 6
2023-06-12
Neurocomputing
Abstract: Recently, image captioning models have made remarkable progress by adopting the transformer architecture, which uses self-attention to explore intra- and inter-modal interactions. However, most existing methods consider only region-level characteristics when calculating attention weights and ignore image-level information. This seriously hinders the model's understanding of the scene content. In this paper, we propose a Context-Aware Transformer (CATNet) with two novel designs, namely Context Augmented Attention (CAA) and Dual Way Controller (DWC). Concretely, CAA in the encoder enables the extraction of a more comprehensive visual representation by modeling the communication between multi-level visual features. DWC in the decoder enhances the fusion of visual features and language representation by exploiting the complementarity of global context and local regions. Extensive experiments on the MSCOCO dataset show that the proposed CATNet achieves state-of-the-art performance on both the Karpathy test set and the online test server.
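The abstract does not give the equations for CAA, so the following is only a minimal sketch of the general idea it describes: letting region-level attention also see image-level information. It assumes a mean-pooled vector as the global context and omits all learned projections; the function names (`mean_pool`, `attend_with_global_context`) are hypothetical and are not taken from the paper.

```python
import math

def mean_pool(vectors):
    """Average equal-length vectors into one image-level vector (assumed global context)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend_with_global_context(regions):
    """Scaled dot-product attention in which every region also attends to
    a mean-pooled global vector. Illustrative only: no learned query/key/value
    projections, single head, keys double as values."""
    d = len(regions[0])
    global_ctx = mean_pool(regions)
    keys = regions + [global_ctx]  # region-level keys plus one image-level key
    outputs, weights = [], []
    for q in regions:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out = [sum(wi * k[j] for wi, k in zip(w, keys)) for j in range(d)]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

# Three toy 2-D "region features"
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attend_with_global_context(regions)
```

Each row of `weights` has one more entry than there are regions; the last entry is the share of attention given to the global context vector.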
computer science, artificial intelligence