Show, tell and rectify: Boost image caption generation via an output rectifier

Guowei Ge, Yufeng Han, Lingguang Hao, Kuangrong Hao, Bing Wei, Xue-song Tang
DOI: https://doi.org/10.1016/j.neucom.2024.127651
IF: 6
2024-04-05
Neurocomputing
Abstract: Transformer-based models excel at capturing the interactions between textual and visual features. However, language bias remains a thorny problem in image captioning, leading to inconsistency between the generated sentences and the actual images. Existing models focus on preventing wrong words from being output, with little attention to how to correct them. The problem is that a model cannot accurately determine whether the current word is correct before that word has been output. To address this issue, a Double Decoding Transformer framework is proposed. First, a Rectifier is introduced to correct the output sentences without relying on a pre-trained language module. In addition, visual features provide powerful guidance for attention distribution in the Decoder and for attention redistribution in the Rectifier. Because downsampling is involved, some information loss during visual feature extraction is inevitable. Therefore, a Visual Feature Compensation (VFC) module is proposed to compensate for this loss as far as possible. Finally, integrating these two modules into a transformer-based framework yields the Double Decoding Transformer (D2 Transformer). Extensive experiments on the MSCOCO dataset with the "Karpathy" test set demonstrate the validity of the proposed model.
computer science, artificial intelligence