Adaptive Semantic-Enhanced Transformer for Image Captioning.

Jing Zhang,Zhongjun Fang,Han Sun,Zhe Wang
DOI: https://doi.org/10.1109/tnnls.2022.3185320
IF: 14.255
2024-01-01
IEEE Transactions on Neural Networks and Learning Systems
Abstract:In the research on image captioning, rich semantic information is very important for generating critical caption words as guiding information. However, semantic information from offline object detectors involves many semantic objects that do not appear in the caption, thereby bringing noise into the decoding process. To produce more accurate semantic guiding information and further optimize the decoding process, we propose an end-to-end adaptive semantic-enhanced transformer (AS-Transformer) model for image captioning. For semantic enhancement information extraction, we propose a constrained weaklysupervised learning (CWSL) module, which reconstructs the semantic object's probability distribution detected by the multiple instances learning (MIL) through a joint loss function. These strengthened semantic objects from the reconstructed probability distribution can better depict the semantic meaning of images. Also, for semantic enhancement decoding, we propose an adaptive gated mechanism (AGM) module to adjust the attention between visual and semantic information adaptively for the more accurate generation of caption words. Through the joint control of the CWSL module and AGM module, our proposed model constructs a complete adaptive enhancement mechanism from encoding to decoding and obtains visual context that is more suitable for captions. Experiments on the public Microsoft Common Objects in COntext (MSCOCO) and Flickr30K datasets illustrate that our proposed AS-Transformer can adaptively obtain effective semantic information and adjust the attention weights between semantic and visual information automatically, which achieves more accurate captions compared with semantic enhancement methods and outperforms state-of-the-art methods.
What problem does this paper attempt to address?