A Multi-stage Feature Enhancement Network with Second-Attention for Image Captioning
Xiaobao Yang, Yang, Junsheng Wu, Sugang Ma, Zhiqiang Hou, Bohui Song
DOI: https://doi.org/10.1145/3641584.3641586
2024-01-01
Abstract: In the past few years, self-attention in the Transformer has been widely used in natural language processing (NLP) and computer vision (CV) because of its excellent ability to capture global information. In image captioning in particular, self-attention can significantly improve the representation of visual information. However, the image features obtained from Transformer-based models suffer from two main problems: information redundancy caused by the global aggregation of self-attention, and a lack of semantic information caused by single-scale feature extraction. We therefore propose second-attention, which redistributes image attention weights to remove irrelevant information and strengthen attention to essential objects and relations. To enrich semantic information and further enhance the role of second-attention, we also design a Multi-stage Feature Enhancement (MFE) network that improves the representation of visual information. Extensive experiments on the MS COCO dataset show significant improvements on all popular evaluation metrics; in particular, BLEU4 increases by 1.6% and CIDEr by 3.4%.
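The idea of redistributing attention weights can be illustrated with a toy sketch. The abstract does not specify how second-attention is computed, so the code below is only an assumption-laden illustration: `self_attention_weights` is standard scaled dot-product attention, while `redistribute` is a hypothetical top-k re-weighting chosen for simplicity, not the paper's formulation. The region features and the `keep` parameter are likewise invented for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_weights(queries, keys):
    """Standard scaled dot-product attention weights, one row per query.

    Every row is dense: each region attends to all regions, which is the
    global aggregation the abstract links to information redundancy.
    """
    d = len(keys[0])
    rows = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        rows.append(softmax(scores))
    return rows


def redistribute(weights, keep=2):
    """HYPOTHETICAL re-weighting step (an assumption, not the paper's method).

    Keeps the `keep` largest weights in each row, zeroes the rest, and
    renormalizes so each row sums to 1 again. Ties at the threshold are kept.
    """
    out = []
    for row in weights:
        threshold = sorted(row, reverse=True)[keep - 1]
        kept = [w if w >= threshold else 0.0 for w in row]
        total = sum(kept)
        out.append([w / total for w in kept])
    return out


# Toy "region features": two similar pairs of 2-D vectors (illustrative only).
regions = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

first = self_attention_weights(regions, regions)  # dense weights
second = redistribute(first, keep=2)              # sparser, renormalized weights
```

In this toy setting each region ends up attending only to itself and its most similar neighbor, which shows the general effect of suppressing weakly related regions; the actual second-attention mechanism may differ substantially.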