Semantic association enhancement transformer with relative position for image captioning

Xin Jia, Yunbo Wang, Yuxin Peng, Shengyong Chen
DOI: https://doi.org/10.1007/s11042-022-12776-5
IF: 2.577
2022-03-15
Multimedia Tools and Applications
Abstract: Transformer-based architectures have shown encouraging results in image captioning. They usually rely on self-attention to establish the semantic association between objects in an image when predicting a caption. However, when the appearance features of a candidate object and a query object show only weak dependence, self-attention-based methods struggle to capture the semantic association between them. In this paper, a Semantic Association Enhancement Transformer model is proposed to address this challenge. First, an Appearance-Geometry Multi-Head Attention is introduced to model a visual relationship by integrating the geometry features and appearance features of the objects. The visual relationship characterizes the semantic association and relative position among the objects. Second, a Visual Relationship Improving module is presented to weigh the importance of the query object's appearance feature and geometry feature to the modeled visual relationship. The visual relationship among different objects is then adaptively improved according to this importance, especially for objects with weak dependence on appearance features, thereby enhancing their semantic association. Extensive experiments on the MS COCO dataset demonstrate that the proposed method outperforms state-of-the-art methods.
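The abstract does not give the formulas behind the two modules, so the following is only a minimal single-head sketch of the general idea: attention logits built from appearance features are combined with logits derived from relative box geometry, and a per-query gate weighs the two. The log-ratio geometry encoding follows the common relation-network style rather than the paper's own definition, and every name here (`relative_geometry`, `appearance_geometry_attention`, `w_geo`, `w_gate`) as well as the sigmoid gate is a hypothetical choice for illustration.

```python
import numpy as np


def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def relative_geometry(boxes):
    """Pairwise relative-position features (assumed encoding).

    boxes: (N, 4) array of (center_x, center_y, width, height), all positive.
    Returns an (N, N, 4) array where entry [i, j] describes box j relative
    to query box i.
    """
    cx, cy, w, h = boxes.T
    # Clamp the normalized offsets so the log stays finite when i == j.
    dx = np.log(np.maximum(np.abs(cx[:, None] - cx[None, :]) / w[:, None], 1e-3))
    dy = np.log(np.maximum(np.abs(cy[:, None] - cy[None, :]) / h[:, None], 1e-3))
    dw = np.log(w[None, :] / w[:, None])
    dh = np.log(h[None, :] / h[:, None])
    return np.stack([dx, dy, dw, dh], axis=-1)


def appearance_geometry_attention(feats, boxes, Wq, Wk, Wv, w_geo, w_gate):
    """Single-head attention mixing appearance and geometry logits.

    feats: (N, D) appearance features; Wq, Wk, Wv: (D, d) projections;
    w_geo: (4,) geometry projection; w_gate: (D,) gate projection.
    The gate is a hypothetical stand-in for weighing appearance against
    geometry per query object.
    """
    d = Wq.shape[1]
    q, k, v = feats @ Wq, feats @ Wk, feats @ Wv
    app_logits = q @ k.T / np.sqrt(d)                # (N, N) appearance term
    geo_logits = relative_geometry(boxes) @ w_geo    # (N, N) geometry term
    gate = 1.0 / (1.0 + np.exp(-(feats @ w_gate)))   # (N,) value in (0, 1)
    logits = gate[:, None] * app_logits + (1.0 - gate[:, None]) * geo_logits
    attn = softmax(logits, axis=-1)                  # each row sums to 1
    return attn @ v, attn
```

In this sketch, a query whose gate value is close to 0 attends almost entirely by relative position, which is one simple way to keep a relationship alive when appearance similarity is weak. A multi-head version would repeat the same computation with separate projections per head and concatenate the outputs.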
computer science, information systems, theory & methods, engineering, electrical & electronic, software engineering