Tag‐inferring and tag‐guided Transformer for image captioning
Yaohua Yi, Yinkai Liang, Dezhu Kong, Ziwei Tang, Jibing Peng
DOI: https://doi.org/10.1049/cvi2.12280
IF: 1.484
2024-03-24
IET Computer Vision
Abstract: Image captioning is an important task for understanding images. Recently, many studies have used tags to build alignments between image information and language information. However, existing methods overlook the fact that simple semantic tags struggle to express the detailed semantics of different image contents. The authors therefore propose a tag‐inferring and tag‐guided Transformer for image captioning that explores the role of tags and generates fine‐grained captions. First, a tag‐inferring encoder (TIE) combines the tags provided by a scene graph model with image features extracted by a detection model to infer tags that carry deeper semantic information. Then, a tag‐guided decoder (TGD) uses this tag information to decode sentences: its short‐term attention (STA) improves the features of words in the sentence, and its gated cross‐modal attention combines image features, tag features and language features into informative semantic features. Finally, the word probability distribution is calculated for every position in the sequence to generate a description of the image. Experiments demonstrate that the method can exploit tags to obtain precise captions and that it achieves competitive performance on the MSCOCO data set, with a BLEU‐4 score of 40.6 and a CIDEr score of 135.3.
Computer Science, Artificial Intelligence; Engineering, Electrical & Electronic
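
The abstract names gated cross‐modal attention but does not give its equations, so the following is only a minimal sketch of one common way such a gate could work, not the authors' implementation. It assumes a single word vector attends separately over image features and tag features, and that a scalar sigmoid gate mixes the two resulting context vectors. The names `attend`, `gated_fusion`, `gate_w` and `gate_b` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, feats):
    """Scaled dot-product attention of one query over a list of vectors.

    For brevity the feature vectors serve as both keys and values
    (an assumption; a real model would use learned projections).
    """
    scale = math.sqrt(len(query))
    weights = softmax(
        [sum(q * f for q, f in zip(query, feat)) / scale for feat in feats]
    )
    dim = len(feats[0])
    return [sum(w * feat[i] for w, feat in zip(weights, feats)) for i in range(dim)]


def gated_fusion(word, image_feats, tag_feats, gate_w, gate_b=0.0):
    """Mix image and tag contexts with a scalar sigmoid gate (hypothetical form)."""
    img_ctx = attend(word, image_feats)
    tag_ctx = attend(word, tag_feats)
    # Gate is computed from the concatenated contexts; gate_w has length 2 * dim.
    z = sum(w * x for w, x in zip(gate_w, img_ctx + tag_ctx)) + gate_b
    g = 1.0 / (1.0 + math.exp(-z))
    return [g * i + (1.0 - g) * t for i, t in zip(img_ctx, tag_ctx)]
```

With all gate weights set to zero the gate equals 0.5 and the output is the plain average of the image and tag contexts; a large positive `gate_b` pushes the output towards the image context alone.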