Transformer with a Parallel Decoder for Image Captioning
Peilang Wei, Xu Liu, Jun Luo, Huayan Pu, Xiaoxu Huang, Shilong Wang, Huajun Cao, Shouhong Yang, Xu Zhuang, Jason Wang, Hong Yue, Cheng Ji, Mingliang Zhou
DOI: https://doi.org/10.1142/s0218001423540290
IF: 1.261
2024-01-01
International Journal of Pattern Recognition and Artificial Intelligence
Abstract: In this paper, a parallel decoder and a word group prediction module are proposed to speed up decoding and improve caption quality. The image features extracted by the encoder are linearly projected to different word groups, and a relaxed mask matrix is then designed to improve both decoding speed and caption quality. First, since a caption is composed of many words, its sentences can be broken down into word groups or single words according to their syntactic structure; we achieve this through constituency parsing. Second, we make full use of the extracted features to predict the size of each word group. Third, we propose a new embedding, built on the word embedding, that represents the information of the word. Finally, with the help of word groups, we design a mask matrix that modifies the decoding process so that each step of the model can produce one or more words in parallel. Experiments on public datasets demonstrate that our method can reduce the time complexity while maintaining competitive performance.
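The abstract does not specify how the relaxed mask matrix is built, so the following is only a minimal sketch of one plausible reading: a block-causal attention mask in which a position may attend to every position in its own word group and in all earlier groups. The function names `group_ids` and `relaxed_mask`, the example caption, and its group sizes are assumptions made for illustration, not the authors' implementation.

```python
from typing import List


def group_ids(group_sizes: List[int]) -> List[int]:
    """Map each token position to the index of its word group.

    Hypothetical helper: group_sizes would come from a word group
    prediction module or from constituency parsing of the caption.
    """
    ids: List[int] = []
    for g, size in enumerate(group_sizes):
        if size <= 0:
            raise ValueError("every word group must contain at least one word")
        ids.extend([g] * size)
    return ids


def relaxed_mask(group_sizes: List[int]) -> List[List[bool]]:
    """Build a block-causal mask (an assumed form of the relaxed mask).

    mask[i][j] is True when position i may attend to position j, i.e.
    when j lies in the same word group as i or in an earlier group.
    A standard causal mask is the special case of all groups of size 1.
    """
    ids = group_ids(group_sizes)
    n = len(ids)
    return [[ids[j] <= ids[i] for j in range(n)] for i in range(n)]


# Illustrative caption: "a dog | chases | a red ball" -> group sizes 2, 1, 3.
sizes = [2, 1, 3]
mask = relaxed_mask(sizes)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))

# Under this reading, one decoding step emits one whole group,
# so 6 words need 3 steps rather than 6.
print("words:", sum(sizes), "steps:", len(sizes))
```

With groups of size one the mask reduces to the usual lower-triangular causal mask, which is why the number of decoding steps in this sketch falls from the number of words to the number of word groups.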