LLAFN-Generator: Learnable Linear-Attention with Fast-Normalization for Large-Scale Image Captioning

Xiaobao Yang, Xi Tian, Junsheng Wu, Xiaochun Yang, Sugang Ma, Xinman Qi, Zhiqiang Hou
DOI: https://doi.org/10.1016/j.cviu.2024.104088
IF: 4.886
2024-01-01
Computer Vision and Image Understanding
Abstract: Although the Transformer is widely used in computer vision, the quadratic complexity of its self-attention hinders processing in large-scale image captioning tasks. In this paper, we therefore propose a Learnable Linear-Attention with Fast-Normalization for Large-Scale Image Captioning (dubbed LLAFN-Generator). First, it introduces a Learnable Linear-Attention (LLA) module to learn weight scores for large-scale images; the module is implemented simply with two linear layers and greatly reduces computational complexity. Meanwhile, the Fast-Normalization (FN) method replaces the original Softmax function in the Learnable Linear-Attention to improve computational speed. Additionally, a feature enhancement module is used to compensate for shallow, fine-grained information and enhance the model's feature representation. Finally, extensive experiments on the MS COCO dataset show that, compared with models of the same size, computational complexity is reduced by 30% and the number of parameters by 20%, while the BLEU_1 and CIDEr metrics increase by 1.2% and 3.6%, respectively.
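The abstract does not give the formulas for LLA or FN, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It assumes that the two linear layers act as a learnable score projection and a learnable value projection, and that Fast-Normalization is a ReLU followed by division by the row sum (an exponential-free substitute for Softmax). The names `fast_normalize`, `LearnableLinearAttention`, `n_slots`, `w_score`, and `w_value` are hypothetical.

```python
import numpy as np


def fast_normalize(scores, axis=-1, eps=1e-6):
    """Assumed Softmax substitute: clip negatives, then divide by the sum.

    No exponentials are computed; each row becomes non-negative and sums
    to (almost exactly) one.
    """
    positive = np.maximum(scores, 0.0)
    return positive / (positive.sum(axis=axis, keepdims=True) + eps)


class LearnableLinearAttention:
    """Hypothetical attention built from two linear layers.

    The cost is O(n_tokens * d_model * n_slots), which is linear in the
    number of tokens, whereas standard self-attention forms an
    n_tokens x n_tokens score matrix.
    """

    def __init__(self, d_model, n_slots, seed=0):
        rng = np.random.default_rng(seed)
        # First linear layer: maps each token to n_slots learnable scores.
        self.w_score = rng.normal(0.0, d_model ** -0.5, size=(d_model, n_slots))
        # Second linear layer: maps normalized weights back to d_model features.
        self.w_value = rng.normal(0.0, n_slots ** -0.5, size=(n_slots, d_model))

    def __call__(self, x):
        scores = x @ self.w_score          # (n_tokens, n_slots)
        weights = fast_normalize(scores)   # replaces Softmax
        return weights @ self.w_value      # (n_tokens, d_model)


# Example: 196 image-patch tokens with 64 features each (random stand-in data).
tokens = np.random.default_rng(1).normal(size=(196, 64))
attention = LearnableLinearAttention(d_model=64, n_slots=32)
print(attention(tokens).shape)  # (196, 64)
```

Because the score matrix has shape `(n_tokens, n_slots)` rather than `(n_tokens, n_tokens)`, doubling the number of image tokens doubles the work in this sketch instead of quadrupling it.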