LG-MLFormer: Local and Global MLP for Image Captioning

Zetao Jiang, Xiuxian Wang, Zhongyi Zhai, Bo Cheng
DOI: https://doi.org/10.1007/s13735-023-00266-9
2023-01-01
International Journal of Multimedia Information Retrieval
Abstract: Self-attention-based image captioning models suffer from a loss of the spatial information carried by visual features. Introducing relative position encoding can alleviate this problem to some extent, but it brings additional parameters and greater computational complexity. To address this, we propose a novel local–global MLFormer (LG-MLFormer) with a specifically designed encoder module, the local–global multi-layer perceptron (LG-MLP). The LG-MLP can capture the latent correlations between different images, and its linear stacking calculation mode reduces computational complexity. It consists of two independent local MLP (LM) modules and a cross-domain global MLP (CDGM) module. The LM module specially designs the mapping dimension between linear layers to achieve self-compensation of the visual features’ spatial information without introducing relative position encoding. The CDGM module aggregates cross-domain latent correlations between grid-based features and region-based features so that their global and local semantic associations complement each other. Experiments on the Karpathy test split and the online test server show that our approach provides superior or comparable performance to the state of the art (SOTA). Trained models and code for reproducing the experiments are publicly available at: https://github.com/wxx1921/LGMLFormer-local-and-global-mlp-for-image-captioning
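The encoder structure described in the abstract (two independent LM modules followed by one CDGM module) can be sketched roughly as below. This is an illustrative reading only, not the authors' implementation. The token-mixing formulation, the GELU activation, the residual connections, the feature sizes (49 grid tokens, 36 region tokens, 64 channels), and all function and parameter names are assumptions introduced here. The underlying idea is that a linear layer applied along the token axis has one weight per position, so it can in principle be sensitive to spatial order without a separate position encoding.

```python
import numpy as np


def gelu(x):
    # tanh approximation of GELU (activation choice is an assumption)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))


def init_pair(rng, num_tokens, hidden):
    # Hypothetical weight shapes: the linear layers map along the token axis.
    w1 = rng.normal(0.0, 0.02, size=(hidden, num_tokens))
    w2 = rng.normal(0.0, 0.02, size=(num_tokens, hidden))
    return w1, w2


def token_mixing_mlp(x, w1, w2):
    # x: (num_tokens, dim). Each weight is tied to a token position, so the
    # layer mixes information across positions; a residual path keeps x.
    return x + w2 @ gelu(w1 @ x)


def lg_mlp_block(grid, region, params):
    # Two independent local MLPs, one per feature type (assumed reading of "LM").
    grid_local = token_mixing_mlp(grid, *params["grid"])
    region_local = token_mixing_mlp(region, *params["region"])
    # Cross-domain global MLP over the concatenated tokens (assumed reading of "CDGM").
    joint = np.concatenate([grid_local, region_local], axis=0)
    joint = token_mixing_mlp(joint, *params["cross"])
    n_grid = grid.shape[0]
    return joint[:n_grid], joint[n_grid:]


rng = np.random.default_rng(0)
n_grid, n_region, dim = 49, 36, 64  # illustrative sizes, not from the paper
params = {
    "grid": init_pair(rng, n_grid, 128),
    "region": init_pair(rng, n_region, 128),
    "cross": init_pair(rng, n_grid + n_region, 256),
}
grid_feats = rng.normal(size=(n_grid, dim))
region_feats = rng.normal(size=(n_region, dim))
grid_out, region_out = lg_mlp_block(grid_feats, region_feats, params)
```

Every step here is a plain matrix product, which is one way to read the abstract's "linear stacking calculation mode". The repository linked in the abstract is the authoritative reference for the actual layer dimensions and wiring.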