Research on Image Captioning Based on Vision-language Pre-trained Models

Shiwei Zou, Yuxiang Xie, Jie Yan, Yingmei Wei, Xidao Luan
DOI: https://doi.org/10.1109/BigDIA60676.2023.10429361
2023-01-01
Abstract: Image captioning is a vision-language task that aims to describe an image by automatically generating a coherent sentence. It allows computers to understand and describe images as humans do, enabling further processing of images. Because it combines computer vision and natural language processing, it has broad applications. This paper investigates the use of vision-language pre-trained models for image captioning. An encoder-decoder hybrid model based on PVT and BERT is pre-trained on a large number of image-text pairs, giving it both understanding and generation capabilities. It consists of three components: 1) an image encoder and a text encoder that extract image and text features, respectively; 2) a multi-modal encoder that aligns the two modalities; and 3) a text decoder that generates text descriptions. The paper focuses on enhancing the image encoder, which is crucial for image captioning because it extracts high-resolution image features. Reducing training time and computational cost also makes pre-trained models more widely applicable. The original model is improved by modifying the ViT architecture of the image encoding module: a feature pyramid structure and a spatial-reduction attention mechanism are introduced to extract high-resolution image features while reducing computational complexity and memory usage, thereby making the model more generally applicable. The model adopts a pure Transformer architecture, discarding convolutional structures, and achieves excellent performance.
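To illustrate why spatial-reduction attention lowers the cost of processing high-resolution feature maps, the following is a minimal back-of-the-envelope sketch, not the authors' implementation. It assumes that spatial reduction shrinks the key/value grid by a factor `sr_ratio` along each side while the queries keep full resolution, and it counts only the two attention matrix products. The function name `attention_matmul_cost` and the example sizes (a 56x56 token grid, channel width 64, reduction ratio 8) are hypothetical values chosen for illustration.

```python
def attention_matmul_cost(height, width, dim, sr_ratio=1):
    """Approximate multiply-adds for the two attention matmuls (Q @ K^T and attn @ V).

    Assumption: keys/values are computed on a grid downsampled by `sr_ratio`
    per side; queries stay at full resolution. Projections are ignored.
    """
    num_queries = height * width
    num_keys = (height // sr_ratio) * (width // sr_ratio)
    # Q @ K^T costs num_queries * num_keys * dim; attn @ V costs the same.
    return 2 * num_queries * num_keys * dim


# Hypothetical example: a 56x56 token grid with 64 channels.
full = attention_matmul_cost(56, 56, 64)               # standard attention
reduced = attention_matmul_cost(56, 56, 64, sr_ratio=8)  # spatially reduced keys/values

print(full // reduced)  # prints 64
```

Under these assumptions the attention cost falls by the square of the reduction ratio (8 x 8 = 64 here), which is what makes attention over a high-resolution token grid affordable.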