SPT: Spatial Pyramid Transformer for Image Captioning

Haonan Zhang, Pengpeng Zeng, Lianli Gao, Xinyu Lyu, Jingkuan Song, Heng Tao Shen
DOI: https://doi.org/10.1109/tcsvt.2023.3336371
IF: 5.859
2024-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Existing approaches to image captioning tend to adopt Transformer-based architectures with grid features, which represent the state of the art. However, these strategies tend to process the grid features at a fixed resolution, which often hampers the perception of entities at various scales. In addition, directly applying them may also result in the loss of spatial and fine-grained semantic information. To this end, we propose a simple yet effective method named Spatial Pyramid Transformer (SPT). Specifically, it adopts several parameter-shared pyramid structures to perform semantic interactions across different grid resolutions. In each layer, we design a Spatial-aware Pseudo-supervised (SP) module, which aims to adaptively resort to disrupted spatial information among flattened grid features. Moreover, to maintain the model size and enhance semantics, we build a simple weighted residual connection, termed the Scale-wise Reinforcement (SR) module, to simultaneously explore both low- and high-level encoded features. Extensive experiments on the MS-COCO benchmark demonstrate that our method achieves new state-of-the-art performance without introducing excessive parameters compared with the vanilla Transformer. In addition, our method is extended to the video captioning task, which further demonstrates the practicability of the proposed method. Code is available at https://github.com/zchoi/SPT.
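The abstract names two ideas that a small sketch may make more concrete: viewing grid features at several resolutions, and fusing low- and high-level encoded features through a weighted residual connection. The Python below is only an illustration under stated assumptions and is not taken from the authors' code at https://github.com/zchoi/SPT. The function names (`avg_pool_grid`, `build_pyramid`, `weighted_fusion`), the use of average pooling to build the pyramid, and the use of softmax weights for the fusion are all hypothetical choices made for this example.

```python
import math


def avg_pool_grid(grid, out_size):
    """Average-pool a square grid of feature vectors to out_size x out_size.

    Assumption for this sketch: out_size divides the grid side length.
    """
    side = len(grid)
    if side % out_size != 0:
        raise ValueError("out_size must divide the grid side length")
    k = side // out_size          # pooling window size
    dim = len(grid[0][0])         # feature dimension of each grid cell
    pooled = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            acc = [0.0] * dim
            for di in range(k):
                for dj in range(k):
                    vec = grid[i * k + di][j * k + dj]
                    for d in range(dim):
                        acc[d] += vec[d]
            row.append([v / (k * k) for v in acc])
        pooled.append(row)
    return pooled


def build_pyramid(grid, sizes):
    """Return one pooled copy of the grid per requested resolution."""
    return [avg_pool_grid(grid, s) for s in sizes]


def weighted_fusion(layer_outputs, logits):
    """Fuse same-shaped per-layer feature vectors with softmax weights.

    Hypothetical stand-in for a weighted residual connection: each layer's
    output contributes in proportion to a (normally learnable) scalar weight.
    """
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(layer_outputs[0])
    return [
        sum(w * out[d] for w, out in zip(weights, layer_outputs))
        for d in range(dim)
    ]
```

In a real model the pooled grids would be flattened and passed through a shared encoder, and the fusion weights would be learned; here they are plain lists and fixed numbers so that the example stays self-contained.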