Abstract: Image Captioning (IC) has achieved astonishing developments by incorporating various techniques into the CNN-RNN encoder-decoder architecture. However, since CNN and RNN do not share the basic network component, such a heterogeneous pipeline is hard to be trained end-to-end where the visual encoder will not learn anything from the caption supervision. This drawback inspires the researchers to develop a homogeneous architecture that facilitates end-to-end training, for which Transformer is the perfect one that has proven its huge potential in both vision and language domains and thus can be used as the basic component of the visual encoder and language decoder in an IC pipeline. Meantime, self-supervised learning releases the power of the Transformer architecture that a pre-trained large-scale one can be generalized to various tasks including IC. The success of these large-scale models seems to weaken the importance of the single IC task. However, we demonstrate that IC still has its specific significance in this age by analyzing the connections between IC with some popular self-supervised learning paradigms. Due to the page limitation, we only refer to highly important papers in this short survey and more related works can be found at https://github.com/SjokerLily/awesome-image-captioning.

What problem does this paper attempt to address?

The main problem this paper addresses is the heterogeneity between the visual encoder and the language decoder in existing Image Captioning (IC) models. Specifically:

1. **Limitations of Heterogeneous Architectures**:
   - Existing IC models usually use a CNN as the visual encoder and an RNN as the language decoder. This heterogeneous architecture makes it difficult to train the entire model end-to-end.
   - Because CNNs and RNNs are built from different basic network components, it is difficult to unify their optimization strategies (such as optimizers or learning rates), so the visual encoder cannot learn high-level semantic knowledge from caption supervision.
2. **Challenges of End-to-End Training**:
   - In the heterogeneous architecture, the visual encoder is fixed after pre-training and cannot learn new knowledge from caption supervision. Gradients therefore cannot be back-propagated from word-level supervision to pixel-level input, which limits the overall performance of the model.
3. **Advantages of the Transformer Architecture**:
   - The paper argues that the Transformer architecture can be used to build a homogeneous encoder-decoder framework, in which both the visual encoder and the language decoder are based on the Transformer.
   - The Transformer architecture has achieved remarkable success in natural language processing and computer vision, and it has strong cross-modal modeling capabilities, being able to capture dense correlations and long-distance dependencies.
4. **The Role of Self-Supervised Learning**:
   - Self-supervised learning unleashes the potential of the Transformer architecture: a large-scale pre-trained model can be generalized to various tasks, including image captioning.
   - Although large-scale pre-trained models perform well on multiple tasks, the paper emphasizes that the IC task still has specific importance and analyzes the connection between IC and large-scale pre-training.
5. **Future Development Directions**:
   - The paper explores how to further improve the IC architecture in the Transformer era, including building stronger visual encoders, designing more advanced attention mechanisms, and fusing visual and language structures.

In summary, this paper aims to solve the heterogeneity problem in existing IC models by introducing a homogeneous Transformer architecture, to achieve end-to-end training, and to explore the unique significance of the IC task in the era of large-scale pre-training.
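To make the "dense correlations" point concrete, here is a minimal sketch of scaled dot-product cross-attention, the core operation by which a Transformer decoder's word tokens can attend to every image-patch token. It is an illustration only, not the paper's implementation: the function names, the toy embeddings, and the omission of learned query/key/value projections and multiple heads are all simplifying assumptions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention.

    Each query (here: a word-token embedding) is compared against every key
    (here: an image-patch embedding), so every word can draw on every patch.
    Learned projection matrices are omitted for brevity (an assumption).
    """
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to each key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        # Weighted sum of the values, one output vector per query.
        out = [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights


# Hypothetical toy data: 3 image-patch embeddings and 2 word-token embeddings.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
words = [[2.0, 0.0], [0.0, 2.0]]

outputs, weights = cross_attention(words, patches, patches)
for i, w in enumerate(weights):
    print(f"word {i} attends to patches with weights {[round(x, 3) for x in w]}")
```

Because every step is a differentiable function of both the word and patch embeddings, a loss on the predicted words can in principle send gradients back into the patch representations, which is the end-to-end property the summary attributes to the homogeneous design.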

Image Captioning In the Transformer Age

End-to-End Transformer Based Model for Image Captioning

Recurrent Image Captioner: Describing Images with Spatial-Invariant Transformation and Attention Filtering

Improving Remote Sensing Image Captioning by Combining Grid Features and Transformer

Entangled Transformer for Image Captioning

Context-Aware Transformer for image captioning

Transformer with multi-level grid features and depth pooling for image captioning

A Multiscale Grouping Transformer With CLIP Latents for Remote Sensing Image Captioning

Improved image captioning with subword units training and transformer

Controllable image caption with an encoder-decoder optimization structure

Insights into Object Semantics: Leveraging Transformer Networks for Advanced Image Captioning

Exploring better image captioning with grid features

Tag‐inferring and tag‐guided Transformer for image captioning

CTFCD: Channel Transformer Based on Full Convolutional Decoder for Single Image Deraining

BENet: bi-directional enhanced network for image captioning

Multimodal Transformer With Multi-View Visual Representation for Image Captioning

TypeFormer: Multiscale Transformer With Type Controller for Remote Sensing Image Caption

From Plane to Hierarchy: Deformable Transformer for Remote Sensing Image Captioning

Improving Image Captioning by Leveraging Intra- and Inter-layer Global Representation in Transformer Network

Dual visual align-cross attention-based image captioning transformer

Pre-Trained CNN Architecture Analysis for Transformer-Based Indonesian Image Caption Generation Model