Image Captioning In the Transformer Age

Yang Xu, Li Li, Haiyang Xu, Songfang Huang, Fei Huang, Jianfei Cai
DOI: https://doi.org/10.48550/arXiv.2204.07374
2022-04-15
Abstract: Image Captioning (IC) has achieved astonishing developments by incorporating various techniques into the CNN-RNN encoder-decoder architecture. However, since CNN and RNN do not share the basic network component, such a heterogeneous pipeline is hard to be trained end-to-end where the visual encoder will not learn anything from the caption supervision. This drawback inspires the researchers to develop a homogeneous architecture that facilitates end-to-end training, for which Transformer is the perfect one that has proven its huge potential in both vision and language domains and thus can be used as the basic component of the visual encoder and language decoder in an IC pipeline. Meantime, self-supervised learning releases the power of the Transformer architecture that a pre-trained large-scale one can be generalized to various tasks including IC. The success of these large-scale models seems to weaken the importance of the single IC task. However, we demonstrate that IC still has its specific significance in this age by analyzing the connections between IC with some popular self-supervised learning paradigms. Due to the page limitation, we only refer to highly important papers in this short survey and more related works can be found at https://github.com/SjokerLily/awesome-image-captioning.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The main problem this paper addresses is the heterogeneity between the visual encoder and the language decoder in existing Image Captioning (IC) models. Specifically:

1. **Limitations of Heterogeneous Architectures**:
   - Existing IC models usually use a CNN as the visual encoder and an RNN as the language decoder. This heterogeneous architecture makes the whole model difficult to train end-to-end.
   - Because CNNs and RNNs are built from different basic network components, it is difficult to unify their optimization strategies (such as optimizers or learning rates), so the visual encoder cannot learn high-level semantic knowledge from caption supervision.
2. **Challenges of End-to-End Training**:
   - In the heterogeneous architecture, the visual encoder is fixed after pre-training and learns nothing new from caption supervision. Gradients are therefore not back-propagated from word-level supervision to the pixel-level input, which limits the model's overall performance.
3. **Advantages of the Transformer Architecture**:
   - The paper argues that the Transformer architecture can be used to build a homogeneous encoder-decoder framework in which both the visual encoder and the language decoder are Transformer-based.
   - The Transformer architecture has achieved remarkable success in both natural language processing and computer vision. It has strong cross-modal modeling capabilities and can capture dense correlations and long-distance dependencies.
4. **The Role of Self-Supervised Learning**:
   - Self-supervised learning unleashes the potential of the Transformer architecture: a large-scale pre-trained model can be generalized to various tasks, including image captioning.
   - Although large-scale pre-trained models perform well on multiple tasks, the paper emphasizes that the IC task still has specific importance and analyzes its connections to popular self-supervised learning paradigms.
5. **Future Development Directions**:
   - The paper explores how to further improve the IC architecture in the Transformer era, including building stronger visual encoders, designing more advanced attention mechanisms, and fusing visual and language structures.

In summary, this paper aims to solve the heterogeneity problem in existing IC models by introducing a homogeneous Transformer architecture, to achieve end-to-end training, and to explore the unique significance of the IC task in the era of large-scale pre-training.
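
The claim in point 3 that a Transformer can capture dense correlations between modalities rests on attention, where every word position can look at every image patch in one step. The following is a minimal sketch of scaled dot-product cross-attention in plain Python. It is an illustration only, not the paper's model: the function names, the toy vectors, and the absence of learned projection matrices and multiple heads are all assumptions made to keep the example small.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys.

    queries: list of d-dimensional vectors (here: word states)
    keys, values: lists of vectors (here: image patch tokens)
    Returns (outputs, weights), one row per query.
    """
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        w = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [
            sum(w_j * v[c] for w_j, v in zip(w, values))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Hypothetical toy data: 3 image patch tokens and 2 word queries, d = 2.
patch_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
word_queries = [[4.0, 0.0], [0.0, 4.0]]

outputs, weights = cross_attention(word_queries, patch_tokens, patch_tokens)
```

Each row of `weights` is a probability distribution over all patches, so no patch is farther than one attention step from any word. In a full encoder-decoder built from such layers, the same kind of operation appears on both the visual and the language side, which is the homogeneity the summary refers to.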