Abstract: Image caption generation, which lies at the intersection of computer vision and natural language processing, has been an active research topic in recent years. It supports multimodal social media translation from unstructured image data to structured text data. Prior work has proposed several families of image captioning methods, including template-based, retrieval-based, and encoder-decoder approaches. Among these, the encoder-decoder framework is widely used: the encoder extracts image features with a Convolutional Neural Network (CNN), and the decoder uses a Recurrent Neural Network (RNN) to generate the image description. The Neural Image Caption (NIC) model has achieved good performance in image captioning, but some challenges remain. To address the lack of image information and the deviation of captions from the core content of the image, our proposed model uses visual attention to deepen its understanding of the image and incorporates image labels generated by a Fully Convolutional Network (FCN) into caption generation. Furthermore, it exploits textual attention to increase the integrity of the information. Finally, label generation, which is attached to the textual attention mechanism, and image caption generation are merged into an end-to-end trainable framework. Extensive experiments on the AIC-ICC image caption benchmark dataset show that the proposed model is effective and feasible for image caption generation.
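
The abstract does not specify the form of its attention mechanisms, so the following is only a minimal sketch of generic soft visual attention, not the authors' model. It assumes dot-product scoring between the decoder state and each region feature; the function names `softmax` and `attend` and the toy numbers are hypothetical, chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(region_features, hidden_state):
    """Soft attention over image regions (assumed dot-product scoring).

    region_features: list of feature vectors, one per image region
    hidden_state: decoder state vector with the same length as each feature
    Returns (weights, context), where context is the weighted sum of features.
    """
    # Relevance of each region to the current decoder state.
    scores = [
        sum(f * h for f, h in zip(feat, hidden_state))
        for feat in region_features
    ]
    weights = softmax(scores)
    dim = len(hidden_state)
    # Context vector: attention-weighted average of the region features.
    context = [
        sum(w * feat[i] for w, feat in zip(weights, region_features))
        for i in range(dim)
    ]
    return weights, context


# Toy example: three regions with 2-D features (made-up numbers).
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
state = [2.0, 0.0]
weights, context = attend(regions, state)
```

In an encoder-decoder captioner, a context vector of this kind would typically be recomputed at every decoding step and fed to the RNN together with the previous word, so that each generated word can draw on different image regions.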

Cascade Recurrent Neural Network for Image Caption Generation

Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)

Intelligent image captioning

Cascaded Revision Network for Novel Object Captioning

Bidirectional Multimodal Recurrent Neural Networks with Refined Visual Features for Image Captioning

c-RNN: A Fine-Grained Language Model for Image Captioning

Cascade Attention: Multiple Feature Based Learning for Image Captioning

Recurrent Image Captioner: Describing Images with Spatial-Invariant Transformation and Attention Filtering

Recurrent Fusion Network for Image Captioning

CaptionNet: A Tailor-made Recurrent Neural Network for Generating Image Descriptions

Evolutionary Recurrent Neural Network for Image Captioning

A Parallel-Fusion RNN-LSTM Architecture for Image Caption Generation

Visual Attention Based on Long-Short Term Memory Model for Image Caption Generation

Multi-scale Hierarchical Residual Network for Dense Captioning

Recurrent Attention LSTM Model for Image Chinese Caption Generation

A Deep Neural Framework for Image Caption Generation Using GRU-Based Attention Mechanism

Image caption generation with dual attention mechanism

Dual-Stream Recurrent Neural Network for Video Captioning

What is the Role of Recurrent Neural Networks (RNNs) in an Image Caption Generator?

An Empirical Study of Language CNN for Image Captioning