Abstract: How to effectively annotate, organize and consequently manage the increasing amounts of images has been a grand challenge for both academia and industry. On one hand, the rapid growth in the number of images makes it impossible to annotate the data manually. On the other hand, the semantic gap, i.e., the lack of correspondence between the low-level information a computer extracts directly from the visual content and the interpretation that the same content has for a given user, makes automated annotation by computer challenging. Image captioning, which aims to automatically generate a natural-language description for a given image, helps bridge the semantic gap and consequently improves the management and accessibility of image data at a semantic level.

With few exceptions, the task of image captioning has so far been explored only in English, since publicly available datasets are mostly in this language. The application of image captioning, however, should not be bounded by language. Extending the study of image captioning along the dimension of language is essential for the large population on the planet who cannot speak English. In contrast to mainstream work, which focuses on English sentence generation, this paper develops an image captioning system that describes images in Chinese. Only a few studies have been conducted on image captioning in a cross-lingual setting, most of which tackle the problem by constructing new datasets in the target languages. Such an approach is constrained by the availability of manual annotation and is thus difficult to scale up.

Instead of manually building large datasets in a new language, we make the best use of existing English datasets to construct Chinese captioning datasets by machine translation. Moreover, to improve the quality of generated Chinese sentences, we propose to exploit automated tagging of Chinese words to provide external information for sentence generation. From English datasets we train a Multilayer Perceptron (MLP) model that can directly predict Chinese tags for a test image. For each training image, we obtain its Chinese tags by first extracting nouns, verbs and adjectives from its original English descriptions and then performing dictionary-based translation. For the caption generation model, we follow a popular CNN+RNN approach. During the beam search for generating a sentence, the captioning model can choose several candidate words and therefore construct several candidate sentences. We propose to improve the quality of the final generated Chinese sentence by reranking the candidate words or sentences according to how well they match the tags predicted by the MLP model.

To evaluate the effectiveness of the proposed approach, we conduct experiments on two Chinese captioning datasets, namely Flickr8k-cn and Flickr30k-cn, using two models: (1) the Google model, which directly uses a full-image representation; and (2) an Attention model, which learns to focus on specific regions of the image that are deemed salient by the attention mechanism. The experimental results show that, while reranking candidate words does not bring extra improvement, reranking candidate sentences using predicted tags improves the performance of both the Google model and the Attention model. Specifically, for the Google model, our sentence reranking strategy improves the CIDEr score from 0.474 to 0.503 on the test set of Flickr8k-cn and from 0.325 to 0.356 on the test set of Flickr30k-cn. For the Attention model, the strategy improves the CIDEr score from 0.510 to 0.536 on Flickr8k-cn and from 0.392 to 0.411 on Flickr30k-cn. For state-of-the-art performance on Chinese image captioning, we suggest the use of the Attention model with sentence reranking using predicted tags.
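The sentence reranking step can be illustrated with a minimal Python sketch. The abstract does not give the scoring formula, so everything here is an assumption made for illustration: the function names `tag_match_score` and `rerank_sentences`, the `weight` parameter, the additive combination of the captioning-model score with a tag-match term, and the toy candidates and tag probabilities.

```python
from typing import Dict, List, Tuple

# A candidate is a tokenized sentence plus the score (e.g. log-probability)
# that the captioning model assigned to it during beam search.
Candidate = Tuple[List[str], float]


def tag_match_score(tokens: List[str], tag_probs: Dict[str, float]) -> float:
    """Sum the predicted probabilities of the tags that occur in the sentence."""
    token_set = set(tokens)
    return sum(prob for tag, prob in tag_probs.items() if tag in token_set)


def rerank_sentences(
    candidates: List[Candidate],
    tag_probs: Dict[str, float],
    weight: float = 1.0,
) -> List[Candidate]:
    """Sort candidates by model score plus a weighted tag-match bonus.

    The additive form and `weight` are assumptions, not the paper's formula.
    """
    return sorted(
        candidates,
        key=lambda cand: cand[1] + weight * tag_match_score(cand[0], tag_probs),
        reverse=True,
    )


# Hypothetical beam-search output and hypothetical MLP tag predictions.
candidates: List[Candidate] = [
    (["一个", "人", "在", "跑步"], -2.0),
    (["一只", "狗", "在", "草地", "上", "奔跑"], -2.3),
]
tag_probs = {"狗": 0.9, "草地": 0.6, "人": 0.1}

best_tokens, best_score = rerank_sentences(candidates, tag_probs)[0]
print("".join(best_tokens))
```

In this toy case the second candidate has the lower model score but mentions two confidently predicted tags, so the bonus moves it to the top; with `weight=0.0` the ordering falls back to the captioning model's own ranking.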

CaptionNet: A Tailor-made Recurrent Neural Network for Generating Image Descriptions

Visual Attention Based on Long-Short Term Memory Model for Image Caption Generation

Recurrent Image Captioner: Describing Images with Spatial-Invariant Transformation and Attention Filtering

An image caption model based on attention mechanism and deep reinforcement learning

Image Caption Description of Traffic Scene Based on Deep Learning

Gated Object-Attribute Matching Network for Detailed Image Caption

Looking Deeper and Transferring Attention for Image Captioning.

Chinese Image Captioning Via Fuzzy Attention-based DenseNet-BiLSTM

CA-Captioner: A Novel Concentrated Attention for Image Captioning

Image Captioning with Bidirectional Semantic Attention-Based Guiding of Long Short-Term Memory

Image Captioning with Object Detection and Localization.

Image Caption Generator Using Deep Learning

Neural Image Caption Generation with Weighted Training and Reference

Attention Based Sequence-to-sequence Framework for Auto Image Caption Generation

Image Captioning Using DenseNet Network and Adaptive Attention.

Attend to Knowledge: Memory-Enhanced Attention Network for Image Captioning.

Image Captioning with Memorized Knowledge

AttResNet: Attention-based ResNet for Image Captioning

Image Captioning with Deep LSTM Based on Sequential Residual.

Cascaded Revision Network for Novel Object Captioning

Improving Chinese Image Captioning by Tag Prediction