Abstract: How to effectively annotate, organize, and consequently manage the ever-increasing volume of images has been a grand challenge for both academia and industry. On one hand, the rapid growth of image collections makes manual annotation infeasible. On the other hand, the semantic gap, i.e., the lack of correspondence between the low-level information a computer extracts directly from visual content and the interpretation that the same content has for a given user, makes automated annotation challenging. Image captioning, which aims to automatically generate a natural-language description for a given image, helps bridge the semantic gap and consequently improves the management and accessibility of image data at a semantic level.

With few exceptions, image captioning has so far been explored only in English, since publicly available datasets are mostly in this language. The application of image captioning, however, should not be bounded by language. Extending the study of image captioning along the dimension of language is essential for the large population that does not speak English. Unlike mainstream work that focuses on English sentence generation, this paper develops an image captioning system that describes images in Chinese. Only a few studies have addressed image captioning in a cross-lingual setting, and most of them tackle the problem by constructing new datasets in the target language. Such an approach is constrained by the availability of manual annotation and is thus difficult to scale up.

Instead of manually building large datasets in a new language, we make the best use of existing English datasets to construct Chinese captioning datasets by machine translation. Moreover, to improve the quality of the generated Chinese sentences, we propose to exploit automated tagging of Chinese words to provide external information for sentence generation. From English datasets we train a Multilayer Perceptron (MLP) model that directly predicts Chinese tags for a test image. For each training image, we obtain its Chinese tags by first extracting nouns, verbs, and adjectives from its original English descriptions and then performing dictionary-based translation. For the caption generation model, we follow the popular CNN+RNN approach. During beam search, the captioning model keeps several candidate words at each step and therefore constructs several candidate sentences. We propose to improve the quality of the final generated Chinese sentence by reranking the candidate words or sentences according to how well they match the tags predicted by the MLP model.

To evaluate the effectiveness of the proposed approach, we conduct experiments on two Chinese captioning datasets, Flickr8k-cn and Flickr30k-cn, using two models: (1) the Google model, which directly uses a full-image representation; and (2) an Attention model, which learns to focus on specific regions of the image that are deemed salient by the attention mechanism. The experimental results show that, while reranking candidate words brings no extra improvement, reranking candidate sentences using predicted tags improves the performance of both models. Specifically, for the Google model, sentence reranking improves the CIDEr score from 0.474 to 0.503 on the test set of Flickr8k-cn and from 0.325 to 0.356 on the test set of Flickr30k-cn. For the Attention model, it improves the CIDEr score from 0.510 to 0.536 on Flickr8k-cn and from 0.392 to 0.411 on Flickr30k-cn. For state-of-the-art performance on Chinese image captioning, we suggest using the Attention model with sentence reranking based on predicted tags.
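The sentence reranking step can be illustrated with a minimal sketch. The abstract does not give the scoring formula, so everything specific here is an assumption made for illustration: the function name `rerank_sentences`, the additive combination of the captioning model's log-probability with a tag-match bonus, the `weight` parameter, and the toy candidates and tag scores.

```python
from typing import Dict, List, Sequence, Tuple


def rerank_sentences(
    candidates: Sequence[Tuple[List[str], float]],
    tag_scores: Dict[str, float],
    weight: float = 1.0,
) -> List[Tuple[List[str], float]]:
    """Rerank beam-search candidates by adding a tag-match bonus.

    candidates: (tokenized sentence, log-probability from the captioning model)
    tag_scores: predicted tag -> relevance score from the tag predictor
    weight:     hypothetical trade-off between model score and tag matching
    """
    reranked = []
    for tokens, log_prob in candidates:
        # Sum the scores of predicted tags that occur in the sentence,
        # counting each tag at most once.
        tag_match = sum(tag_scores.get(tok, 0.0) for tok in set(tokens))
        reranked.append((tokens, log_prob + weight * tag_match))
    # Highest combined score first.
    reranked.sort(key=lambda item: item[1], reverse=True)
    return reranked


# Toy data (made up): two candidate sentences and three predicted tags.
candidates = [
    (["一只", "狗", "在", "草地", "上"], -2.0),
    (["一只", "狗", "在", "草地", "上", "奔跑"], -2.3),
]
tag_scores = {"狗": 0.9, "奔跑": 0.6, "草地": 0.5}

best_tokens, best_score = rerank_sentences(candidates, tag_scores)[0]
print("".join(best_tokens))
```

In this toy case the second candidate has a lower model score but mentions one more predicted tag, so it moves to the top after reranking. With an empty tag dictionary the ranking falls back to the captioning model's own ordering.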
