Abstract: How to effectively annotate, organize and consequently manage the increasing amounts of images has been a grand challenge for both academia and industry. On one hand, the rapid growth in the number of images makes it impossible to annotate the data manually. On the other hand, the semantic gap, i.e., the lack of correspondence between the low-level information a computer extracts directly from the visual content and the interpretation that the same content has for a given user, makes automated annotation by computer challenging. Image captioning, which aims to automatically generate a natural-language description for a given image, helps bridge the semantic gap and consequently improves the management and accessibility of image data at a semantic level. With few exceptions, the task of image captioning has so far been explored only in English, since publicly available datasets are mostly in this language. The application of image captioning, however, should not be bounded by language. Extending the study of image captioning along the dimension of language is essential for the large population worldwide that does not speak English. Unlike mainstream work, which focuses on English sentence generation, this paper develops an image captioning system that describes images in Chinese. Only a few studies have been conducted on image captioning in a cross-lingual setting, most of which tackle the problem by constructing new datasets in the target languages. Such an approach is constrained by the availability of manual annotation and is thus difficult to scale up. Instead of manually building large datasets in a new language, we make the best use of existing English datasets to construct Chinese captioning datasets by machine translation. Moreover, to improve the quality of the generated Chinese sentences, we propose to exploit automated tagging of Chinese words to provide external information for sentence generation. From English datasets we train a Multilayer Perceptron (MLP) model that can directly predict Chinese tags for a test image. For each training image, we obtain its Chinese tags by first extracting nouns, verbs and adjectives from its original English descriptions and then performing dictionary-based translation. For the caption generation model, we follow a popular CNN+RNN approach. During the beam search for generating a sentence, the captioning model can choose several candidate words and therefore construct several candidate sentences. We propose to improve the quality of the final generated Chinese sentence by reranking the candidate words or sentences according to how well they match the tags predicted by the MLP model. To evaluate the effectiveness of the proposed approach, we conduct experiments on two Chinese captioning datasets, namely Flickr8k-cn and Flickr30k-cn, using two models: (1) the Google model, which directly uses a full-image representation; and (2) an Attention model, which learns to focus on specific regions of the image that are deemed salient according to the attention mechanism. The experimental results show that, while reranking candidate words does not bring extra improvement, reranking candidate sentences using predicted tags improves the performance of both the Google model and the Attention model. Specifically, for the Google model, our sentence reranking strategy improves the CIDEr score from 0.474 to 0.503 on the test set of Flickr8k-cn and from 0.325 to 0.356 on the test set of Flickr30k-cn. For the Attention model, the strategy improves the CIDEr score from 0.510 to 0.536 on Flickr8k-cn and from 0.392 to 0.411 on Flickr30k-cn. For state-of-the-art performance on Chinese image captioning, we suggest the use of the Attention model with sentence reranking using predicted tags.
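The sentence reranking step can be illustrated with a minimal sketch. The abstract states only that candidate sentences are reranked by how well they match the predicted tags; the additive scoring rule, the `weight` parameter, the function name `rerank_sentences`, and the toy candidates and tag scores below are all assumptions made for illustration, not the paper's actual formulation.

```python
from typing import Dict, List, Tuple


def rerank_sentences(
    candidates: List[Tuple[List[str], float]],
    tag_scores: Dict[str, float],
    weight: float = 1.0,
) -> List[Tuple[List[str], float]]:
    """Re-order beam-search candidates by caption score plus tag-match score.

    Assumption: the two scores are combined additively; the source does not
    specify the exact combination rule.
    """
    reranked = []
    for words, caption_score in candidates:
        # Count each distinct word once so repeated words are not over-rewarded.
        tag_match = sum(tag_scores.get(w, 0.0) for w in set(words))
        reranked.append((words, caption_score + weight * tag_match))
    # Highest combined score first.
    return sorted(reranked, key=lambda item: item[1], reverse=True)


# Hypothetical beam-search output: (segmented words, log-probability).
candidates = [
    (["一个", "人", "在", "跑步"], -2.0),
    (["一只", "狗", "在", "草地", "上", "奔跑"], -2.3),
]
# Hypothetical tag relevance scores, standing in for the MLP tag predictor.
tag_scores = {"狗": 0.9, "草地": 0.6, "奔跑": 0.5}

best_words, best_score = rerank_sentences(candidates, tag_scores)[0]
print("".join(best_words))
```

In this toy case the second candidate has a lower caption score but contains all three predicted tags, so it moves to the top after reranking. Setting `weight` to 0 recovers the original beam-search order.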

Actor-Critic Sequence Training for Image Captioning

Self-critical n-step Training for Image Captioning

Image Captioning with Residual Swin Transformer and Actor-Critic

Teacher-Critical Training Strategies for Image Captioning

Image Captioning with Partially Rewarded Imitation Learning

Human-like Controllable Image Captioning with Verb-specific Semantic Roles

Exploring Spatial-Based Position Encoding for Image Captioning

Explicit Image Caption Reasoning: Generating Accurate and Informative Captions for Complex Scenes with LMM

Recurrent Image Captioner: Describing Images with Spatial-Invariant Transformation and Attention Filtering

Stack-Captioning: Coarse-to-Fine Learning for Image Captioning

Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization

Attribute Assisted Teacher-Critical Training Strategies for Image Captioning

Self-Annotated Training for Controllable Image Captioning

A Deep Reinforced Training Method For Location-Based Image Captioning

Learning to Evaluate Image Captioning

Modeling Coherence and Diversity for Image Paragraph Captioning

Improving Chinese Image Captioning by Tag Prediction

Contrastive Semantic Similarity Learning for Image Captioning Evaluation

Fluent and Accurate Image Captioning with a Self-Trained Reward Model

Improving Image Captioning with Better Use of Caption