Abstract: How to effectively annotate, organize, and consequently manage the increasing number of images has been a grand challenge for both academia and industry. On one hand, the rapid growth in the number of images makes it impossible to annotate the data manually. On the other hand, the semantic gap, i.e., the lack of correspondence between the low-level information a computer extracts directly from the visual content and the interpretation that the same content has for a given user, makes automated annotation by computer challenging. Image captioning, which aims to automatically generate a natural-language description for a given image, helps bridge the semantic gap and consequently improves the management and accessibility of image data at a semantic level. With few exceptions, the task of image captioning has so far been explored only in English, since publicly available datasets are mostly in this language. The application of image captioning, however, should not be bounded by language. Extending the study of image captioning along the dimension of language is essential for the large population worldwide that does not speak English. Unlike mainstream work, which focuses on English sentence generation, this paper develops an image captioning system that describes images in Chinese. Only a few studies have addressed image captioning in a cross-lingual setting, and most of them tackle the problem by constructing new datasets in the target languages. Such an approach is constrained by the availability of manual annotation and is thus difficult to scale up. Instead of manually building large datasets in a new language, we make the best use of existing English datasets to construct Chinese captioning datasets by machine translation. Moreover, to improve the quality of generated Chinese sentences, we propose to exploit automated tagging of Chinese words to provide external information for sentence generation. From English datasets we train a Multilayer Perceptron (MLP) model that can directly predict Chinese tags for a test image. For each training image, we obtain its Chinese tags by first extracting nouns, verbs, and adjectives from its original English descriptions and then performing dictionary-based translation. For the caption generation model, we follow a popular CNN+RNN approach. During the beam search for generating a sentence, the captioning model can choose several candidate words and therefore construct several candidate sentences. Using the predicted tags, we propose to improve the quality of the final generated Chinese sentence by reranking the candidate words or sentences according to how well they match the tags predicted by the MLP model. To evaluate the effectiveness of our proposed approach, we conduct experiments on two Chinese captioning datasets, namely Flickr8k-cn and Flickr30k-cn, using two models: (1) the Google model, which directly uses a full-image representation; and (2) an Attention model, which learns to focus on specific regions of the image that are deemed salient according to the attention mechanism. The experimental results show that, while reranking candidate words does not bring extra improvement, reranking candidate sentences using predicted tags improves the performance of both the Google model and the Attention model. Specifically, for the Google model, our sentence reranking strategy improves the CIDEr score from 0.474 to 0.503 on the test set of Flickr8k-cn and from 0.325 to 0.356 on the test set of Flickr30k-cn. For the Attention model, the strategy improves the CIDEr score from 0.510 to 0.536 on Flickr8k-cn and from 0.392 to 0.411 on Flickr30k-cn. For state-of-the-art performance on Chinese image captioning, we suggest the use of the Attention model with sentence reranking using predicted tags.
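
The sentence reranking idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the scoring formula, so the linear combination of sentence log-probability and summed tag scores used here is an assumption, as are the function name `rerank_sentences`, the `weight` parameter, and the toy candidates and tag scores.

```python
from typing import Dict, List, Tuple

Candidate = Tuple[List[str], float]  # (tokenized sentence, log-probability)


def rerank_sentences(
    candidates: List[Candidate],
    tag_scores: Dict[str, float],
    weight: float = 1.0,
) -> List[Candidate]:
    """Sort beam-search candidates by log-probability plus a tag-match bonus.

    The additive scoring rule is an illustrative assumption, not the
    formula used in the paper.
    """
    reranked = []
    for words, log_prob in candidates:
        # Add the predicted score of each distinct tag found in the sentence.
        tag_match = sum(tag_scores.get(word, 0.0) for word in set(words))
        reranked.append((words, log_prob + weight * tag_match))
    # Highest combined score first.
    return sorted(reranked, key=lambda item: item[1], reverse=True)


# Toy beam-search output: two candidate sentences with log-probabilities.
candidates = [
    (["一个", "人", "在", "街道", "上"], -2.0),
    (["一只", "狗", "在", "草地", "上", "奔跑"], -2.3),
]
# Hypothetical tag scores, standing in for the MLP's predictions.
tag_scores = {"狗": 0.9, "草地": 0.7, "奔跑": 0.6}

best_words, best_score = rerank_sentences(candidates, tag_scores)[0]
print("".join(best_words), round(best_score, 3))
```

In this toy case the second candidate has the lower log-probability but contains all three predicted tags, so it moves to the top after reranking. Setting `weight` to 0 recovers the original beam-search ordering.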

Leveraging Unpaired Out-of-domain Data for Image Captioning

Cross-Domain Image Captioning Via Cross-Modal Retrieval and Model Adaptation

Exploring Spatial-Based Position Encoding for Image Captioning

Dual Learning for Cross-domain Image Captioning

Unpaired Image Captioning With Semantic-Constrained Self-Learning

Object-Centric Unsupervised Image Captioning

Improving Multimodal Datasets with Image Captioning

Improving Image Captioning with Better Use of Captions

DIR: Retrieval-Augmented Image Captioning with Comprehensive Understanding

Image Captioning with Multi-Context Synthetic Data

Multimodal Data Augmentation for Image Captioning using Diffusion Models

Exploiting Cross-Modal Prediction and Relation Consistency for Semisupervised Image Captioning

Enhancing Image Captioning Using Deep Convolutional Generative Adversarial Networks

Remote Sensing Image Captioning with Sequential Attention and Flexible Word Correlation

Improving Chinese Image Captioning by Tag Prediction

Semi-Supervised Image Captioning by Adversarially Propagating Labeled Data

An image caption model based on attention mechanism and deep reinforcement learning

Augment the Pairs: Semantics-Preserving Image-Caption Pair Augmentation for Grounding-Based Vision and Language Models