Abstract: How to effectively annotate, organize, and consequently manage the increasing amounts of images has been a grand challenge for both academia and industry. On one hand, the rapid growth in the number of images makes it impossible to annotate the data manually. On the other hand, the semantic gap, i.e., the lack of correspondence between the low-level information a computer extracts directly from the visual content and the interpretation that the same content has for a given user, makes automated annotation challenging. Image captioning, which aims to automatically generate a natural-language description for a given image, helps bridge the semantic gap and consequently improves the management and accessibility of image data at a semantic level. With few exceptions, image captioning has so far been explored only in English, since publicly available datasets are mostly in this language. The application of image captioning, however, should not be bounded by language. Extending the study of image captioning to other languages is essential for the large population that does not speak English. In contrast to mainstream work that focuses on English sentence generation, this paper develops an image captioning system that describes images in Chinese. Only a few studies have addressed image captioning in a cross-lingual setting, and most of them tackle the problem by constructing new datasets in the target language. Such an approach is constrained by the availability of manual annotation and is thus difficult to scale up. Instead of manually building large datasets in a new language, we make the best use of existing English datasets to construct Chinese captioning datasets by machine translation. Moreover, to improve the quality of the generated Chinese sentences, we propose to exploit automated tagging of Chinese words to provide external information for sentence generation. From English datasets we train a Multilayer Perceptron (MLP) model that directly predicts Chinese tags for a test image. For each training image, we obtain its Chinese tags by first extracting nouns, verbs, and adjectives from its original English descriptions and then performing dictionary-based translation. For the caption generation model, we follow a popular CNN+RNN approach. During beam search, the captioning model keeps several candidate words at each step and therefore constructs several candidate sentences. Using the predicted tags, we propose to improve the final generated Chinese sentence by reranking the candidate words or sentences according to how well they match the tags predicted by the MLP model. To evaluate the effectiveness of the proposed approach, we conduct experiments on two Chinese captioning datasets, namely Flickr8k-cn and Flickr30k-cn, using two models: (1) the Google model, which directly uses a full-image representation, and (2) an Attention model, which learns to focus on specific regions of the image that are deemed salient according to the attention mechanism. The experimental results show that, while reranking candidate words does not bring extra improvement, reranking candidate sentences using predicted tags improves the performance of both the Google model and the Attention model. Specifically, for the Google model, our sentence reranking strategy improves the CIDEr score from 0.474 to 0.503 on the test set of Flickr8k-cn and from 0.325 to 0.356 on the test set of Flickr30k-cn. For the Attention model, the strategy improves the CIDEr score from 0.510 to 0.536 on Flickr8k-cn and from 0.392 to 0.411 on Flickr30k-cn. For state-of-the-art performance on Chinese image captioning, we suggest the use of the Attention model with sentence reranking using predicted tags.
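
The sentence reranking idea described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact scoring formula, so everything below is an assumption made for illustration: the names `tag_match_score` and `rerank_sentences`, the additive combination of the captioner's log-probability with a tag-match term, the `weight` parameter, and the toy candidates and tag probabilities.

```python
from typing import Dict, List, Tuple

# A candidate is a tokenized sentence plus the log-probability
# assigned to it by the captioning model during beam search.
Candidate = Tuple[List[str], float]


def tag_match_score(tokens: List[str], tag_probs: Dict[str, float]) -> float:
    """Sum the predicted probabilities of the tags that occur in the sentence.

    Assumption: a tag "matches" when it appears as a token of the sentence.
    """
    present = set(tokens)
    return sum(p for tag, p in tag_probs.items() if tag in present)


def rerank_sentences(
    candidates: List[Candidate],
    tag_probs: Dict[str, float],
    weight: float = 1.0,
) -> List[Candidate]:
    """Sort candidates by captioner log-probability plus weighted tag match.

    Assumption: the two terms are combined additively; the paper may use
    a different combination.
    """
    def combined(candidate: Candidate) -> float:
        tokens, log_prob = candidate
        return log_prob + weight * tag_match_score(tokens, tag_probs)

    return sorted(candidates, key=combined, reverse=True)


# Toy example (hypothetical values): two beam-search candidates and
# tag probabilities such as an MLP tag predictor might output.
candidates = [
    (["一个", "人", "在", "草地", "上"], -4.0),
    (["一只", "狗", "在", "草地", "上", "奔跑"], -4.2),
]
tag_probs = {"狗": 0.9, "草地": 0.7, "奔跑": 0.6}

best_tokens, _ = rerank_sentences(candidates, tag_probs)[0]
print("".join(best_tokens))
```

In this toy case the second candidate has a slightly lower log-probability but mentions more of the predicted tags, so it moves to the top after reranking. Setting `weight=0.0` recovers the original beam-search ordering.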

The Solution for the CVPR2024 NICE Image Captioning Challenge

The Solution for the CVPR2023 NICE Image Captioning Challenge

NICE: CVPR 2023 Challenge on Zero-shot Image Captioning

The Solution for the ICCV 2023 1st Scientific Figure Captioning Challenge

CA-Captioner: A Novel Concentrated Attention for Image Captioning

Improving Chinese Image Captioning by Tag Prediction

Rich Image Captioning in the Wild

Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training

Zero-TextCap: Zero-shot Framework for Text-based Image Captioning

Vatex Video Captioning Challenge 2020: Multi-View Features and Hybrid Reward Strategies for Video Captioning

Improving Image Captioning with Better Use of Captions

Visuals to Text: A Comprehensive Review on Automatic Image Captioning

Image-Caption Encoding for Improving Zero-Shot Generalization

Advancements in Deep Learning-Based Image Captioning

Technical Report of NICE Challenge at CVPR 2024: Caption Re-ranking Evaluation Using Ensembled CLIP and Consensus Scores

Quality-agnostic Image Captioning to Safely Assist People with Vision Impairment

Improving Image Captioning with Better Use of Caption

CLIP4Caption ++: Multi-CLIP for Video Caption

Entrocap: Zero-Shot Image Captioning with Entropy-Based Retrieval

An image caption model based on attention mechanism and deep reinforcement learning

Image Caption Method from Coarse to Fine Based on Dual Encoder-Decoder Framework