Abstract:

Objective: Chinese image description combines computer vision and natural language processing, and is a typical multimodal, cross-domain problem for artificial intelligence algorithms. A Chinese image description model must output a Chinese description for each given test image. The description must conform to natural language conventions and point out the important information in the image, covering the main characters, scenes, actions, and other content. Because most current open-source datasets are in English, research on image description has focused mainly on English. Chinese descriptions usually have greater flexibility in syntax and word choice, which makes algorithm implementation more challenging. Therefore, few researchers have studied image description, and Chinese description in particular.

Methods: This study attempts to derive an image description generation model from the Flickr8k-cn and Flickr30k-cn datasets. At each time step of description generation, the model can decide whether to rely more on image or text information. The model captures more important information from the image to improve the richness and accuracy of the Chinese description. The image description dataset of this study consists mainly of Chinese description sentences. The method consists of an encoder and a decoder. The encoder is based on a convolutional neural network. The decoder is based on a long short-term memory network and is composed of a multimodal summary generation network.

Results: Experiments on the Flickr8k-cn and Flickr30k-cn Chinese datasets show that the proposed method is superior to the existing Chinese summary generation models.

Conclusion: The proposed method is effective, and its performance improves greatly on that of the benchmark model. Compared with the existing Chinese summary generation models, its performance is also superior. In future work, more visual prior information, such as action categories and the relationships between objects, will be incorporated into the model to further improve the quality of the description sentences and achieve the effect of “writing from a picture”.
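
The per-step choice between image and text information described in the Methods can be illustrated with a minimal sketch. The abstract does not specify the mechanism, so everything here is an assumption: the function name `gated_context`, the weight matrices `w_att` and `w_gate`, and the use of a single scalar gate are hypothetical, and random arrays stand in for real CNN features and LSTM states.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def gated_context(region_feats, hidden, text_vec, w_att, w_gate):
    """Mix image and text information for one decoding step (illustrative only).

    region_feats: (k, d) CNN features for k image regions (encoder output)
    hidden:       (d,)   current LSTM hidden state (decoder)
    text_vec:     (d,)   summary of the words generated so far
    w_att:        (d, d) hypothetical attention weights
    w_gate:       (2*d,) hypothetical gate weights
    """
    # Attention over image regions, conditioned on the decoder state.
    scores = region_feats @ (w_att @ hidden)   # shape (k,)
    alpha = softmax(scores)                    # region weights, sum to 1
    visual_ctx = alpha @ region_feats          # shape (d,)

    # Scalar gate in (0, 1): near 1 favours the image, near 0 favours the text.
    gate_in = np.concatenate([hidden, text_vec])
    beta = 1.0 / (1.0 + np.exp(-(w_gate @ gate_in)))

    # Convex combination of the visual and textual context vectors.
    context = beta * visual_ctx + (1.0 - beta) * text_vec
    return context, alpha, beta

# Random stand-ins for encoder features, decoder state, and weights.
rng = np.random.default_rng(0)
k, d = 5, 8
context, alpha, beta = gated_context(
    rng.normal(size=(k, d)), rng.normal(size=d), rng.normal(size=d),
    rng.normal(size=(d, d)), rng.normal(size=2 * d),
)
print(context.shape, float(beta))
```

In a trained model the gate weights would be learned jointly with the encoder and decoder, and the resulting context vector would feed the word prediction layer; the sketch shows only how one gate value can shift a single step toward the image or toward the text.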

Can A Machine Generate Humanlike Language Descriptions for A Remote Sensing Image?

Exploring Models and Data for Remote Sensing Image Caption Generation

Ontology-Guided Image Interpretation For Geobia Of High Spatial Resolution Remote Sense Imagery: A Coastal Area Case Study

Towards Automatic Satellite Images Captions Generation Using Large Language Models

Machine-to-Machine Visual Dialoguing with ChatGPT for Enriched Textual Image Description

Discoverability in Satellite Imagery: A Good Sentence is Worth a Thousand Pictures

Deep Semantic Understanding of High Resolution Remote Sensing Image

From Captions to Visual Concepts and Back

Application of Dual Attention Mechanism in Chinese Image Captioning

Image Captioning with Object Detection and Localization

Human-like Controllable Image Captioning with Verb-specific Semantic Roles

Deep Learning for Image-to-Text Generation: A Technical Overview

A TextGCN-Based Decoding Approach for Improving Remote Sensing Image Captioning

Automatic Image Description Generation with Emotional Classifiers

From Pixels to Prose: Advancing Multi-Modal Language Models for Remote Sensing

Large Language Models for Captioning and Retrieving Remote Sensing Images

Automatic Caption Generation for News Images

Remote Sensing ChatGPT: Solving Remote Sensing Tasks with ChatGPT and Visual Models

Enhancing Image Description Generation through Deep Reinforcement Learning: Fusing Multiple Visual Features and Reward Mechanisms

Translating SAR to Optical Images for Assisted Interpretation

A Review of Deep Learning-Based Remote Sensing Image Caption: Methods, Models, Comparisons and Future Directions