Abstract: Image captioning, i.e., generating natural semantic descriptions of a given image, is an essential task for machines to understand image content. Remote sensing image captioning is a part of this field. Most current remote sensing image captioning models suffer from overfitting and fail to utilize the semantic information in images. To this end, we propose a Variational Autoencoder and Reinforcement Learning based Two-stage Multi-task Learning Model (VRTMM) for the remote sensing image captioning task. In the first stage, we fine-tune the CNN jointly with the Variational Autoencoder. In the second stage, the Transformer generates the text description using both spatial and semantic features. Reinforcement Learning is then applied to enhance the quality of the generated sentences. Our model surpasses the previous state-of-the-art records by a large margin on all seven scores on the Remote Sensing Image Caption Dataset. The experimental results indicate that our model is effective for remote sensing image captioning and achieves a new state-of-the-art result.

What problem does this paper attempt to address?

This paper addresses two problems: overfitting in remote-sensing image captioning, and the failure of existing models to fully utilize the semantic information in images. Specifically, the authors point out that existing remote-sensing image captioning models concentrate on exploiting image features but do too little to capture the semantic meanings of different objects and the associations between them. In addition, directly using a CNN pre-trained on the ImageNet dataset as an encoder may not suit remote-sensing images, because these images lack prominent objects and many objects are equally important. A new method is therefore required to extract and utilize the features in remote-sensing images more effectively.

To solve these problems, the authors propose a two-stage multi-task learning model (VRTMM) based on a variational autoencoder (VAE) and reinforcement learning (RL) for the remote-sensing image captioning task. The main contributions of this model include:

1. **Introduction of VAE**: Regularizes the shared encoder by reconstructing the input image, so that image features are extracted more effectively. The VAE helps to avoid overfitting and ensures that the latent space has good properties for generating new data. This alleviates the overfitting caused by the scarcity of remote-sensing image data and helps the pre-trained CNN better represent the given remote-sensing image.
2. **Simultaneous utilization of low-level and high-level image features**: Significantly improves captioning performance by using low-level and high-level image features together. High-level features contain more semantic information, while low-level features focus on details. Combining the two makes them complementary and improves the overall performance of the model.
3. **Addition of self-attention mechanism**: Enhances the quality of the final text description by adding a self-attention mechanism on the spatial features. Self-attention can better represent the semantic information of different regions, thereby generating more natural and fluent text descriptions.

Through these improvements, the authors' model outperforms the existing state-of-the-art models on the remote-sensing image captioning task and achieves significant improvements on multiple evaluation metrics.
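To make the VAE regularization in the first contribution concrete, here is a minimal sketch of the standard VAE objective for a single image: a squared reconstruction error plus the KL divergence between the encoder's diagonal Gaussian and a standard normal prior. This is the textbook formulation, not necessarily the exact loss used in VRTMM. The function names, the `beta` weight, and the flat-list image representation are all assumptions made for illustration.

```python
import math
import random


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) for one latent vector."""
    return -0.5 * sum(
        1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar)
    )


def reparameterize(mu, logvar, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, logvar)
    ]


def vae_regularizer(pixels, reconstruction, mu, logvar, beta=1.0):
    """Squared reconstruction error plus a beta-weighted KL term.

    `beta` is a hypothetical weighting knob, not a parameter named in the source.
    """
    recon_error = sum((p - r) ** 2 for p, r in zip(pixels, reconstruction))
    return recon_error + beta * kl_to_standard_normal(mu, logvar)


# Usage: with a perfect reconstruction, only the KL term contributes.
rng = random.Random(0)
mu, logvar = [1.0, 0.0], [0.0, 0.0]
z = reparameterize(mu, logvar, rng)
loss = vae_regularizer([0.2, 0.8], [0.2, 0.8], mu, logvar)
```

In a multi-task setup of the kind the summary describes, a term like this would be added to the captioning loss so that the shared encoder is penalized when its features cannot reconstruct the input image.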

Remote sensing image captioning via Variational Autoencoder and Reinforcement Learning

A Patch-Level Region-Aware Module with a Multi-Label Framework for Remote Sensing Image Captioning

Video captioning based on vision transformer and reinforcement learning

Exploring Models and Data for Remote Sensing Image Caption Generation

Recurrent Attention and Semantic Gate for Remote Sensing Image Captioning

Large Language Models for Captioning and Retrieving Remote Sensing Images

An image caption model based on attention mechanism and deep reinforcement learning

Improving Remote Sensing Image Captioning by Combining Grid Features and Transformer

Variational Transformer: A Framework Beyond the Tradeoff Between Accuracy and Diversity for Image Captioning

Towards Automatic Satellite Images Captions Generation Using Large Language Models

Deep Semantic Understanding of High Resolution Remote Sensing Image

Remote Sensing Image Change Captioning With Dual-Branch Transformers: A New Method and a Large Scale Dataset

Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment

From Plane to Hierarchy: Deformable Transformer for Remote Sensing Image Captioning

Semantic-CC: Boosting Remote Sensing Image Change Captioning via Foundational Knowledge and Semantic Guidance

Cross-Modal Retrieval and Semantic Refinement for Remote Sensing Image Captioning

A TextGCN-Based Decoding Approach for Improving Remote Sensing Image Captioning

Advancements in Visual Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques

VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis

Changes to Captions: An Attentive Network for Remote Sensing Change Captioning