Remote sensing image captioning via Variational Autoencoder and Reinforcement Learning

Xiangqing Shen, Bing Liu, Yong Zhou, Jiaqi Zhao, Mingming Liu
DOI: https://doi.org/10.1016/j.knosys.2020.105920
2020-09-01
Abstract: Image captioning, i.e., generating natural semantic descriptions of a given image, is an essential task for machines to understand image content. Remote sensing image captioning is a part of this field. Most current remote sensing image captioning models suffer from overfitting and fail to utilize the semantic information in images. To this end, we propose a Variational Autoencoder and Reinforcement Learning based Two-stage Multi-task Learning Model (VRTMM) for the remote sensing image captioning task. In the first stage, we fine-tune the CNN jointly with the Variational Autoencoder. In the second stage, the Transformer generates the text description using both spatial and semantic features. Reinforcement Learning is then applied to enhance the quality of the generated sentences. Our model surpasses the previous state-of-the-art results by a large margin on all seven scores on the Remote Sensing Image Caption Dataset. The experimental results indicate that our model is effective for remote sensing image captioning and achieves a new state-of-the-art result.
computer science, artificial intelligence
What problem does this paper attempt to address?
This paper addresses two problems in remote sensing image captioning: overfitting, and the failure of existing models to fully utilize the semantic information in images. Specifically, the authors point out that existing remote sensing image captioning models put considerable effort into using image features, but do not adequately capture the semantic meanings of, and associations between, different objects. In addition, directly using a CNN pre-trained on the ImageNet dataset as an encoder may not suit remote sensing images, because these images often lack prominent objects and many of their objects are equally important. A new method is therefore needed to extract and utilize the features in remote sensing images more effectively.

To solve these problems, the authors propose a two-stage multi-task learning model (VRTMM) based on a variational autoencoder (VAE) and reinforcement learning (RL) for the remote sensing image captioning task. The main contributions of this model include:

1. **Introduction of VAE**: The shared encoder is regularized by reconstructing the input image, so that it extracts image features more effectively. The VAE helps avoid overfitting and encourages a latent space with good properties for generating new data. This alleviates the overfitting caused by the scarcity of remote sensing image data and helps the pre-trained CNN better represent a given remote sensing image.
2. **Simultaneous utilization of low-level and high-level image features**: Using low-level and high-level image features together significantly improves captioning performance. High-level features contain more semantic information, while low-level features focus on details, so the two types of features complement each other.
3. **Addition of self-attention mechanism**: A self-attention mechanism applied to the spatial features enhances the quality of the final text description. Self-attention can better represent the semantic information of different regions, and thereby yields more natural and fluent descriptions.

With these improvements, the authors' model outperforms the existing state-of-the-art models on the remote sensing image captioning task and achieves significant gains on multiple evaluation metrics.
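
The VAE regularization described in the first contribution can be illustrated with a small sketch. The summary does not state the loss the authors use, so this assumes the standard VAE objective: a reconstruction error plus the KL divergence between a diagonal-Gaussian posterior and a standard normal prior. The function names, the mean-squared-error reconstruction term, and the `beta` weight are all illustrative assumptions, not details taken from the paper.

```python
import math
import random


def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), one draw per latent dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))


def reconstruction_mse(image, reconstruction):
    """Mean squared error between two flattened pixel lists (assumed reconstruction term)."""
    return sum((x - y) ** 2 for x, y in zip(image, reconstruction)) / len(image)


def vae_loss(image, reconstruction, mu, logvar, beta=1.0):
    """Reconstruction error plus a beta-weighted KL term; beta is a hypothetical knob."""
    return reconstruction_mse(image, reconstruction) + beta * kl_to_standard_normal(mu, logvar)


if __name__ == "__main__":
    rng = random.Random(0)
    mu, logvar = [0.3, -0.1], [-1.0, -0.5]       # made-up encoder outputs
    z = reparameterize(mu, logvar, rng)           # latent code fed to the decoder
    image = [0.2, 0.8, 0.5, 0.1]                  # made-up flattened pixels
    reconstruction = [0.25, 0.7, 0.5, 0.2]        # made-up decoder output
    print("z =", z)
    print("loss =", vae_loss(image, reconstruction, mu, logvar))
```

In a multi-task setup of the kind the summary describes, a term like this would be added to the training objective of the shared encoder, so that the encoder is penalized when its features cannot reconstruct the input image. How the authors weight the terms is not given here.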