A TextGCN-Based Decoding Approach for Improving Remote Sensing Image Captioning

Swadhin Das, Raksha Sharma
2024-10-12
Abstract: Remote sensing images are highly valued for their ability to address complex real-world issues such as risk management, security, and meteorology. However, manually captioning these images is challenging and requires specialized knowledge across various domains. This letter presents an approach for automatically describing (captioning) remote sensing images. We propose a novel encoder-decoder setup that deploys a Text Graph Convolutional Network (TextGCN) and multi-layer LSTMs. The embeddings generated by TextGCN enhance the decoder's understanding by capturing the semantic relationships among words at both the sentence and corpus levels. Furthermore, we advance our approach with a comparison-based beam search method to ensure fairness in the search strategy for generating the final caption. We present an extensive evaluation of our approach against various other state-of-the-art encoder-decoder frameworks. We evaluated our method across three datasets using seven metrics: BLEU-1 to BLEU-4, METEOR, ROUGE-L, and CIDEr. The results demonstrate that our approach significantly outperforms other state-of-the-art encoder-decoder methods.
Machine Learning
What problem does this paper attempt to address?
The paper attempts to address the problem of automatically describing (generating captions for) remote sensing images. Specifically, manually generating descriptions for these complex remote sensing images is very difficult and requires expertise. Therefore, this paper proposes an encoder-decoder framework based on Text Graph Convolutional Network (TextGCN) and multi-layer LSTM to improve the automatic description capability of remote sensing images.

### Main Issues

1. **Challenges of Manually Annotating Remote Sensing Images**:
   - Remote sensing images are very complex and involve expertise from multiple fields.
   - Manual annotation is time-consuming and prone to errors.
2. **Limitations of Existing Methods**:
   - Existing automatic annotation methods tend to overfit when dealing with complex images.
   - The search mechanism is imperfect, resulting in low-quality descriptions.

### Solution

1. **Encoder-Decoder Framework**:
   - Use ResNet to extract image features.
   - Use TextGCN to generate word embeddings, enhancing the decoder's understanding of semantic relationships.
   - Employ multi-layer LSTM for decoding, improving the model's comprehensive understanding of image and text features.
2. **Improved Search Strategy**:
   - Introduce a comparison-based beam search method, combining BLEU-2, METEOR, and ROUGE-L scores to balance precision, recall, and the longest common subsequence.
   - Include sentences generated by greedy search to diversify the search strategy.

### Experimental Results

- The effectiveness of the model was validated through various evaluation metrics (BLEU-1 to BLEU-4, METEOR, ROUGE-L, and CIDEr).
- Results show that the proposed model significantly outperforms other existing methods.
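
The comparison-based selection described under "Improved Search Strategy" can be illustrated with a minimal sketch. It is not the authors' implementation. The scoring functions are simplified stand-ins (clipped bigram precision for BLEU-2, unigram F1 for METEOR, LCS-based F1 for ROUGE-L), the equal weighting of the three scores is an assumption, and `select_caption` with its `references` argument is a hypothetical helper: the summary does not say what the candidates are compared against, so the caller supplies the comparison captions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bigram_precision(cand, ref):
    """Clipped bigram precision: a simplified stand-in for BLEU-2."""
    c, r = ngrams(cand, 2), ngrams(ref, 2)
    total = sum(c.values())
    if total == 0:
        return 0.0
    return sum(min(count, r[gram]) for gram, count in c.items()) / total


def unigram_f1(cand, ref):
    """Unigram F1: a stand-in for METEOR (no stemming, synonyms, or penalty)."""
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / len(cand), overlap / len(ref)
    return 2 * p * r / (p + r)


def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(cand, ref):
    """F1 over the longest common subsequence: a simplified ROUGE-L."""
    lcs = lcs_len(cand, ref)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(cand), lcs / len(ref)
    return 2 * p * r / (p + r)


def select_caption(beam_captions, greedy_caption, references):
    """Pick the candidate with the best mean combined score (hypothetical helper).

    Assumes `references` is a non-empty list of comparison captions.
    """
    if not references:
        raise ValueError("at least one comparison caption is required")
    # The greedy caption joins the beam outputs; duplicates are dropped.
    candidates = list(dict.fromkeys(list(beam_captions) + [greedy_caption]))
    refs = [r.lower().split() for r in references]

    def score(caption):
        toks = caption.lower().split()
        per_ref = [
            (bigram_precision(toks, r) + unigram_f1(toks, r) + rouge_l_f1(toks, r)) / 3
            for r in refs
        ]
        return sum(per_ref) / len(per_ref)

    return max(candidates, key=score)
```

Calling `select_caption(beam_outputs, greedy_output, comparison_captions)` returns one caption string. Because the greedy output is scored alongside the beam outputs, it can win when the beam candidates agree less with the comparison captions.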
### Summary

This paper proposes an encoder-decoder framework combining TextGCN and multi-layer LSTM, along with an improved search strategy, effectively addressing the challenges of automatic annotation of remote sensing images and improving the quality and accuracy of generated descriptions.
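
As a rough illustration of how a TextGCN-style graph captures corpus-level word relationships, the sketch below computes word-word edge weights from positive pointwise mutual information (PMI) over sliding windows, which is how the original TextGCN formulation weights word-word edges. It is an assumption that the captioning model builds its graph this way. The word-document TF-IDF edges and the graph convolution itself are omitted, and `pmi_edges` and its `window` parameter are hypothetical names.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_edges(captions, window=3):
    """Return {(word_a, word_b): PMI} for word pairs with positive PMI.

    Hypothetical helper: co-occurrence is counted over sliding windows of
    `window` tokens, and only positive-PMI pairs become edges.
    """
    word_windows = Counter()  # number of windows containing each word
    pair_windows = Counter()  # number of windows containing each word pair
    n_windows = 0
    for caption in captions:
        tokens = caption.lower().split()
        if not tokens:
            continue
        # A caption shorter than the window counts as a single window.
        n_spans = max(1, len(tokens) - window + 1)
        for i in range(n_spans):
            unique = sorted(set(tokens[i:i + window]))
            n_windows += 1
            word_windows.update(unique)
            pair_windows.update(combinations(unique, 2))

    edges = {}
    for (a, b), n_ab in pair_windows.items():
        # PMI = log( p(a, b) / (p(a) * p(b)) ), with window-based probabilities.
        pmi = math.log(n_ab * n_windows / (word_windows[a] * word_windows[b]))
        if pmi > 0:
            edges[(a, b)] = pmi
    return edges
```

Words that co-occur more often than chance across the whole caption corpus receive an edge, so embeddings learned over such a graph can reflect corpus-level associations in addition to word order within a single sentence.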