Show Me What and Where has Changed? Question Answering and Grounding for Remote Sensing Change Detection

Ke Li, Fuyu Dong, Di Wang, Shaofeng Li, Quan Wang, Xinbo Gao, Tat-Seng Chua
2024-10-31
Abstract: Remote sensing change detection aims to perceive changes occurring on the Earth's surface from remote sensing data in different periods, and feed these changes back to humans. However, most existing methods only focus on detecting change regions, lacking the ability to interact with users to identify changes that the users expect. In this paper, we introduce a new task named Change Detection Question Answering and Grounding (CDQAG), which extends the traditional change detection task by providing interpretable textual answers and intuitive visual evidence. To this end, we construct the first CDQAG benchmark dataset, termed QAG-360K, comprising over 360K triplets of questions, textual answers, and corresponding high-quality visual masks. It encompasses 10 essential land-cover categories and 8 comprehensive question types, which provides a large-scale and diverse dataset for remote sensing applications. Based on this, we present VisTA, a simple yet effective baseline method that unifies the tasks of question answering and grounding by delivering both visual and textual answers. Our method achieves state-of-the-art results on both the classic CDVQA and the proposed CDQAG datasets. Extensive qualitative and quantitative experimental results provide useful insights for the development of better CDQAG models, and we hope that our work can inspire further research in this important yet underexplored direction. The proposed benchmark dataset and method are available at https://github.com/like413/VisTA.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### The Problem Addressed by the Paper

This paper addresses a key gap in remote sensing change detection: how to let users interact with a change detection system through natural language question answering (QA) and visual grounding. Existing systems mainly perform full-image analysis, such as binary segmentation and image captioning, and offer little flexible interaction with users. This limits the development of user-friendly and efficient intelligent interpretation in remote sensing. To overcome this limitation, the authors introduce a new task, Change Detection Question Answering and Grounding (CDQAG). The task produces a textual answer together with pixel-level visual evidence, so users can verify the answer visually and place more confidence in the result.

### Main Contributions

1. **Proposing the new CDQAG task**:
   - Unlike traditional Visual Question Answering (VQA), CDQAG not only generates textual answers but also provides pixel-level visual evidence, which is crucial for developing reliable remote sensing change detection systems.
2. **Constructing the first CDQAG benchmark dataset QAG-360K**:
   - The dataset contains over 360K triplets of questions, textual answers, and corresponding high-quality visual masks. It covers 10 land-cover categories and 8 question types, providing a large-scale and diverse dataset for remote sensing applications.
3. **Proposing the CDQAG baseline method VisTA**:
   - VisTA achieves state-of-the-art performance on both the classic CDVQA and QAG-360K datasets, demonstrating its capability in handling complex questions and cross-modal answer relationships.
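The question, answer, and mask triplets described above can be pictured as a simple record. The following is only an illustrative sketch: the class name `CDQAGSample`, its field names, the file names, and the example question are all hypothetical and are not taken from the QAG-360K release, whose actual file format is not described here.

```python
from dataclasses import dataclass
from typing import List

# A binary mask stored as rows of 0/1 values (1 = pixel is visual evidence).
Mask = List[List[int]]


@dataclass
class CDQAGSample:
    """One hypothetical CDQAG sample: an image pair plus a QA-mask triplet."""

    image_before_path: str  # image from the earlier date (hypothetical field)
    image_after_path: str   # image from the later date (hypothetical field)
    question: str           # natural language question about the change
    answer: str             # textual answer
    mask: Mask              # pixel-level visual evidence for the answer

    def changed_pixel_count(self) -> int:
        """Number of pixels marked as evidence in the mask."""
        return sum(sum(row) for row in self.mask)


# Made-up example values, for illustration only.
sample = CDQAGSample(
    image_before_path="t1.png",
    image_after_path="t2.png",
    question="Have any buildings changed?",
    answer="yes",
    mask=[[0, 1, 0],
          [0, 1, 0]],
)
print(sample.changed_pixel_count())
```

Keeping the mask next to the answer in one record mirrors the idea that every textual answer comes with visual evidence a user can inspect.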
### Method Overview

- **Dataset Construction**:
  - The authors collected high-quality remote sensing images from existing binary and semantic change detection datasets, including Hi-UCD, SECOND, and LEVIR-CD, covering different cities and regions with spatial resolutions ranging from 0.1 to 3.0 meters.
  - Using an automated triplet generation engine, they produced over 6.8K pairs of remote sensing images and over 360K triplets of questions, answers, and visual masks.
- **Task Definition**:
  - The input is a question together with a pair of remote sensing images taken at the same location at different times. The output is a textual answer and the corresponding visual segmentation mask.
- **Model Architecture**:
  - **Text and Image Feature Extraction**: The pre-trained CLIP model is used to extract text features, and two ResNets with shared weights are used to extract multi-scale visual features.
  - **Multi-stage Semantic Reasoning**: A multi-stage reasoning module performs fine-grained cross-modal information interaction, generating refined multi-modal features.
  - **Text-Visual Answer Decoder**: The generated coarse visual mask and the question-answer features are used for the final prediction, producing the textual and visual answers.

### Experimental Results

- **Performance Evaluation**:
  - Extensive experiments were conducted on the QAG-360K and classic CDVQA datasets, where VisTA outperformed existing methods on multiple metrics.
  - Textual answers were evaluated with Average Accuracy (AA) and Overall Accuracy (OA), and visual answers with mean Intersection over Union (mIoU) and overall Intersection over Union (oIoU).

### Conclusion

By introducing the CDQAG task and the QAG-360K dataset, the authors provide a new direction for research and applications in remote sensing change detection. The results of the VisTA model demonstrate its potential in handling complex questions and cross-modal answer relationships, laying a foundation for future research.
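The four metrics named in the experimental results can be made concrete with a short sketch. The summary does not give their formulas, so the definitions below are assumptions based on common usage: mIoU averages the per-sample IoU, oIoU divides the total intersection by the total union over all samples, OA is the fraction of correct answers, and AA averages the accuracy of each question type. All function names are hypothetical, and the paper's exact definitions may differ.

```python
from collections import defaultdict


def _intersection_union(pred, target):
    """Count intersection and union pixels of two same-shaped binary masks."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter, union


def mean_iou(preds, targets):
    """Assumed mIoU: average of per-sample IoU (empty vs. empty counts as 1.0)."""
    scores = []
    for pred, target in zip(preds, targets):
        inter, union = _intersection_union(pred, target)
        scores.append(inter / union if union else 1.0)
    return sum(scores) / len(scores)


def overall_iou(preds, targets):
    """Assumed oIoU: total intersection over total union across all samples."""
    total_inter = total_union = 0
    for pred, target in zip(preds, targets):
        inter, union = _intersection_union(pred, target)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 1.0


def overall_accuracy(pred_answers, true_answers):
    """Assumed OA: fraction of all questions answered correctly."""
    correct = sum(p == t for p, t in zip(pred_answers, true_answers))
    return correct / len(true_answers)


def average_accuracy(pred_answers, true_answers, question_types):
    """Assumed AA: mean of the per-question-type accuracies."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for p, t, q_type in zip(pred_answers, true_answers, question_types):
        counts[q_type] += 1
        hits[q_type] += int(p == t)
    return sum(hits[q] / counts[q] for q in counts) / len(counts)
```

Under these assumed definitions, oIoU is dominated by samples with large change regions, while mIoU weights every sample equally. Likewise, OA favors frequent question types, while AA weights each type equally. Reporting both members of each pair therefore gives a more balanced picture.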