STSNet: A Cross-Spatial Resolution Multi-Modal Remote Sensing Deep Fusion Network for High Resolution Land-Cover Segmentation

Beibei Yu, Jiayi Li, Xin Huang
DOI: https://doi.org/10.1016/j.inffus.2024.102689
2025-01-01
Abstract: Recently, deep learning models have found extensive application in high-resolution land-cover segmentation research. However, most current research still suffers from issues such as insufficient utilization of multi-modal information, which limits further improvement in high-resolution land-cover segmentation accuracy. Moreover, differences in the size and spatial resolution of multi-modal datasets collectively pose challenges to multi-modal land-cover segmentation. Therefore, we propose a high-resolution land-cover segmentation network (STSNet) with cross-spatial resolution spatio-temporal-spectral deep fusion. This network effectively utilizes spatio-temporal-spectral features to achieve information complementarity among multi-modal data. Specifically, STSNet consists of four components: (1) a high-resolution, multi-scale spatial-spectral encoder that jointly extracts subtle spatial-spectral features from hyperspectral and high spatial resolution images; (2) a long-term spatio-temporal encoder, formulated by spectral convolution and a spatio-temporal transformer block, that simultaneously delineates the spatial, temporal and spectral information in dense time-series Sentinel-2 imagery; (3) a cross-resolution fusion module that alleviates the spatial resolution differences between multi-modal data and effectively leverages complementary spatio-temporal-spectral information; and (4) a multi-scale decoder that integrates multi-scale information from multi-modal data. We used airborne hyperspectral remote sensing imagery acquired over the Shenyang region of China in 2020, with a spatial resolution of 1 m, 249 spectral bands, and a spectral resolution <= 5 nm, together with Sentinel dense time-series images acquired in the same period, with a spatial resolution of 10 m, 10 spectral bands, and a time-series length of 31.
These datasets were combined to generate a multi-modal dataset called WHU-(HSR)-S-2-MT, which is the first open-access large-scale high spatio-temporal-spectral satellite remote sensing dataset (i.e., with >2500 image pairs, each covering 300 m x 300 m). Additionally, we employed two open-source datasets to validate the effectiveness of the proposed modules. Extensive experiments show that our multi-scale spatial-spectral encoder, spatio-temporal encoder, and cross-resolution fusion module outperform existing state-of-the-art (SOTA) algorithms in terms of overall performance on high-resolution land-cover segmentation. The new multi-modal dataset will be made available at http://irsip.whu.edu.cn/resources/resources_en_v2.php, along with the corresponding code for accessing and utilizing the dataset at https://github.com/RS-Mage/STSNet.
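To make the cross-resolution mismatch concrete, the following is a minimal NumPy sketch of the array shapes implied by the figures in the abstract (300 m x 300 m pairs at 1 m and 10 m, 249 and 10 bands, 31 time steps). It is not STSNet's fusion module: the temporal mean, the nearest-neighbour upsampling, and the channel concatenation are deliberately naive stand-ins chosen only for illustration, and the function names `patch_pixels` and `fuse_by_upsampling` are hypothetical. It also assumes the two pixel grids are exactly co-registered, which the abstract does not state.

```python
import numpy as np


def patch_pixels(extent_m: float, gsd_m: float) -> int:
    """Number of pixels along one side of a patch of the given ground extent."""
    return int(round(extent_m / gsd_m))


def fuse_by_upsampling(hsr: np.ndarray, s2_series: np.ndarray) -> np.ndarray:
    """Naive cross-resolution fusion baseline (illustrative assumption only).

    hsr:       (C_h, H, W) high-resolution hyperspectral patch
    s2_series: (T, C_s, h, w) coarse time series over the same footprint
    returns:   (C_h + C_s, H, W) stacked array on the fine grid
    """
    # Collapse the time axis with a plain mean (stand-in for a temporal encoder).
    s2_mean = s2_series.mean(axis=0)  # (C_s, h, w)

    H, W = hsr.shape[1:]
    h, w = s2_mean.shape[1:]
    if H % h != 0 or W % w != 0:
        raise ValueError("fine grid must be an integer multiple of the coarse grid")

    # Nearest-neighbour upsampling: repeat every coarse pixel along both axes.
    s2_up = np.repeat(np.repeat(s2_mean, H // h, axis=1), W // w, axis=2)

    # Stack both modalities along the channel axis.
    return np.concatenate([hsr, s2_up.astype(hsr.dtype)], axis=0)


# Shapes taken from the figures quoted in the abstract.
hsr_px = patch_pixels(300, 1)    # 300 pixels per side at 1 m
s2_px = patch_pixels(300, 10)    # 30 pixels per side at 10 m

hsr = np.zeros((249, hsr_px, hsr_px), dtype=np.float32)
s2 = np.zeros((31, 10, s2_px, s2_px), dtype=np.float32)

fused = fuse_by_upsampling(hsr, s2)
print(fused.shape)  # (259, 300, 300)
```

Under these assumptions each Sentinel-2 pixel covers a 10 x 10 block of hyperspectral pixels, so simple upsampling adds no spatial detail to the coarse modality; this gap is what a dedicated cross-resolution fusion module is meant to handle.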