Abstract:In the era of increasingly advanced Earth Observation (EO) technologies, extracting pertinent information (such as water-bodies) from the Earth's surface has become a crucial task. Deep Learning, especially via pre-trained models, currently offers a highly promising approach for the semantic segmentation of Remote Sensing Imagery (RSI). However, effectively adapting these pre-trained models to RSI tasks remains challenging. Typically, these models undergo fine-tuning for specialized tasks, involving modifications to their parameters or structure of the original architecture, which may impact their inherent generalization capabilities. Furthermore, robust pre-trained models on nature images are not specifically designed for RSI, presenting challenges in their direct application to RSI tasks. To alleviate these problems, our study introduces a light-weight Enhanced Semantic-positional Feature Fusion Network (ESFFNet), leveraging diverse pre-trained image encoders alongside extensive EO data. The proposed method begins by leveraging pre-trained encoders, specifically Vision Transformer (ViT)-based and Convolutional Neural Network (CNN)-based models, to extract deep semantic and precise positional features respectively, without additional training. Following this, we introduce the Enhanced Semantic-positional Feature Fusion Module (ESFFM). This module adeptly merges semantic features derived from the ViT-based encoder with spatial features extracted from the CNN-based encoder. Such integration is realized via multi-scale feature fusion, local and long-distance feature integration, and dense connectivity strategies, leading to a robust feature representation. Finally, the Primary Segmentation-guided Fine Extraction Module (PSFEM) further bolsters the precision of remote sensing image segmentation. Collectively, these two modules constitute our light-weight decoder, with a parameter size of less than 4 M. Our approach is evaluated on two distinct water-body datasets, indicating superiority over other leading segmentation techniques. In addition, our method also demonstrates exemplary efficacy in diverse remote sensing segmentation tasks, such as building extraction and land cover classification. The source codes will be available at https://github.com/zhilyzhang/ESFFNet.

DESformer: A Dual-branch Encoding Strategy for Semantic Segmentation of Very-High-Resolution Remote Sensing Images Based on Feature Interaction and Multi-Scale Context Fusion

Deep Dual-Stream Network with Scale Context Selection Attention Module for Semantic Segmentation

Dual-Path Feature Aware Network for Remote Sensing Image Semantic Segmentation

Transformer and CNN Hybrid Deep Neural Network for Semantic Segmentation of Very-High-Resolution Remote Sensing Imagery

Dual attention deep fusion semantic segmentation networks of large-scale satellite remote-sensing images

Dual-Path Feature Fusion Network for Semantic Segmentation of Remote Sensing Images

Cascaded CNN and global–local attention transformer network-based semantic segmentation for high-resolution remote sensing image

A Multi-Step Fusion Network for Semantic Segmentation of High-Resolution Aerial Images

A Transformer-based Multi-Modal Fusion Network for Semantic Segmentation of High-Resolution Remote Sensing Imagery

Hybrid Attention Fusion Embedded in Transformer for Remote Sensing Image Semantic Segmentation

Dual-Domain Fusion Network Based on Wavelet Frequency Decomposition and Fuzzy Spatial Constraint for Remote Sensing Image Segmentation

Global and edge enhanced transformer for semantic segmentation of remote sensing

A ViT-Based Multiscale Feature Fusion Approach for Remote Sensing Image Segmentation

Densely Based Multi-Scale and Multi-Modal Fully Convolutional Networks for High-Resolution Remote-Sensing Image Semantic Segmentation

An Attention-Fused Network for Semantic Segmentation of Very-High-Resolution Remote Sensing Imagery

Remote Sensing Image Semantic Segmentation Method Based on a Deep Convolutional Neural Network and Multiscale Feature Fusion

A semantic segmentation method for remote sensing images based on multiple contextual feature extraction

Enhanced semantic-positional feature fusion network via diverse pre-trained encoders for remote sensing image water-body segmentation

Advancing high-resolution remote sensing: a compact and powerful approach to semantic segmentation

Multi-View Feature Fusion and Rich Information Refinement Network for Semantic Segmentation of Remote Sensing Images

Semantic Segmentation of Very-High-Resolution Remote Sensing Images via Deep Multi-Feature Learning