Abstract:In the era of increasingly advanced Earth Observation (EO) technologies, extracting pertinent information (such as water-bodies) from the Earth's surface has become a crucial task. Deep Learning, especially via pre-trained models, currently offers a highly promising approach for the semantic segmentation of Remote Sensing Imagery (RSI). However, effectively adapting these pre-trained models to RSI tasks remains challenging. Typically, these models undergo fine-tuning for specialized tasks, involving modifications to their parameters or structure of the original architecture, which may impact their inherent generalization capabilities. Furthermore, robust pre-trained models on nature images are not specifically designed for RSI, presenting challenges in their direct application to RSI tasks. To alleviate these problems, our study introduces a light-weight Enhanced Semantic-positional Feature Fusion Network (ESFFNet), leveraging diverse pre-trained image encoders alongside extensive EO data. The proposed method begins by leveraging pre-trained encoders, specifically Vision Transformer (ViT)-based and Convolutional Neural Network (CNN)-based models, to extract deep semantic and precise positional features respectively, without additional training. Following this, we introduce the Enhanced Semantic-positional Feature Fusion Module (ESFFM). This module adeptly merges semantic features derived from the ViT-based encoder with spatial features extracted from the CNN-based encoder. Such integration is realized via multi-scale feature fusion, local and long-distance feature integration, and dense connectivity strategies, leading to a robust feature representation. Finally, the Primary Segmentation-guided Fine Extraction Module (PSFEM) further bolsters the precision of remote sensing image segmentation. Collectively, these two modules constitute our light-weight decoder, with a parameter size of less than 4 M. Our approach is evaluated on two distinct water-body datasets, indicating superiority over other leading segmentation techniques. In addition, our method also demonstrates exemplary efficacy in diverse remote sensing segmentation tasks, such as building extraction and land cover classification. The source codes will be available at https://github.com/zhilyzhang/ESFFNet.

MFVNet: a deep adaptive fusion network with multiple field-of-views for remote sensing image semantic segmentation

Enhanced Multi-Scale Feature Adaptive Fusion Sparse Convolutional Network for Large-Scale Scenes Semantic Segmentation

MFCA-Net: a deep learning method for semantic segmentation of remote sensing images

Encoder- and Decoder-Based Networks Using Multiscale Feature Fusion and Nonlocal Block for Remote Sensing Image Semantic Segmentation

MFAFNet: A Lightweight and Efficient Network with Multi-Level Feature Adaptive Fusion for Real-Time Semantic Segmentation

Deep Multimodal Fusion Network for Semantic Segmentation Using Remote Sensing Image and LiDAR Data

LMFNet: An Efficient Multimodal Fusion Approach for Semantic Segmentation in High-Resolution Remote Sensing

Multi-View Feature Fusion and Rich Information Refinement Network for Semantic Segmentation of Remote Sensing Images

Semantic Segmentation of Very-High-Resolution Remote Sensing Images via Deep Multi-Feature Learning

Multi-Field Context Fusion Network for Semantic Segmentation of High-Spatial-Resolution Remote Sensing Images

A Crossmodal Multiscale Fusion Network for Semantic Segmentation of Remote Sensing Data

A Transformer-based Multi-Modal Fusion Network for Semantic Segmentation of High-Resolution Remote Sensing Imagery

Multi-scale attention fusion network for semantic segmentation of remote sensing images

Enhanced semantic-positional feature fusion network via diverse pre-trained encoders for remote sensing image water-body segmentation

MCAFNet: A Multiscale Channel Attention Fusion Network for Semantic Segmentation of Remote Sensing Images

MFEAFN: Multi-scale feature enhanced adaptive fusion network for image semantic segmentation

GFFNet: Global Feature Fusion Network for Semantic Segmentation of Large-Scale Remote Sensing Images

MAFNet: dual-branch fusion network with multiscale atrous pyramid pooling aggregate contextual features for real-time semantic segmentation

Real-Time Semantic Segmentation via Multiply Spatial Fusion Network

An Attention-Fused Network for Semantic Segmentation of Very-High-Resolution Remote Sensing Imagery

Residual Spatial Fusion Network for RGB-Thermal Semantic Segmentation