Abstract: As remote sensing imaging technology continues to advance, processing high-resolution and diverse satellite imagery to improve segmentation accuracy and interpretation efficiency has emerged as a pivotal research area in remote sensing. Although segmentation algorithms based on CNNs and Transformers have achieved significant progress in performance, balancing segmentation accuracy against computational complexity remains challenging, which limits their wide application in practical tasks. To address this, this paper introduces a state space model (SSM) and proposes a novel hybrid semantic segmentation network based on Vision Mamba (CVMH-UNet). The method designs a cross-scanning visual state space block (CVSSBlock) that uses cross 2D scanning (CS2D) to capture global information from multiple directions and incorporates convolutional neural network branches to overcome the limitations of Vision Mamba (VMamba) in acquiring local information, enabling a comprehensive analysis of both global and local features. Furthermore, to address the limited discriminative power of direct skip connections and the difficulty of achieving detailed fusion with them, a multi-frequency multi-scale feature fusion block (MFMSBlock) is designed. This module introduces multi-frequency information through the 2D discrete cosine transform (2D DCT) to enhance information utilization and provides local detail information at additional scales through point-wise convolution branches. Finally, it aggregates multi-scale information along the channel dimension, achieving refined feature fusion. Experiments on well-known remote sensing image datasets demonstrate that the proposed CVMH-UNet achieves superior segmentation performance while maintaining low computational complexity, outperforming current state-of-the-art segmentation algorithms.
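To make the multi-frequency idea concrete, the following is a minimal sketch of how 2D DCT basis functions can turn a feature map into per-channel frequency descriptors. It is not the paper's MFMSBlock: the function names `dct2_basis` and `multi_frequency_descriptor`, the channel-grouping scheme, and the chosen frequency indices are all assumptions made for illustration. The sketch also shows why the lowest frequency (0, 0) reduces to a scaled global average pooling, so higher frequencies carry information that plain average pooling discards.

```python
import numpy as np


def dct2_basis(u, v, height, width):
    """Unnormalized 2D DCT-II basis function for frequency index (u, v)."""
    rows = np.cos(np.pi * u * (np.arange(height) + 0.5) / height)
    cols = np.cos(np.pi * v * (np.arange(width) + 0.5) / width)
    return np.outer(rows, cols)  # shape: (height, width)


def multi_frequency_descriptor(features, freqs):
    """Project channel groups of a (C, H, W) feature map onto DCT bases.

    Hypothetical scheme: channels are split into len(freqs) groups and each
    group is projected onto one frequency component. Returns a (C,) vector.
    """
    channels, height, width = features.shape
    groups = np.array_split(np.arange(channels), len(freqs))
    descriptor = np.empty(channels)
    for idx, (u, v) in zip(groups, freqs):
        basis = dct2_basis(u, v, height, width)
        # Weighted sum over the spatial dimensions for every channel in the group.
        descriptor[idx] = (features[idx] * basis).sum(axis=(1, 2))
    return descriptor


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.standard_normal((8, 4, 4))
    # Assumed frequency indices, chosen only for this example.
    desc = multi_frequency_descriptor(feats, [(0, 0), (0, 1), (1, 0), (1, 1)])
    print(desc.shape)
```

In this sketch the first channel group uses frequency (0, 0), whose basis is all ones, so its descriptor equals the spatial sum of each channel (global average pooling times H×W). The remaining groups respond to horizontal, vertical, and diagonal variation respectively.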

Remote Sensing Image Segmentation Using Vision Mamba and Multi-Scale Multi-Frequency Feature Fusion
