Abstract: High-quality semantic segmentation relies on three key capabilities: global context modeling, local detail encoding, and multi-scale feature extraction. However, recent methods struggle to possess all these capabilities simultaneously. Hence, we aim to empower segmentation networks to simultaneously carry out efficient global context modeling, high-quality local detail encoding, and rich multi-scale feature representation for varying input resolutions. In this paper, we introduce SegMAN, a novel linear-time model comprising a hybrid feature encoder dubbed SegMAN Encoder, and a decoder based on state space models. Specifically, the SegMAN Encoder synergistically integrates sliding local attention with dynamic state space models, enabling highly efficient global context modeling while preserving fine-grained local details. Meanwhile, the MMSCopE module in our decoder enhances multi-scale context feature extraction and adaptively scales with the input resolution. We comprehensively evaluate SegMAN on three challenging datasets: ADE20K, Cityscapes, and COCO-Stuff. For instance, SegMAN-B achieves 52.6% mIoU on ADE20K, outperforming SegNeXt-L by 1.6% mIoU while reducing computational complexity by over 15% GFLOPs. On Cityscapes, SegMAN-B attains 83.8% mIoU, surpassing SegFormer-B3 by 2.1% mIoU with approximately half the GFLOPs. Similarly, SegMAN-B improves upon VWFormer-B3 by 1.6% mIoU with lower GFLOPs on the COCO-Stuff dataset. Our code is available at https://github.com/yunxiangfu2001/SegMAN.

What problem does this paper attempt to address?

This paper addresses the inability of existing semantic segmentation methods to combine global context modeling, high-quality local detail encoding, and multi-scale feature extraction in a single model. In particular, when dealing with high-resolution inputs, existing methods often struggle to maintain global context modeling and face high computational complexity. In addition, some methods capture fine-grained local details poorly. To address these issues, the authors propose SegMAN (Omni-scale Context Modeling with State Space Models and Local Attention for Semantic Segmentation), a novel linear-time model that combines sliding local attention with state space models to achieve efficient global context modeling, high-quality local detail encoding, and rich multi-scale feature representation. The main contributions of SegMAN are:

1. **A new encoder architecture**: Its token mixer, LASS, combines local attention with a state space model, performing global context modeling and local detail encoding efficiently while maintaining linear-time complexity.
2. **A new decoder module, MMSCopE**: This module adaptively extracts context information from feature maps at different scales, surpassing previous methods in fine-grained detail preservation and full-scale context learning.
3. **Extensive experimental validation**: SegMAN achieves state-of-the-art performance on multiple challenging semantic segmentation benchmarks while maintaining competitive computational efficiency.
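
To make the linear-time claim concrete, the following is a minimal toy sketch of a token mixer that pairs sliding local attention with a linear recurrence. It is not the authors' implementation: the function names (`local_attention`, `linear_scan`, `toy_token_mixer`), the 1D scalar tokens, the fixed `decay`, and the series composition with a residual-style sum are all assumptions made for illustration. A real selective state space model makes its parameters depend on the input and operates on 2D feature maps.

```python
import math

def local_attention(x, radius=1):
    """Sliding-window self-attention over a 1D sequence of scalar tokens.

    Each token attends only to neighbours within `radius`, so the cost is
    O(n * window) rather than O(n^2).
    """
    out = []
    for i in range(len(x)):
        lo, hi = max(0, i - radius), min(len(x), i + radius + 1)
        window = x[lo:hi]
        scores = [x[i] * k for k in window]          # query-key products
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]  # numerically stable softmax
        total = sum(weights)
        out.append(sum(w * v for w, v in zip(weights, window)) / total)
    return out

def linear_scan(x, decay=0.9):
    """Recurrence h_t = decay * h_{t-1} + (1 - decay) * x_t in one O(n) pass.

    The hidden state carries information from all earlier tokens, which is
    how a scan provides long-range context at linear cost. The fixed `decay`
    is a simplification (assumption); it is not input-dependent here.
    """
    h, out = 0.0, []
    for v in x:
        h = decay * h + (1.0 - decay) * v
        out.append(h)
    return out

def toy_token_mixer(x, radius=1, decay=0.9):
    """Local detail first, then global context, combined by addition."""
    local = local_attention(x, radius)
    global_ctx = linear_scan(local, decay)
    return [a + b for a, b in zip(local, global_ctx)]

if __name__ == "__main__":
    print(toy_token_mixer([1.0, 2.0, 3.0, 4.0]))
```

For a fixed window radius, both stages touch each token a constant number of times, so the total cost grows linearly with sequence length. The scan here runs in one direction only, so each token sees only the tokens before it.
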
### Key technical points in the paper

- **LASS (Local Attention and State Space) module**: It combines local attention (Natten) with the two-dimensional selective scanning block (SS2D), capturing global context and local details efficiently while maintaining linear-time complexity.
- **MMSCopE (Mamba-based Multi-Scale Context Extraction) module**: It scans multi-scale feature maps with Mamba's dynamic state space model to extract rich multi-scale context information.

### Experimental results

SegMAN performs strongly on ADE20K, Cityscapes, and COCO-Stuff, achieving high mIoU (mean Intersection over Union) at lower computational cost. For example, on ADE20K, SegMAN-B reaches 52.6% mIoU, 1.6% mIoU higher than SegNeXt-L, while reducing GFLOPs by more than 15%.

In conclusion, through its encoder and decoder designs, SegMAN addresses the shortcomings of existing semantic segmentation methods in global context modeling, local detail encoding, and multi-scale feature extraction, providing a more efficient and accurate solution for semantic segmentation.
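
Since all reported results are in mIoU, a short sketch of how that metric is computed may help. This is a generic illustration, not the paper's evaluation code: the function name `mean_iou`, the flat label lists, and the `ignore_index=255` convention are assumptions. Benchmark toolkits commonly accumulate intersections and unions over the whole dataset before dividing, whereas this sketch scores a single pair of label maps.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean Intersection over Union for flat lists of class labels.

    For each class c, IoU = |pred == c AND target == c| / |pred == c OR target == c|.
    Classes that appear in neither pred nor target are skipped.
    """
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:      # unlabeled pixel: excluded from scoring
            continue
        if p == t:
            inter[t] += 1
            union[t] += 1          # counted once in the union
        else:
            union[p] += 1          # false positive for class p
            union[t] += 1          # false negative for class t
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")

if __name__ == "__main__":
    pred   = [0, 0, 1, 1]
    target = [0, 1, 1, 1]
    # Class 0: IoU = 1/2; class 1: IoU = 2/3; mean = 7/12
    print(mean_iou(pred, target, num_classes=2))
```

Because the metric averages over classes rather than pixels, a rare class that is segmented poorly lowers the score as much as a common one.
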

SegMAN: Omni-scale Context Modeling with State Space Models and Local Attention for Semantic Segmentation

Deep Dual-Stream Network with Scale Context Selection Attention Module for Semantic Segmentation

Enhanced Multi-Scale Feature Adaptive Fusion Sparse Convolutional Network for Large-Scale Scenes Semantic Segmentation

MetaSeg: MetaFormer-based Global Contexts-aware Network for Efficient Semantic Segmentation

MANet: Fine-Tuning Segment Anything Model for Multimodal Remote Sensing Semantic Segmentation

LMANet: A Lightweight Asymmetric Semantic Segmentation Network Based on Multi-Scale Feature Extraction

SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers

Attention Guided Global Enhancement and Local Refinement Network for Semantic Segmentation

Multiattention Network for Semantic Segmentation of Fine-Resolution Remote Sensing Images

A Deep Semantic Segmentation Network with Semantic and Contextual Refinements

Multi-Attention-Network for Semantic Segmentation of Fine Resolution Remote Sensing Images

CMANet: Cross-Modality Attention Network for Indoor-Scene Semantic Segmentation

SAN: Side Adapter Network for Open-vocabulary Semantic Segmentation

CDMANet: central difference mutual attention network for RGB-D semantic segmentation

Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation

Lightweight semantic segmentation network with configurable context and small object attention

MEDANet: More Efficient Dual Attention Network for Scene Segmentation

Simple Scalable Multimodal Semantic Segmentation Model

Compensating for Local Ambiguity With Encoder-Decoder in Urban Scene Segmentation

Semantic Segmentation Via Structured Patch Prediction, Context Crf And Guidance Crf

MultiSenseSeg: A Cost-Effective Unified Multimodal Semantic Segmentation Model for Remote Sensing