SegMAN: Omni-scale Context Modeling with State Space Models and Local Attention for Semantic Segmentation

Yunxiang Fu, Meng Lou, Yizhou Yu
2024-12-16
Abstract: High-quality semantic segmentation relies on three key capabilities: global context modeling, local detail encoding, and multi-scale feature extraction. However, recent methods struggle to possess all these capabilities simultaneously. Hence, we aim to empower segmentation networks to simultaneously carry out efficient global context modeling, high-quality local detail encoding, and rich multi-scale feature representation for varying input resolutions. In this paper, we introduce SegMAN, a novel linear-time model comprising a hybrid feature encoder dubbed SegMAN Encoder, and a decoder based on state space models. Specifically, the SegMAN Encoder synergistically integrates sliding local attention with dynamic state space models, enabling highly efficient global context modeling while preserving fine-grained local details. Meanwhile, the MMSCopE module in our decoder enhances multi-scale context feature extraction and adaptively scales with the input resolution. We comprehensively evaluate SegMAN on three challenging datasets: ADE20K, Cityscapes, and COCO-Stuff. For instance, SegMAN-B achieves 52.6% mIoU on ADE20K, outperforming SegNeXt-L by 1.6% mIoU while reducing computational complexity by over 15% GFLOPs. On Cityscapes, SegMAN-B attains 83.8% mIoU, surpassing SegFormer-B3 by 2.1% mIoU with approximately half the GFLOPs. Similarly, SegMAN-B improves upon VWFormer-B3 by 1.6% mIoU with lower GFLOPs on the COCO-Stuff dataset. Our code is available at https://github.com/yunxiangfu2001/SegMAN.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the fact that existing semantic segmentation methods do not combine global context modeling, high-quality local detail encoding, and multi-scale feature extraction in a single model. With high-resolution inputs in particular, existing methods often struggle to retain global context modeling without incurring high computational cost, and some capture fine-grained local details poorly. To address these issues, the authors propose SegMAN (Omni-scale Context Modeling with State Space Models and Local Attention for Semantic Segmentation), a linear-time model that combines sliding local attention with state space models to provide efficient global context modeling, high-quality local detail encoding, and rich multi-scale feature representation.

The main contributions of SegMAN are:

1. **A new encoder architecture**: its token mixer, LASS, combines local attention with a state space model, so that global context modeling and local detail encoding are both performed at linear-time complexity.
2. **A new decoder module, MMSCopE**: it adaptively extracts context from feature maps at multiple scales; the authors report that it surpasses previous methods in fine-grained detail preservation and omni-scale context learning.
3. **Extensive experimental validation**: the authors report state-of-the-art performance for SegMAN on multiple challenging semantic segmentation benchmarks at competitive computational cost.
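
To make the linear-time claim concrete, here is a toy one-dimensional sketch of the two ingredients named above: sliding-window attention and a state space style recurrence. It is not the SegMAN implementation. The function names, the diagonal (elementwise) state, and the use of precomputed gates `a`, `b`, `c` are simplifying assumptions made for illustration; the paper works on 2D feature maps.

```python
import numpy as np

def local_attention_1d(x, window=3):
    """Sliding-window self-attention over a sequence x of shape (L, d).

    Each token attends only to neighbours within window // 2 positions,
    so the cost is O(L * window * d): linear in the sequence length L.
    """
    L, d = x.shape
    r = window // 2
    out = np.empty_like(x)
    for i in range(L):
        lo, hi = max(0, i - r), min(L, i + r + 1)
        keys = x[lo:hi]                        # local neighbourhood, (w, d)
        scores = keys @ x[i] / np.sqrt(d)      # similarity to the query token
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()               # softmax over the window
        out[i] = weights @ keys
    return out

def selective_scan_1d(x, a, b, c):
    """Linear recurrence h_t = a_t * h_{t-1} + b_t * x_t, y_t = c_t * h_t.

    x, a, b, c all have shape (L, d). In a selective state space model the
    gates depend on the input; here they are simply passed in (assumption).
    One pass over the sequence carries information from every earlier
    token, giving global context in O(L * d).
    """
    L, d = x.shape
    h = np.zeros(d)
    y = np.empty_like(x)
    for t in range(L):
        h = a[t] * h + b[t] * x[t]
        y[t] = c[t] * h
    return y
```

In this sketch the attention step sees only a fixed-size neighbourhood, while the scan propagates a running state across the whole sequence, which is the intuition behind pairing the two in one token mixer.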
### Key technical points in the paper

- **LASS (Local Attention and State Space) module**: combines local attention (Natten) with the two-dimensional selective scan block (SS2D) to capture global context and local details efficiently at linear-time complexity.
- **MMSCopE (Mamba-based Multi-Scale Context Extraction) module**: scans multi-scale feature maps with Mamba's dynamic state space model to extract rich multi-scale context information.

### Experimental results

SegMAN performs strongly on ADE20K, Cityscapes, and COCO-Stuff, reaching high mIoU (mean Intersection over Union) at lower computational cost. For example, on ADE20K, SegMAN-B reaches 52.6% mIoU, 1.6% mIoU higher than SegNeXt-L, while using over 15% fewer GFLOPs.

In conclusion, through its encoder and decoder designs, SegMAN addresses the shortcomings of existing semantic segmentation methods in global context modeling, local detail encoding, and multi-scale feature extraction, providing a more efficient and accurate solution for semantic segmentation.
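
For reference, the mIoU metric quoted in the results can be computed from a confusion matrix as in the minimal sketch below. The function name `mean_iou` is hypothetical. Benchmark toolkits typically also skip an ignore label and accumulate counts over the whole validation set; both are omitted here for brevity.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union for integer label arrays of equal shape."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    # confusion[i, j] = number of pixels with true class i predicted as class j
    confusion = np.bincount(
        target * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)
    intersection = np.diag(confusion)
    union = confusion.sum(axis=0) + confusion.sum(axis=1) - intersection
    valid = union > 0                      # skip classes absent from both arrays
    return float((intersection[valid] / union[valid]).mean())
```

Per-class IoU is the number of correctly labelled pixels of a class divided by the number of pixels that either the prediction or the ground truth assigns to that class; mIoU averages this over the classes that occur.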