Abstract: The Vision Transformer (ViT) has achieved notable success in computer vision, with its variants extensively validated across various downstream tasks, including semantic segmentation. However, designed as general-purpose visual encoders, ViT backbones often overlook the specific needs of task decoders, revealing opportunities to design decoders tailored to efficient semantic segmentation. This paper proposes Strip Cross-Attention (SCASeg), an innovative decoder head explicitly designed for semantic segmentation. Instead of relying on the simple conventional skip connections, we employ lateral connections between the encoder and decoder stages, using encoder features as Queries for the cross-attention modules. Additionally, we introduce a Cross-Layer Block that blends hierarchical feature maps from different encoder and decoder stages to create a unified representation for Keys and Values. To further boost computational efficiency, SCASeg compresses queries and keys into strip-like patterns to optimize memory usage and inference speed over the traditional vanilla cross-attention. Moreover, the Cross-Layer Block incorporates the local perceptual strengths of convolution, enabling SCASeg to capture both global and local context dependencies across multiple layers. This approach facilitates effective feature interaction at different scales, improving the overall performance. Experiments show that the adaptable decoder of SCASeg produces competitive performance across different setups, surpassing leading segmentation architectures on all benchmark datasets, including ADE20K, Cityscapes, COCO-Stuff 164k, and Pascal VOC2012, even under varying computational limitations.

What problem does this paper attempt to address?

This paper addresses how to design a decoder for semantic segmentation that is both efficient and high-performing. Although the Vision Transformer (ViT) has achieved remarkable success in computer vision and its variants have been widely validated on downstream tasks, ViT backbones are designed as general-purpose visual encoders and often ignore the requirements of task-specific decoders. This leaves room to design decoders specifically for efficient semantic segmentation.

The paper proposes a novel decoder head named Strip Cross-Attention (SCASeg), which targets the following problems:

1. **Limitations of traditional decoders**: Traditional decoder designs rely on simple skip connections, which may not be sufficient to fuse feature information from different levels effectively. SCASeg strengthens feature interaction by introducing lateral connections between encoder and decoder stages and using encoder features as Queries in the cross-attention module.
2. **Combination of global and local information**: Transformers excel at capturing long-range dependencies but are weaker at perceiving local information. SCASeg combines global and local information through a Cross-Layer Block (CLB), which blends hierarchical feature maps from different encoder and decoder stages into a unified representation for Keys and Values.
3. **Improvement of computational efficiency**: SCASeg compresses Queries and Keys into strip-like patterns, reducing memory usage and improving inference speed relative to vanilla cross-attention. In addition, the Local Perception Module (LPM) in the CLB uses convolution to further enhance local perception.

With these designs, the paper reports that SCASeg achieves strong performance on multiple benchmark datasets (ADE20K, Cityscapes, COCO-Stuff 164k, and Pascal VOC2012) and outperforms leading segmentation architectures even under limited computational budgets.
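To make the idea of strip-shaped Queries and Keys more concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes that "strip-like" means projecting Queries and Keys down to a single channel, so that each becomes an N×1 strip, and it uses plain linear projections. The function name `strip_cross_attention` and the weight names `w_q`, `w_k`, `w_v` are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def strip_cross_attention(enc_feat, fused_feat, w_q, w_k, w_v):
    """Cross-attention with single-channel ("strip") Queries and Keys.

    enc_feat:   (Nq, C) encoder tokens, used as Queries.
    fused_feat: (Nk, C) fused multi-stage tokens, used for Keys and Values.
    w_q, w_k:   (C, 1) projections; the 1-channel width is an assumption.
    w_v:        (C, C) Value projection.
    Returns the attended features (Nq, C) and the attention map (Nq, Nk).
    """
    q = enc_feat @ w_q      # (Nq, 1) query strip
    k = fused_feat @ w_k    # (Nk, 1) key strip
    v = fused_feat @ w_v    # (Nk, C) values keep the full channel width
    # With one channel, the score matrix is an outer product of two strips;
    # the usual 1/sqrt(d) scaling equals 1 because d = 1.
    scores = q @ k.T        # (Nq, Nk)
    attn = softmax(scores, axis=-1)
    return attn @ v, attn


rng = np.random.default_rng(0)
C = 8
enc = rng.standard_normal((6, C))      # 6 query tokens from an encoder stage
fused = rng.standard_normal((10, C))   # 10 key/value tokens from fused features
out, attn = strip_cross_attention(
    enc,
    fused,
    rng.standard_normal((C, 1)),
    rng.standard_normal((C, 1)),
    rng.standard_normal((C, C)),
)
print(out.shape, attn.shape)
```

Under this assumption, the Query and Key projections store one value per token rather than C values, which is where the memory saving would come from. The attention map itself is still Nq×Nk in this sketch.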

SCASeg: Strip Cross-Attention for Efficient Semantic Segmentation

SegViTv2: Exploring Efficient and Continual Semantic Segmentation with Plain Vision Transformers

A Lightweight Network with Attention Decoder for Real-Time Semantic Segmentation

SegViT: Semantic Segmentation with Plain Vision Transformers

SCFI-ESeg: Enhancing Semantic Segmentation with Spatial and Content Feature Integration

SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation

Semantic segmentation using cross-stage feature reweighting and efficient self-attention

Semantic Segmentation Based on Vision Transformer Via Interactive Attention

A Transformer-Based Decoder for Semantic Segmentation with Multi-level Context Mining

Sed: Searching Enhanced Decoder with Switchable Skip Connection for Semantic Segmentation

Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers

BAGNet: Branch Attention Guided Decoder for Semantic Segmentation

Effective SAM Combination for Open-Vocabulary Semantic Segmentation

Bilateral attention decoder: A lightweight decoder for real-time semantic segmentation

SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers

MSGFormer: A DeepLabv3+ Like Semantically Masked and Pixel Contrast Transformer for MouseHole Segmentation

Learning Cross-Channel Representations for Semantic Segmentation

Rethinking Decoders for Transformer-based Semantic Segmentation: Compression is All You Need

FeedFormer: Revisiting Transformer Decoder for Efficient Semantic Segmentation

A Fast Attention-Guided Hierarchical Decoding Network for Real-Time Semantic Segmentation

LKASeg: Remote-Sensing Image Semantic Segmentation with Large Kernel Attention and Full-Scale Skip Connections