SCASeg: Strip Cross-Attention for Efficient Semantic Segmentation

Guoan Xu, Jiaming Chen, Wenfeng Huang, Wenjing Jia, Guangwei Gao, Guo-Jun Qi
2024-11-26
Abstract: The Vision Transformer (ViT) has achieved notable success in computer vision, with its variants extensively validated across various downstream tasks, including semantic segmentation. However, designed as general-purpose visual encoders, ViT backbones often overlook the specific needs of task decoders, revealing opportunities to design decoders tailored to efficient semantic segmentation. This paper proposes Strip Cross-Attention (SCASeg), an innovative decoder head explicitly designed for semantic segmentation. Instead of relying on the simple conventional skip connections, we employ lateral connections between the encoder and decoder stages, using encoder features as Queries for the cross-attention modules. Additionally, we introduce a Cross-Layer Block that blends hierarchical feature maps from different encoder and decoder stages to create a unified representation for Keys and Values. To further boost computational efficiency, SCASeg compresses queries and keys into strip-like patterns to optimize memory usage and inference speed over the traditional vanilla cross-attention. Moreover, the Cross-Layer Block incorporates the local perceptual strengths of convolution, enabling SCASeg to capture both global and local context dependencies across multiple layers. This approach facilitates effective feature interaction at different scales, improving the overall performance. Experiments show that the adaptable decoder of SCASeg produces competitive performance across different setups, surpassing leading segmentation architectures on all benchmark datasets, including ADE20K, Cityscapes, COCO-Stuff 164k, and Pascal VOC2012, even under varying computational limitations.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper asks how to design a decoder for semantic segmentation that is both efficient and high-performing. Although the Vision Transformer (ViT) and its variants have been widely validated across downstream tasks, ViT backbones are designed as general-purpose visual encoders and often overlook the requirements of task-specific decoders. This leaves room for a decoder designed specifically for efficient semantic segmentation. The paper proposes a decoder head named Strip Cross-Attention (SCASeg), which targets the following problems:

1. **Limitations of traditional decoders**: Traditional decoder designs rely on simple skip connections, which may not be sufficient to fuse feature information from different levels effectively. SCASeg instead uses lateral connections between encoder and decoder stages, with encoder features serving as Queries in the cross-attention module.
2. **Combination of global and local information**: Transformers excel at capturing long-range dependencies but are weaker at perceiving local information. SCASeg introduces a Cross-Layer Block (CLB) that combines global and local information, improving overall performance.
3. **Improvement of computational efficiency**: SCASeg compresses Queries and Keys into strip-like patterns, reducing memory usage and speeding up inference. In addition, the Local Perception Module (LPM) in the CLB uses convolution to further strengthen local perception.

With these changes, SCASeg is reported to perform strongly on multiple benchmark datasets (ADE20K, Cityscapes, COCO-Stuff 164k, and Pascal VOC2012) and to outperform existing leading segmentation architectures even under limited computational budgets.
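
The idea of compressing Queries and Keys into strips can be illustrated with a toy example. The text above does not say how the compression is performed, so this sketch assumes one plausible reading: each Query and Key token is squeezed from `c` channels to a single scalar by a weight vector, while Values keep all channels. The function names, the weight vectors `w_q` and `w_k`, and the cost formula are hypothetical illustrations, not the authors' implementation.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def strip_cross_attention(queries, keys, values, w_q, w_k):
    """Toy strip-style cross-attention (an assumption, not the paper's code).

    queries:      n_q x c list of encoder tokens
    keys, values: n_k x c lists of fused cross-layer tokens
    w_q, w_k:     length-c weight vectors that squeeze each token to one scalar
    """
    # Compress every Query and Key token from c channels to a single value.
    q_strip = [sum(x * w for x, w in zip(tok, w_q)) for tok in queries]
    k_strip = [sum(x * w for x, w in zip(tok, w_k)) for tok in keys]

    out, attn = [], []
    for q in q_strip:
        # Each score is one scalar product instead of a c-dimensional dot product.
        weights = softmax([q * k for k in k_strip])
        attn.append(weights)
        # Values keep all c channels, so the output width matches the input width.
        out.append([sum(a * v[j] for a, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out, attn


def score_multiplies(n_q, n_k, c, strip):
    """Multiplications needed to form the n_q x n_k score matrix."""
    if strip:
        # Squeeze both sides first, then take scalar products.
        return (n_q + n_k) * c + n_q * n_k
    # Vanilla attention: a full c-dimensional dot product per score.
    return n_q * n_k * c


def random_matrix(rows, cols):
    """Random test data; stands in for real feature maps."""
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
n_q, n_k, c = 6, 4, 8
Q, K, V = random_matrix(n_q, c), random_matrix(n_k, c), random_matrix(n_k, c)
w_q = [random.gauss(0, 1) for _ in range(c)]
w_k = [random.gauss(0, 1) for _ in range(c)]

out, attn = strip_cross_attention(Q, K, V, w_q, w_k)
print("output shape:", len(out), "x", len(out[0]))
print("vanilla score multiplies:", score_multiplies(1024, 1024, 64, strip=False))
print("strip score multiplies:  ", score_multiplies(1024, 1024, 64, strip=True))
```

Under this assumption, the cost of forming the attention scores no longer scales with the channel count for every Query-Key pair, which is one way the reported memory and speed gains could arise. The actual module may compress along a different axis or use a different operator.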