Abstract: Existing semantic segmentation work has mainly focused on designing effective decoders; however, the computational load introduced by the overall structure has long been ignored, which hinders deployment on resource-constrained hardware. In this paper, we propose a head-free lightweight architecture specifically for semantic segmentation, named Adaptive Frequency Transformer (AFFormer). AFFormer adopts a parallel architecture that uses prototype representations as specific learnable local descriptions, which replace the decoder and preserve the rich image semantics on high-resolution features. Although removing the decoder eliminates most of the computation, the accuracy of the parallel structure is still limited by low computational resources. Therefore, we employ heterogeneous operators (CNN and vision Transformer) for pixel embedding and prototype representations to further reduce computational cost. Moreover, it is very difficult to linearize the complexity of the vision Transformer in the spatial domain. Because semantic segmentation is very sensitive to frequency information, we construct a lightweight prototype learning block with an adaptive frequency filter of complexity O(n) to replace standard self-attention, which has complexity O(n^2). Extensive experiments on widely adopted datasets demonstrate that AFFormer achieves superior accuracy with only 3M parameters. On the ADE20K dataset, AFFormer achieves 41.8 mIoU at 4.6 GFLOPs, which is 4.4 mIoU higher than Segformer with 45% fewer GFLOPs. On the Cityscapes dataset, AFFormer achieves 78.7 mIoU at 34.4 GFLOPs, which is 2.5 mIoU higher than Segformer with 72.5% fewer GFLOPs. Code is available at https://github.com/dongbo811/AFFormer.
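
As a quick check on the comparisons in the abstract, the short Python sketch below back-computes the Segformer baseline figures implied by the reported gains. The helper name `implied_baseline` is hypothetical and is not part of the AFFormer code. The abstract does not say which Segformer variant is compared, and its figures are rounded, so the results are approximations, not the baseline's published numbers.

```python
def implied_baseline(miou, gflops, miou_gain, flops_reduction_pct):
    """Back-compute the baseline (mIoU, GFLOPs) implied by a reported comparison.

    Assumes "X% fewer GFLOPs" means gflops = baseline_gflops * (1 - X / 100)
    and "Y mIoU higher" means miou = baseline_miou + Y.
    """
    baseline_miou = miou - miou_gain
    baseline_gflops = gflops / (1.0 - flops_reduction_pct / 100.0)
    return baseline_miou, baseline_gflops


# Figures taken directly from the abstract.
ade20k = implied_baseline(miou=41.8, gflops=4.6, miou_gain=4.4, flops_reduction_pct=45.0)
cityscapes = implied_baseline(miou=78.7, gflops=34.4, miou_gain=2.5, flops_reduction_pct=72.5)

print(f"ADE20K baseline:     {ade20k[0]:.1f} mIoU, {ade20k[1]:.1f} GFLOPs")
print(f"Cityscapes baseline: {cityscapes[0]:.1f} mIoU, {cityscapes[1]:.1f} GFLOPs")
# ADE20K baseline:     37.4 mIoU, 8.4 GFLOPs
# Cityscapes baseline: 76.2 mIoU, 125.1 GFLOPs
```

Read this way, the reported savings correspond to roughly a 1.8x reduction in GFLOPs on ADE20K and roughly a 3.6x reduction on Cityscapes.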

EfficientFusion: simple and efficient learning with pixel-level fusion for semantic segmentation

Research of improving semantic image segmentation based on a feature fusion model

Fast Semantic Segmentation of Ultra-High-Resolution Remote Sensing Images via Score Map and Fast Transformer-Based Fusion

ExFuse: Enhancing Feature Fusion for Semantic Segmentation

Enhancing Feature Fusion with Spatial Aggregation and Channel Fusion for Semantic Segmentation

Segformer++: Efficient Token-Merging Strategies for High-Resolution Semantic Segmentation

Hybrid Attention Fusion Embedded in Transformer for Remote Sensing Image Semantic Segmentation

Semantic Image Segmentation with Improved Position Attention and Feature Fusion

Pyramid Fusion Transformer for Semantic Segmentation

Semantic Segmentation via Highly Fused Convolutional Network with Multiple Soft Cost Functions

STransFuse: Fusing Swin Transformer and Convolutional Neural Network for Remote Sensing Image Semantic Segmentation

SFPFusion: An Improved Vision Transformer Combining Super Feature Attention and Wavelet-Guided Pooling for Infrared and Visible Images Fusion

Head-Free Lightweight Semantic Segmentation with Linear Transformer

Efficient Depth Fusion Transformer for Aerial Image Semantic Segmentation

Transformer-Based Cross-Modal Information Fusion Network for Semantic Segmentation

LACTNet: A Lightweight Real-Time Semantic Segmentation Network Based on an Aggregated Convolutional Neural Network and Transformer

CNN and Transformer Fusion for Remote Sensing Image Semantic Segmentation

FusFormer: global and detail feature fusion transformer for semantic segmentation of small objects

Feature Reuse and Fusion for Real-time Semantic segmentation

A Unified Efficient Pyramid Transformer for Semantic Segmentation

Real-Time Semantic Segmentation via Multiply Spatial Fusion Network