HAFormer: Unleashing the Power of Hierarchy-Aware Features for Lightweight Semantic Segmentation

Guoan Xu, Wenjing Jia, Tao Wu, Ligeng Chen, Guangwei Gao

DOI: https://doi.org/10.1109/TIP.2024.3425048

2024-07-11

Abstract: Both Convolutional Neural Networks (CNNs) and Transformers have shown great success in semantic segmentation tasks. Efforts have been made to integrate CNNs with Transformer models to capture both local and global context interactions. However, there is still room for improvement, particularly under constraints on computational resources. In this paper, we introduce HAFormer, a model that combines the hierarchical feature extraction ability of CNNs with the global dependency modeling capability of Transformers to tackle lightweight semantic segmentation challenges. Specifically, we design a Hierarchy-Aware Pixel-Excitation (HAPE) module for adaptive multi-scale local feature extraction. For global perception modeling, we devise an Efficient Transformer (ET) module that streamlines the quadratic calculations associated with traditional Transformers. Moreover, a correlation-weighted Fusion (cwF) module selectively merges diverse feature representations, significantly enhancing predictive accuracy. HAFormer achieves high performance with minimal computational overhead and a compact model size, reaching 74.2% mIoU on the Cityscapes test set and 71.1% mIoU on the CamVid test set, with frame rates of 105 FPS and 118 FPS, respectively, on a single 2080Ti GPU. The source code is available at https://github.com/XU-GITHUB-curry/HAFormer.
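The abstract reports accuracy as mIoU (mean Intersection over Union). As a minimal sketch of how that metric is commonly computed from a confusion matrix, the NumPy snippet below may help; the function names `confusion_matrix` and `mean_iou` and the toy labels are illustrative assumptions, not taken from the HAFormer code base. The handling of "ignore" labels that benchmark evaluation usually requires is omitted.

```python
import numpy as np

def confusion_matrix(pred, target, num_classes):
    """Count label pairs; rows are ground-truth classes, columns are predictions."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    # Encode each (target, pred) pair as a single index, then histogram it.
    idx = target * num_classes + pred
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def mean_iou(cm):
    """Average per-class IoU = TP / (TP + FP + FN) over classes that occur."""
    tp = np.diag(cm).astype(float)
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp
    valid = union > 0  # skip classes absent from both prediction and target
    return float((tp[valid] / union[valid]).mean())

# Toy example (hypothetical labels): 6 pixels, 3 classes.
target = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(pred, target, num_classes=3)
print(mean_iou(cm))  # per-class IoUs are 1/3, 2/3, 1/2, so the mean is 0.5
```

In practice the confusion matrix is accumulated over the whole test set before the per-class IoUs are averaged, rather than averaging per-image scores.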

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

The paper aims to develop a lightweight semantic segmentation model that combines the strengths of Convolutional Neural Networks (CNNs) and Transformers to achieve accurate segmentation with minimal computational overhead and a compact model size. Specifically, it proposes the **HAFormer** model, which addresses limitations of existing methods through three main contributions:

1. **Hierarchy-Aware Pixel-Excitation (HAPE) Module**: Uses hierarchical and content-aware attention mechanisms to reduce computational load while extracting deep semantic information across different receptive fields.
2. **Correlation-Weighted Fusion (cwF) Module**: Combines the local and global contextual features learned by the CNN and Transformer branches, significantly improving accuracy.
3. **Efficient Transformer (ET) Module**: Decomposes the Q, K, and V matrices to address the quadratic computational complexity of traditional Transformer models.

With these components, HAFormer achieves 74.2% mIoU on the Cityscapes test set and 71.1% mIoU on the CamVid test set, at 105 FPS and 118 FPS, respectively, on a single 2080Ti GPU. This shows that HAFormer maintains high accuracy while remaining computationally efficient.
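The summary does not spell out how the Efficient Transformer decomposes Q, K, and V, so the snippet below is not the paper's ET module. It is a generic NumPy sketch, under stated assumptions, of why shrinking the key/value set reduces attention cost: standard self-attention over N tokens builds an N×N score matrix, while pooling keys and values down to M tokens yields an N×M matrix. The helper `pool_tokens`, the average-pooling reduction, and all sizes are hypothetical, and learned projections are omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention; also returns the score-matrix shape."""
    scores = q @ k.T / np.sqrt(q.shape[-1])  # shape: (num_queries, num_keys)
    return softmax(scores) @ v, scores.shape

def pool_tokens(x, ratio):
    """Hypothetical reduction: average every `ratio` consecutive tokens."""
    n, d = x.shape
    usable = n - n % ratio  # drop a ragged tail, if any
    return x[:usable].reshape(-1, ratio, d).mean(axis=1)

rng = np.random.default_rng(0)
n_tokens, dim, ratio = 64, 16, 4  # illustrative sizes only
x = rng.standard_normal((n_tokens, dim))

# Full self-attention: the score matrix grows quadratically with token count.
full_out, full_shape = attention(x, x, x)

# Reduced attention: queries keep full resolution, keys/values are pooled.
kv = pool_tokens(x, ratio)
red_out, red_shape = attention(x, kv, kv)

print(full_shape, red_shape)  # (64, 64) (64, 16)
```

Both variants return one output vector per query token, so the reduced form can stand in for full attention in a segmentation backbone while storing `ratio` times fewer attention scores.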

EHANet: Efficient Hybrid Attention Network Towards Real-time Semantic Segmentation

Real-time Semantic Segmentation with Weighted Factorized-Depthwise Convolution

HD-Former: A hierarchical dependency Transformer for medical image segmentation

SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers

Enhanced Multi-Scale Feature Adaptive Fusion Sparse Convolutional Network for Large-Scale Scenes Semantic Segmentation

H2Former: An Efficient Hierarchical Hybrid Transformer for Medical Image Segmentation

HGFormer: Hierarchical Grouping Transformer for Domain Generalized Semantic Segmentation

UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery

TBFormer: three-branch efficient transformer for semantic segmentation

Enhancing Mask Transformer with Auxiliary Convolution Layers for Semantic Segmentation

LACTNet: A Lightweight Real-Time Semantic Segmentation Network Based on an Aggregated Convolutional Neural Network and Transformer

UNeXt: An Efficient Network for the Semantic Segmentation of High-Resolution Remote Sensing Images

Transformer and CNN Hybrid Deep Neural Network for Semantic Segmentation of Very-High-Resolution Remote Sensing Imagery

TCNet: Multiscale Fusion of Transformer and CNN for Semantic Segmentation of Remote Sensing Images

Head-Free Lightweight Semantic Segmentation with Linear Transformer

Hybrid Dilated Convolution Network Using Attentive Kernels for Real-Time Semantic Segmentation

Lightweight Real-time Semantic Segmentation Network with Efficient Transformer and CNN

MECPformer: Multi-estimations Complementary Patch with CNN-Transformers for Weakly Supervised Semantic Segmentation

MetaSeg: MetaFormer-based Global Contexts-aware Network for Efficient Semantic Segmentation

OneFormer3D: One Transformer for Unified Point Cloud Segmentation