ELiFormer: A Hierarchical Transformer Based Model with Efficient Encoder and Lightweight Decoder for Semantic Segmentation.

Zixuan Wu, Yue Zhou
DOI: https://doi.org/10.1145/3663976.3663985
2024-01-01
Abstract: Fully convolutional networks have limited capacity for modeling global information, while Transformers can be computationally demanding. This paper introduces ELiFormer, a lightweight model that combines Transformer and CNN components for semantic segmentation. We start by using a lightweight inverted residual module to extract initial features, aiming to reduce computational complexity and improve model efficiency. Within the encoder module, we adopt depth-wise separable convolution for flexible position embedding, notably improving segmentation accuracy for small objects. We also employ two self-attention mechanisms, local window multi-head self-attention and global window multi-head self-attention, which generate multi-scale feature maps and enhance the encoder’s ability to capture complex relationships. Dropout layers are introduced to reduce computational demand during training. To address the semantic disparity between the CNN and the visual Transformer, we introduce a fusion block. Finally, we utilize a lightweight MLP decoder to streamline the model and ensure robust segmentation performance. Our extensive ablation study shows that ELiFormer attains significant semantic segmentation results that are competitive on the Cityscapes dataset.
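As a rough illustration of why depth-wise separable convolution is a common choice for lightweight designs like the one described in the abstract, the sketch below compares weight counts for a standard convolution and a depth-wise separable one. The channel counts and kernel size are hypothetical values chosen for illustration; they are not taken from the paper, and the function names are our own.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a depth-wise k x k conv followed by a 1 x 1 point-wise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes, assumed only for this example.
c_in, c_out, k = 64, 64, 3
standard = standard_conv_params(c_in, c_out, k)
separable = depthwise_separable_params(c_in, c_out, k)
ratio = separable / standard  # equals 1/c_out + 1/k**2

print(f"standard: {standard}, separable: {separable}, ratio: {ratio:.3f}")
```

Under these assumed sizes the separable layer needs roughly one eighth of the weights of the standard layer, which is the general saving such designs rely on; the actual figures for ELiFormer depend on its real layer configuration.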