EfficientFusion: simple and efficient learning with pixel-level fusion for semantic segmentation

Ping Liu, Shuaijie Tian, Yu Gao, Yuting Xie, Shufeng Hao
DOI: https://doi.org/10.1007/s00530-024-01497-4
IF: 3.9
2024-11-28
Multimedia Systems
Abstract: Semantic segmentation aims to help computers better understand images. The introduction of the Vision Transformer has shifted many downstream computer vision tasks, semantic segmentation in particular, from traditional CNN architectures to Transformer architectures. However, the patch-based strategy in the Vision Transformer has two limitations: it produces incoherent contextual information across patches, and it generates a redundant number of patches. To address these challenges, we propose a Pixel-Level Fusion block. This block enhances the contextual relationship between patches and uses a similarity algorithm to merge redundant patches, reducing the overall number of patches. On the COCO-Stuff10k [33] dataset, our method shows significant improvements over the state of the art: a 19.4% increase in mIoU together with a 21.6% improvement in GPU inference speed.
computer science, information systems, theory & methods
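The patch-merging idea mentioned in the abstract can be illustrated with a small sketch. This is not the paper's Pixel-Level Fusion block: the abstract does not say which similarity algorithm is used, so the cosine similarity measure, the greedy grouping, the averaging of merged patches, the `threshold` parameter, and the function name `merge_similar_patches` are all assumptions made for illustration only.

```python
import numpy as np


def merge_similar_patches(tokens: np.ndarray, threshold: float = 0.9) -> np.ndarray:
    """Greedily merge patch tokens whose cosine similarity is >= threshold.

    Hypothetical illustration, not the method from the paper.
    tokens: (N, D) array of patch embeddings, with N >= 1.
    Returns an (M, D) array with M <= N; each row is the mean of one group.
    """
    # Normalise rows so that dot products equal cosine similarities.
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    unit = tokens / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T

    merged = []
    used = np.zeros(len(tokens), dtype=bool)
    for i in range(len(tokens)):
        if used[i]:
            continue
        # Group token i with every not-yet-used token that is similar enough.
        group = (~used) & (sim[i] >= threshold)
        group[i] = True
        used |= group
        # Represent the whole group by its mean embedding.
        merged.append(tokens[group].mean(axis=0))
    return np.stack(merged)


# Five toy 2-D "patch embeddings": two near-duplicate pairs and one outlier.
patches = np.array([
    [1.0, 0.0],
    [0.99, 0.01],
    [0.0, 1.0],
    [0.02, 1.0],
    [-1.0, 0.0],
])
reduced = merge_similar_patches(patches, threshold=0.95)
print(patches.shape, "->", reduced.shape)  # (5, 2) -> (3, 2)
```

In this toy input the two near-duplicate pairs collapse into one token each, so five patches become three. Fewer tokens means less work in later attention layers, which is the general intuition behind trading redundant patches for inference speed.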