Abstract: It is commonly believed that high internal resolution combined with expensive operations (e.g. atrous convolutions) are necessary for accurate semantic segmentation, resulting in slow speed and large memory usage. In this paper, we question this belief and demonstrate that neither high internal resolution nor atrous convolutions are necessary. Our intuition is that although segmentation is a dense per-pixel prediction task, the semantics of each pixel often depend on both nearby neighbors and far-away context; therefore, a more powerful multi-scale feature fusion network plays a critical role. Following this intuition, we revisit the conventional multi-scale feature space (typically capped at P5) and extend it to a much richer space, up to P9, where the smallest features are only 1/512 of the input size and thus have very large receptive fields. To process such a rich feature space, we leverage the recent BiFPN to fuse the multi-scale features. Based on these insights, we develop a simplified segmentation model, named ESeg, which has neither high internal resolution nor expensive atrous convolutions. Perhaps surprisingly, our simple method can achieve better accuracy with faster speed than prior art across multiple datasets. In real-time settings, ESeg-Lite-S achieves 76.0% mIoU on CityScapes [12] at 189 FPS, outperforming FasterSeg [9] (73.1% mIoU at 170 FPS). Our ESeg-Lite-L runs at 79 FPS and achieves 80.1% mIoU, largely closing the gap between real-time and high-performance segmentation models.
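The "1/512 of the input size" figure follows from the usual pyramid convention in which level Pk is downsampled by a stride of 2^k. The sketch below works through that arithmetic. The function name `pyramid_shapes`, the starting level P2, the ceiling rounding, and the 1024x2048 input size are assumptions made for illustration; none of them is stated in the abstract.

```python
def pyramid_shapes(height, width, min_level=2, max_level=9):
    """Return {level: (h, w)} assuming level k has stride 2**k.

    Hypothetical helper for illustration only. min_level=2 and the
    ceiling rounding are assumptions, not details from the abstract.
    """
    shapes = {}
    for level in range(min_level, max_level + 1):
        stride = 2 ** level
        # Ceiling division so inputs that are not multiples of the stride
        # still map to at least one feature cell.
        shapes[level] = (-(-height // stride), -(-width // stride))
    return shapes


# Example input size (assumed): a 1024 x 2048 image.
shapes = pyramid_shapes(1024, 2048)
for level, (h, w) in shapes.items():
    print(f"P{level}: stride {2 ** level:>3}, feature map {h} x {w}")
```

Under these assumptions a conventional P5 map is 32 x 64, while the P9 map shrinks to 2 x 4, so each P9 cell summarizes a very large region of the image.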

What problem does this paper attempt to address?

The paper addresses the dependence of semantic segmentation on high internal resolution and expensive operations (such as dilated, i.e. atrous, convolution), which leads to slow inference and large memory usage. Specifically, it challenges the widely held view that high internal resolution is necessary for accurate semantic segmentation. It proposes instead that expanding the multi-scale feature space and using a powerful multi-scale feature fusion network can deliver efficient and accurate segmentation without relying on high internal resolution or expensive dilated convolution.

### Main contributions of the paper:

1. **Questioning existing assumptions**: The paper questions the necessity of high internal resolution and dilated convolution in semantic segmentation and proposes an alternative.
2. **Expanding the multi-scale feature space**: The paper extends the traditional multi-scale feature space from the usual P5 up to P9, so that the smallest feature map is only 1/512 of the input image size and therefore has a very large receptive field.
3. **Simplifying the model design**: Building on the expanded multi-scale feature space, the paper proposes a simplified semantic segmentation model, ESeg, which has neither high internal resolution nor expensive dilated convolution.
4. **Performance improvement**: Experimental results show that ESeg achieves higher accuracy and faster inference on multiple datasets, and performs especially well in real-time scenarios.

### Key technical points:

- **Multi-scale feature space**: Extended to P9 by adding low-resolution feature maps, which enlarges the receptive field.
- **Bidirectional Feature Pyramid Network (BiFPN)**: Used to fuse the multi-scale features. Compared with the traditional top-down Feature Pyramid Network (FPN), BiFPN performs both top-down and bottom-up feature fusion.
- **Simple and efficient model structure**: A standard encoder-decoder design in which the encoder is EfficientNet, the decoder is BiFPN, and the final pixel-level predictions are produced by a weighted summation.

### Experimental results:

- **CityScapes dataset**: ESeg-Lite-S achieves 76.0% mIoU on the CityScapes validation set at 189 FPS, outperforming the previous real-time model FasterSeg (73.1% mIoU, 170 FPS).
- **ADE20K dataset**: ESeg-L reaches 48.2% mIoU on ADE20K in single-scale evaluation, outperforming most existing CNN models.

### Conclusion:

By expanding the multi-scale feature space and using a powerful feature fusion network, the paper achieves efficient and accurate semantic segmentation without relying on high internal resolution or expensive operations. The method outperforms prior models in both accuracy and speed and performs well in real-time applications, suggesting a new direction for future semantic segmentation research.
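The weighted fusion mentioned in the summary can be illustrated with the "fast normalized fusion" rule that BiFPN is generally described as using: each input feature gets a learnable non-negative weight, and the weights are normalized by their sum. The sketch below is a minimal pure-Python version under stated assumptions. The function name, the flat-list feature representation, and the `eps` value are hypothetical, and the summary does not confirm that ESeg uses exactly this rule.

```python
def fast_normalized_fusion(features, raw_weights, eps=1e-4):
    """Fuse same-shaped features as sum(w_i * x_i) / (eps + sum(w_i)).

    Hypothetical illustration: `features` is a list of equal-length lists
    of floats standing in for feature maps already resized to one level.
    """
    # Clamp weights at zero (ReLU) so each contribution stays non-negative.
    weights = [max(w, 0.0) for w in raw_weights]
    total = sum(weights) + eps  # eps avoids division by zero
    length = len(features[0])
    return [
        sum(w * f[i] for w, f in zip(weights, features)) / total
        for i in range(length)
    ]


# Two toy "feature maps" with equal weights: the result is close to their mean.
fused = fast_normalized_fusion([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0])
print(fused)
```

In a real network the weights would be learned per fusion node and the inputs would be multi-channel tensors; the flat lists here only show the arithmetic.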

Revisiting Multi-Scale Feature Fusion for Semantic Segmentation

Deep Dual-Stream Network with Scale Context Selection Attention Module for Semantic Segmentation

Enhanced Multi-Scale Feature Adaptive Fusion Sparse Convolutional Network for Large-Scale Scenes Semantic Segmentation

ASFNet: Adaptive Multiscale Segmentation Fusion Network for Real‐time Semantic Segmentation

S$^2$-FPN: Scale-ware Strip Attention Guided Feature Pyramid Network for Real-time Semantic Segmentation

BFMNet: Bilateral Feature Fusion Network with Multi-Scale Context Aggregation for Real-Time Semantic Segmentation.

MFEAFN: Multi-scale feature enhanced adaptive fusion network for image semantic segmentation

Semantic Segmentation Based on Spatial Pyramid Pooling and Multilayer Feature Fusion

Real-Time Semantic Segmentation via Multiply Spatial Fusion Network

A Multi-level Feature Fusion Network for Real-time Semantic Segmentation

Based on cross-scale fusion attention mechanism network for semantic segmentation for street scenes

CFFNet: Cross-scale Feature Fusion Network for Real-Time Semantic Segmentation

Enhanced Feature Pyramid Network for Semantic Segmentation.

Context and Boundary Guided Multi-Scale Feature Fusion Network for Semantic Segmentation

SCFI-ESeg: Enhancing Semantic Segmentation with Spatial and Content Feature Integration

MSCFNet: A Lightweight Network with Multi-Scale Context Fusion for Real-Time Semantic Segmentation

Multiscale Fusion Convolutional Network in Real-time Semantic Segmentation

RELAXNet: Residual Efficient Learning and Attention Expected Fusion Network for Real-Time Semantic Segmentation

MFAFNet: A Lightweight and Efficient Network with Multi-Level Feature Adaptive Fusion for Real-Time Semantic Segmentation

Dsmrseg: Dual-Stage Feature Pyramid And Multi-Range Context Aggregation For Real-Time Semantic Segmentation

Multi-Scale Spatial Location Preference For Semantic Segmentation