Revisiting Multi-Scale Feature Fusion for Semantic Segmentation

Tianjian Meng, Golnaz Ghiasi, Reza Mahjourian, Quoc V. Le, Mingxing Tan
DOI: https://doi.org/10.48550/arXiv.2203.12683
2022-06-15
Abstract: It is commonly believed that high internal resolution combined with expensive operations (e.g. atrous convolutions) are necessary for accurate semantic segmentation, resulting in slow speed and large memory usage. In this paper, we question this belief and demonstrate that neither high internal resolution nor atrous convolutions are necessary. Our intuition is that although segmentation is a dense per-pixel prediction task, the semantics of each pixel often depend on both nearby neighbors and far-away context; therefore, a more powerful multi-scale feature fusion network plays a critical role. Following this intuition, we revisit the conventional multi-scale feature space (typically capped at P5) and extend it to a much richer space, up to P9, where the smallest features are only 1/512 of the input size and thus have very large receptive fields. To process such a rich feature space, we leverage the recent BiFPN to fuse the multi-scale features. Based on these insights, we develop a simplified segmentation model, named ESeg, which has neither high internal resolution nor expensive atrous convolutions. Perhaps surprisingly, our simple method can achieve better accuracy with faster speed than prior art across multiple datasets. In real-time settings, ESeg-Lite-S achieves 76.0% mIoU on CityScapes [12] at 189 FPS, outperforming FasterSeg [9] (73.1% mIoU at 170 FPS). Our ESeg-Lite-L runs at 79 FPS and achieves 80.1% mIoU, largely closing the gap between real-time and high-performance segmentation models.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the reliance of semantic segmentation models on high internal resolution and expensive operations such as dilated (atrous) convolution, which makes them slow and memory-hungry. It challenges the widely held view that high internal resolution is necessary for accurate segmentation, and argues that expanding the multi-scale feature space and fusing it with a powerful multi-scale feature fusion network yields efficient and accurate segmentation without either.

### Main contributions of the paper:

1. **Questioning existing assumptions**: The paper questions whether high internal resolution and dilated convolution are necessary for semantic segmentation.
2. **Expanding the multi-scale feature space**: The paper extends the multi-scale feature space from the usual cap of P5 up to P9, so that the smallest feature map is only 1/512 of the input image size and therefore has a very large receptive field.
3. **Simplifying the model design**: Building on this expanded feature space, the paper proposes a simplified semantic segmentation model, ESeg, which has neither high internal resolution nor expensive dilated convolution.
4. **Performance improvement**: Experimental results show that ESeg achieves higher accuracy and faster inference than prior work on multiple datasets, particularly in real-time settings.

### Key technical points:

- **Multi-scale feature space**: Extended up to P9 by adding low-resolution feature maps, which enlarges the receptive field.
- **Bidirectional Feature Pyramid Network (BiFPN)**: Used to fuse the multi-scale features. Unlike the traditional Feature Pyramid Network (FPN), which fuses features top-down only, BiFPN fuses them both top-down and bottom-up.
- **Simple and efficient model structure**: A standard encoder-decoder design in which the encoder is EfficientNet, the decoder is BiFPN, and the final pixel-level predictions are generated through weighted summation.

### Experimental results:

- **CityScapes dataset**: ESeg-Lite-S achieves 76.0% mIoU on the CityScapes validation set at 189 FPS, outperforming the previous real-time model FasterSeg (73.1% mIoU at 170 FPS).
- **ADE20K dataset**: ESeg-L reaches 48.2% mIoU on ADE20K in single-scale evaluation, outperforming most existing CNN models.

### Conclusion:

By expanding the multi-scale feature space and using a powerful feature fusion network, the paper achieves efficient and accurate semantic segmentation without relying on high internal resolution or expensive operations. The method is both more accurate and faster than the prior models it is compared against, including in real-time applications, and points to a new direction for future semantic segmentation research.
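
The pyramid arithmetic and the weighted fusion mentioned in the key technical points can be illustrated with a small sketch. It is not the authors' implementation. The function names `pyramid_shapes` and `fast_normalized_fusion` are hypothetical, the starting level `min_level=2` and the ceil rounding are assumptions, and the fusion rule is the normalized weighted sum that the EfficientDet paper describes for BiFPN; whether ESeg uses exactly this form is assumed here, not stated in the text above.

```python
import math


def pyramid_shapes(height, width, min_level=2, max_level=9):
    """Return {level: (h, w)}, where level l is downsampled by a stride of 2**l.

    min_level=2 is an illustrative assumption; the source only states that the
    feature space is extended from the usual cap of P5 up to P9.
    """
    shapes = {}
    for level in range(min_level, max_level + 1):
        stride = 2 ** level
        # Ceil rounding assumes "same"-padded stride-2 downsampling.
        shapes[level] = (math.ceil(height / stride), math.ceil(width / stride))
    return shapes


def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse equal-length feature vectors as sum_i(w_i * x_i) / (eps + sum_j w_j).

    `features` is a list of flat lists of floats that were already resized to
    a common resolution; `weights` holds one learnable scalar per input.
    """
    # ReLU keeps each weight non-negative, so normalized weights lie in [0, 1].
    w = [max(0.0, wi) for wi in weights]
    total = eps + sum(w)
    return [
        sum(wi * x[k] for wi, x in zip(w, features)) / total
        for k in range(len(features[0]))
    ]


if __name__ == "__main__":
    # A 1024 x 2048 input (the Cityscapes image size) shrinks to 2 x 4 at P9.
    for level, shape in pyramid_shapes(1024, 2048).items():
        print(f"P{level}: {shape}")
```

For a 1024 x 2048 input, P5 is 32 x 64 while P9 is only 2 x 4, so each P9 cell summarizes a 512 x 512 patch of the image. That is what lets the low-resolution levels provide far-away context at little cost.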