Dual Super-Resolution Learning for Semantic Segmentation

Li Wang, Dong Li, Yousong Zhu, Lu Tian, Yi Shan
DOI: https://doi.org/10.1109/cvpr42600.2020.00383
2020-06-01
Abstract: Current state-of-the-art semantic segmentation methods often apply high-resolution input to attain high performance, which brings large computation budgets and limits their applications on resource-constrained devices. In this paper, we propose a simple and flexible two-stream framework named Dual Super-Resolution Learning (DSRL) to effectively improve the segmentation accuracy without introducing extra computation costs. Specifically, the proposed method consists of three parts: Semantic Segmentation Super-Resolution (SSSR), Single Image Super-Resolution (SISR) and Feature Affinity (FA) module, which can keep high-resolution representations with low-resolution input while simultaneously reducing the model computation complexity. Moreover, it can be easily generalized to other tasks, e.g., human pose estimation. This simple yet effective method leads to strong representations and is evidenced by promising performance on both semantic segmentation and human pose estimation. Specifically, for semantic segmentation on CityScapes, we can achieve ≥ 2% higher mIoU with similar FLOPs, and keep the performance with 70% FLOPs. For human pose estimation, we can gain ≥ 2% mAP with the same FLOPs and maintain mAP with 30% fewer FLOPs. Code and models are available at https://github.com/wanglixilinx/DSRL.
What problem does this paper attempt to address?
This paper addresses efficient, high-performance semantic segmentation on resource-constrained devices. Current state-of-the-art segmentation methods usually rely on high-resolution inputs to reach high accuracy, which incurs a large computational cost and limits their use on such devices. The paper proposes a simple and flexible two-stream framework, Dual Super-Resolution Learning (DSRL), which aims to improve segmentation accuracy without adding computation cost. The framework has three parts: Semantic Segmentation Super-Resolution (SSSR), Single Image Super-Resolution (SISR), and a Feature Affinity (FA) module. Together they let the model keep high-resolution representations while taking low-resolution input, which reduces computational complexity. The method also generalizes readily to other tasks, such as human pose estimation. In the reported experiments on CityScapes, DSRL achieves at least 2% higher mIoU at similar FLOPs and keeps the same performance with 70% of the FLOPs. For human pose estimation, it gains at least 2% mAP at the same FLOPs and maintains mAP with 30% fewer FLOPs.
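The text above names a Feature Affinity (FA) module but does not define it. As a rough illustration only, the sketch below assumes that feature affinity means comparing pairwise pixel-to-pixel similarity between the features of the segmentation stream and those of the super-resolution stream. The function names, the use of cosine similarity, and the mean absolute difference are assumptions made for this example, not details taken from the abstract; the released code may differ.

```python
import math


def affinity_matrix(features):
    """Pairwise cosine similarity between per-pixel feature vectors.

    `features` is a list of N vectors (one per spatial location),
    each a list of C floats. This layout is an assumption for the sketch.
    """
    normed = []
    for vec in features:
        # Fall back to 1.0 so an all-zero vector does not divide by zero.
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        normed.append([v / norm for v in vec])
    return [[sum(a * b for a, b in zip(p, q)) for q in normed] for p in normed]


def feature_affinity_loss(seg_features, sr_features):
    """Mean absolute difference between the two N x N affinity matrices.

    Hypothetical loss: it is small when both streams agree on which
    pixels are similar to each other, regardless of feature scale.
    """
    if len(seg_features) != len(sr_features):
        raise ValueError("both streams must cover the same spatial locations")
    s_seg = affinity_matrix(seg_features)
    s_sr = affinity_matrix(sr_features)
    n = len(s_seg)
    total = sum(abs(s_seg[i][j] - s_sr[i][j]) for i in range(n) for j in range(n))
    return total / (n * n)


if __name__ == "__main__":
    seg = [[1.0, 0.0], [0.0, 1.0]]  # two pixels with orthogonal features
    sr = [[1.0, 0.0], [1.0, 0.0]]   # two pixels with identical features
    print(feature_affinity_loss(seg, sr))
```

Under these assumptions, a term like this would be used only during training to pull the structure of the segmentation features toward that of the super-resolution features, which is consistent with the claim that accuracy improves without extra computation at inference time.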