Sparse Refinement for Efficient High-Resolution Semantic Segmentation

Zhijian Liu,Zhuoyang Zhang,Samir Khaki,Shang Yang,Haotian Tang,Chenfeng Xu,Kurt Keutzer,Song Han
2024-07-27
Abstract:Semantic segmentation empowers numerous real-world applications, such as autonomous driving and augmented/mixed reality. These applications often operate on high-resolution images (e.g., 8 megapixels) to capture the fine details. However, this comes at the cost of considerable computational complexity, hindering the deployment in latency-sensitive scenarios. In this paper, we introduce SparseRefine, a novel approach that enhances dense low-resolution predictions with sparse high-resolution refinements. Based on coarse low-resolution outputs, SparseRefine first uses an entropy selector to identify a sparse set of pixels with high entropy. It then employs a sparse feature extractor to efficiently generate the refinements for those pixels of interest. Finally, it leverages a gated ensembler to apply these sparse refinements to the initial coarse predictions. SparseRefine can be seamlessly integrated into any existing semantic segmentation model, regardless of CNN- or ViT-based. SparseRefine achieves significant speedup: 1.5 to 3.7 times when applied to HRNet-W48, SegFormer-B5, Mask2Former-T/L and SegNeXt-L on Cityscapes, with negligible to no loss of accuracy. Our "dense+sparse" paradigm paves the way for efficient high-resolution visual computing.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper aims to address the issue of high computational complexity in high-resolution semantic segmentation tasks, particularly in real-time applications such as autonomous driving and augmented reality/mixed reality. Specifically, to capture details, these applications often need to process high-resolution images (e.g., 8 million pixels), which brings significant computational costs and limits deployment in latency-sensitive scenarios. To tackle this challenge, the paper proposes the SparseRefine method, a novel approach that improves efficiency by combining low-resolution dense prediction with high-resolution sparse refinement. The working principle of SparseRefine is as follows: 1. **Low-Resolution Dense Prediction**: First, the input image is downsampled, and an existing semantic segmentation model is used to perform dense prediction on it, resulting in an initial prediction. 2. **Sparse Pixel Selection**: Next, an entropy selector is used to identify a sparse set of pixels with high uncertainty in the initial prediction, which are typically the locations of prediction errors. 3. **Sparse Feature Extraction**: Then, high-resolution feature extraction is performed only on these selected high-entropy pixels to efficiently generate refinement information. 4. **Fusion of Refinement Information**: Finally, a gated fuser is used to apply these sparse refinement information to the initial low-resolution prediction to obtain the final prediction result. This method can be seamlessly integrated into any existing semantic segmentation model, whether based on Convolutional Neural Networks (CNN) or Vision Transformers (ViT). Experiments show that SparseRefine significantly reduces computational load and inference time while maintaining or slightly improving accuracy, with speedup factors ranging from 1.5 to 3.7 times. Additionally, the method demonstrates general effectiveness across various datasets, including Pascal VOC, BDD100K, Deepglobe, and ISIC. In summary, SparseRefine provides an effective "dense + sparse" paradigm that can significantly reduce the cost of high-resolution visual computation while maintaining or slightly improving accuracy.