Abstract:Semantic segmentation empowers numerous real-world applications, such as autonomous driving and augmented/mixed reality. These applications often operate on high-resolution images (e.g., 8 megapixels) to capture the fine details. However, this comes at the cost of considerable computational complexity, hindering the deployment in latency-sensitive scenarios. In this paper, we introduce SparseRefine, a novel approach that enhances dense low-resolution predictions with sparse high-resolution refinements. Based on coarse low-resolution outputs, SparseRefine first uses an entropy selector to identify a sparse set of pixels with high entropy. It then employs a sparse feature extractor to efficiently generate the refinements for those pixels of interest. Finally, it leverages a gated ensembler to apply these sparse refinements to the initial coarse predictions. SparseRefine can be seamlessly integrated into any existing semantic segmentation model, regardless of CNN- or ViT-based. SparseRefine achieves significant speedup: 1.5 to 3.7 times when applied to HRNet-W48, SegFormer-B5, Mask2Former-T/L and SegNeXt-L on Cityscapes, with negligible to no loss of accuracy. Our "dense+sparse" paradigm paves the way for efficient high-resolution visual computing.

What problem does this paper attempt to address?

The paper aims to address the issue of high computational complexity in high-resolution semantic segmentation tasks, particularly in real-time applications such as autonomous driving and augmented reality/mixed reality. Specifically, to capture details, these applications often need to process high-resolution images (e.g., 8 million pixels), which brings significant computational costs and limits deployment in latency-sensitive scenarios. To tackle this challenge, the paper proposes the SparseRefine method, a novel approach that improves efficiency by combining low-resolution dense prediction with high-resolution sparse refinement. The working principle of SparseRefine is as follows: 1. **Low-Resolution Dense Prediction**: First, the input image is downsampled, and an existing semantic segmentation model is used to perform dense prediction on it, resulting in an initial prediction. 2. **Sparse Pixel Selection**: Next, an entropy selector is used to identify a sparse set of pixels with high uncertainty in the initial prediction, which are typically the locations of prediction errors. 3. **Sparse Feature Extraction**: Then, high-resolution feature extraction is performed only on these selected high-entropy pixels to efficiently generate refinement information. 4. **Fusion of Refinement Information**: Finally, a gated fuser is used to apply these sparse refinement information to the initial low-resolution prediction to obtain the final prediction result. This method can be seamlessly integrated into any existing semantic segmentation model, whether based on Convolutional Neural Networks (CNN) or Vision Transformers (ViT). Experiments show that SparseRefine significantly reduces computational load and inference time while maintaining or slightly improving accuracy, with speedup factors ranging from 1.5 to 3.7 times. Additionally, the method demonstrates general effectiveness across various datasets, including Pascal VOC, BDD100K, Deepglobe, and ISIC. In summary, SparseRefine provides an effective "dense + sparse" paradigm that can significantly reduce the cost of high-resolution visual computation while maintaining or slightly improving accuracy.

Sparse Refinement for Efficient High-Resolution Semantic Segmentation

A Scalable Real-time Semantic Segmentation Network for Autonomous Driving

RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation

A Deep Semantic Segmentation Network with Semantic and Contextual Refinements

STRNet: Semantic Segmentation Based on Small Target Refinement at Large Scale

Light-Weight RefineNet for Real-Time Semantic Segmentation

Enhanced Multi-Scale Feature Adaptive Fusion Sparse Convolutional Network for Large-Scale Scenes Semantic Segmentation

RFNet: A Refinement Network for Semantic Segmentation

A feature refinement module for light-weight semantic segmentation network

Real-time Semantic Image Segmentation Via Spatial Sparsity.

Advancing high-resolution remote sensing: a compact and powerful approach to semantic segmentation

Refinement Co‐supervision Network for Real‐time Semantic Segmentation

Object-Enhanced Semantic Segmentation Model for High-Resolution Remote Sensing Images

Exploring Fine-Grained Sparsity in Convolutional Neural Networks for Efficient Inference

HoloSeg: an Efficient Holographic Segmentation Network for Real-time Scene Parsing

RefineNet: Multi-Path Refinement Networks for Dense Prediction

DWRSeg: Rethinking Efficient Acquisition of Multi-scale Contextual Information for Real-time Semantic Segmentation

A Lightweight Network for Fast Semantic Segmentation.

SegRefiner: Towards Model-Agnostic Segmentation Refinement with Discrete Diffusion Process

RefineMask: Towards High-Quality Instance Segmentation with Fine-Grained Features

EFRNet: A Lightweight Network with Efficient Feature Fusion and Refinement for Real-Time Semantic Segmentation