Abstract: Panoptic segmentation is a fundamental task in computer vision and a crucial component for perception in autonomous vehicles. Recent mask-transformer-based methods achieve impressive performance on standard benchmarks but face significant challenges with small objects, crowded scenes and scenes exhibiting a wide range of object scales. We identify several fundamental shortcomings of the current approaches: (i) the query proposal generation process is biased towards larger objects, resulting in missed smaller objects, (ii) initially well-localized queries may drift to other objects, resulting in missed detections, (iii) spatially well-separated instances may be merged into a single mask, causing inconsistent and false scene interpretations. To address these issues, we rethink the individual components of the network and its supervision, and propose a novel method for panoptic segmentation, PanSR. PanSR effectively mitigates instance merging, enhances small-object detection and increases performance in crowded scenes, delivering a notable +3.4 PQ improvement over state-of-the-art on the challenging LaRS benchmark, while reaching state-of-the-art performance on Cityscapes. The code and models will be publicly available at https://github.com/lojzezust/PanSR.

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

The paper "PanSR: An Object-Centric Mask Transformer for Panoptic Segmentation" targets several key challenges in panoptic segmentation. Specifically, the authors point out fundamental deficiencies of current methods when dealing with small objects, crowded scenes, and scenes with a wide range of object scales:

1. **Query Proposal Generation Biased Towards Large Objects**: Current methods favor larger objects when generating query proposals, so smaller objects are missed.
2. **Query Drift Problem**: Queries that are initially well localized may drift to other objects, resulting in missed or false detections.
3. **Instance Merging Problem**: Instances that are well separated in space may be merged into a single mask, leading to inconsistent and incorrect scene interpretations.

To solve these problems, the authors propose a new method, PanSR (Object-Centric Mask Transformer for Panoptic Segmentation). By redesigning individual components of the network and their supervision, PanSR alleviates instance merging, enhances small-object detection, and improves performance in crowded scenes. Specific improvements include:

- **Introducing the Object-Centric Proposal (OCP) Module**: Elevating proposal extraction from the pixel level to the object level to better capture objects of different scales.
- **Introducing a New Proposal-Aware Matching Scheme**: Preventing proposals from being matched with the wrong ground-truth instances and allowing multiple proposals to be matched to a single ground-truth instance, which reduces competition among proposals for the same object.
- **Introducing Object-Centric Mask Prediction**: Restricting mask prediction to the predicted bounding boxes, which avoids the need to learn global instance-separation features.
- **Introducing Mask-Conditioned Queries**: During training, sampling queries from random positions in the object area to improve robustness to noise in the proposal extraction process.

These improvements enable PanSR to achieve a +3.4 PQ improvement over the existing state-of-the-art methods on the LaRS benchmark and to reach state-of-the-art performance on the Cityscapes dataset.

### Formula Presentation

The formulas involved in the paper are presented in Markdown format as follows:

1. **Query Iterative Update Formula**:

   \[ Q^{t+1}_f, Q^{t+1}_{box} = D_t(Q^t_f, Q^t_{box}, P_s) \]

   where \( t \in [1..L] \) indexes the iterations.

2. **Prediction Head Output Formulas**:

   \[ M_t = M(P_4, Q^t_f, Q^t_{box}) \]

   \[ y_t = C(Q^t_f) \]

   \[ b_t = Q^t_{box} \]

3. **Content Query Extraction Formula**:

   \[ Q^0_f(i) = \sum_j m_i(j) \, S_{obj}(j) \cdot P_s(j) \]

4. **Mask Prediction Formula**:

   \[ M_t(x) = \begin{cases} \sigma(P_4(x) \cdot f(Q_f)) & \text{if } x \in \phi(Q_{box}, \epsilon_w, \epsilon_h) \\ 0 & \text{otherwise} \end{cases} \]

   where \(\sigma\) is the sigmoid activation function, \(f\) is a linear projection layer, and \(\phi(\cdot, \epsilon_w, \epsilon_h)\) is a dilation function.

5. **Loss Function Formula**:

   \[ L = \sum_s L^s \dots \]
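
The content query extraction and mask prediction formulas listed above can be illustrated with a small NumPy sketch. It is not the authors' implementation. The box format `(x0, y0, x1, y1)` in pixel coordinates, the relative form of the dilation in `dilate_box`, the matrix `proj` standing in for the linear layer \(f\), and all function names are assumptions made for this example only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dilate_box(box, eps_w, eps_h):
    """Hypothetical phi: grow the box on each side by a fraction of its width/height."""
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    return (x0 - eps_w * w, y0 - eps_h * h, x1 + eps_w * w, y1 + eps_h * h)

def content_query(m_i, s_obj, p_s):
    """Q^0_f(i) = sum_j m_i(j) * S_obj(j) * P_s(j).

    m_i, s_obj: (H, W) proposal mask and objectness map; p_s: (H, W, C) features.
    Returns a (C,) feature vector.
    """
    weights = m_i * s_obj
    return np.tensordot(weights, p_s, axes=([0, 1], [0, 1]))

def box_restricted_mask(p4, q_f, q_box, proj, eps_w=0.1, eps_h=0.1):
    """M(x) = sigmoid(P_4(x) . f(Q_f)) inside the dilated box, 0 elsewhere.

    p4: (H, W, C) pixel features; q_f: (D,) query; proj: (D, C) assumed stand-in for f.
    """
    height, width, _ = p4.shape
    logits = p4 @ (q_f @ proj)                      # (H, W) dot product per pixel
    x0, y0, x1, y1 = dilate_box(q_box, eps_w, eps_h)
    ys, xs = np.mgrid[0:height, 0:width]
    inside = (xs >= x0) & (xs <= x1) & (ys >= y0) & (ys <= y1)
    return np.where(inside, sigmoid(logits), 0.0)

# Toy usage: constant features, so every pixel inside the box gets the same score.
p4 = np.ones((8, 8, 4))
mask = box_restricted_mask(p4, np.ones(4), (2, 2, 4, 4), np.eye(4), eps_w=0.5, eps_h=0.5)
```

Because pixels outside the dilated box are set to zero, a query cannot claim a distant instance in this sketch, however similar its features are. This is the intuition behind restricting mask prediction by the predicted bounding boxes.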

PanSR: An Object-Centric Mask Transformer for Panoptic Segmentation
