Abstract: Panoptic segmentation is a fundamental task in computer vision and a crucial component for perception in autonomous vehicles. Recent mask-transformer-based methods achieve impressive performance on standard benchmarks but face significant challenges with small objects, crowded scenes and scenes exhibiting a wide range of object scales. We identify several fundamental shortcomings of the current approaches: (i) the query proposal generation process is biased towards larger objects, resulting in missed smaller objects, (ii) initially well-localized queries may drift to other objects, resulting in missed detections, (iii) spatially well-separated instances may be merged into a single mask causing inconsistent and false scene interpretations. To address these issues, we rethink the individual components of the network and its supervision, and propose a novel method for panoptic segmentation PanSR. PanSR effectively mitigates instance merging, enhances small-object detection and increases performance in crowded scenes, delivering a notable +3.4 PQ improvement over state-of-the-art on the challenging LaRS benchmark, while reaching state-of-the-art performance on Cityscapes. The code and models will be publicly available at https://github.com/lojzezust/PanSR.
### What problems does this paper attempt to solve?
The paper "PanSR: An Object-Centric Mask Transformer for Panoptic Segmentation" addresses several key challenges in panoptic segmentation. Specifically, the authors identify fundamental deficiencies in current mask-transformer-based methods when dealing with small objects, crowded scenes, and scenes with a wide range of object scales:
1. **Query Proposal Generation Biased Towards Large Objects**:
- Current methods are biased towards larger objects when generating query proposals, causing smaller objects to be ignored.
2. **Query Drift Problem**:
- Queries that are initially well-localized may drift to other objects, resulting in missed detections.
3. **Instance Merging Problem**:
- Instances that are well-separated in space may be merged into a single mask, leading to inconsistent and incorrect scene interpretations.
To solve these problems, the authors propose a new method, PanSR (an Object-Centric Mask Transformer for Panoptic Segmentation). By redesigning the individual components of the network and their supervision, PanSR mitigates instance merging, enhances small-object detection, and improves performance in crowded scenes.
Specific improvements include:
- **Object-Centric Proposal (OCP) Module**: Elevates proposal extraction from the pixel level to the object level to better capture objects of different scales.
- **Proposal-Aware Matching Scheme**: Prevents proposals from being matched with the wrong ground-truth instances and allows multiple proposals to be matched to a single ground-truth instance, which reduces competition among proposals for the same object.
- **Object-Centric Mask Prediction**: Restricts mask prediction to the predicted bounding box, so the network does not have to learn global instance-separation features.
- **Mask-Conditioned Queries**: During training, queries are sampled from random positions inside the object area to improve robustness to noise in the proposal extraction process.
These improvements enable PanSR to achieve a +3.4 PQ improvement over existing state-of-the-art methods on the LaRS benchmark and to reach state-of-the-art performance on the Cityscapes dataset.
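The mask-conditioned query idea from the list above can be illustrated with a minimal sketch. It only shows the sampling step: picking random positions inside a ground-truth object mask during training. The function name `sample_query_positions`, the list-of-lists mask layout, and sampling with replacement are assumptions made for illustration, not details taken from the paper.

```python
import random


def sample_query_positions(mask, num_samples, rng):
    """Pick random (row, col) positions inside a binary object mask.

    Hypothetical helper: `mask` is a list of rows holding 0/1 values.
    """
    # Collect every pixel that belongs to the object.
    inside = [
        (r, c)
        for r, row in enumerate(mask)
        for c, value in enumerate(row)
        if value
    ]
    if not inside:
        return []  # empty mask: nothing to sample from
    # Sample with replacement so that small objects can still yield
    # several query positions (an assumption of this sketch).
    return [rng.choice(inside) for _ in range(num_samples)]


gt_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
]
rng = random.Random(0)  # seeded for reproducibility
positions = sample_query_positions(gt_mask, num_samples=5, rng=rng)
```

Every sampled position lies on the object, so a query initialized there starts from a valid but not necessarily central location, which is the kind of noise the summary says the training should become robust to.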
### Formula Presentation
The key formulas from the paper are reproduced below in LaTeX notation:
1. **Query Iterative Update Formula**:
\[
Q^{t + 1}_f, Q^{t + 1}_{box} = D_t(Q^t_f, Q^t_{box}, P_s)
\]
where \( t \in \{1, \dots, L\} \) is the iteration index and \( L \) is the number of iterations.
2. **Prediction Head Output Formulas**:
\[
M_t = M(P_4, Q^t_f, Q^t_{box})
\]
\[
y_t = C(Q^t_f)
\]
\[
b_t = Q^t_{box}
\]
3. **Content Query Extraction Formula**:
\[
Q^0_f(i)=\sum_j m_i(j)S_{obj}(j)\cdot P_s(j)
\]
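As a hedged illustration of the content query formula above, the sketch below computes the weighted sum for a single proposal \(i\) over a flattened list of pixels. It assumes that \(m_i(j)\) is the proposal's per-pixel mask weight, \(S_{obj}(j)\) a per-pixel objectness score, and \(P_s(j)\) the feature vector at pixel \(j\); the summary does not define these symbols, and the function name `content_query` is hypothetical.

```python
def content_query(mask, objectness, features):
    """Compute sum_j m_i(j) * S_obj(j) * P_s(j) for one proposal.

    mask:       per-pixel weights m_i(j) (assumed meaning)
    objectness: per-pixel scores S_obj(j) (assumed meaning)
    features:   per-pixel feature vectors P_s(j) (assumed meaning)
    """
    dim = len(features[0])
    query = [0.0] * dim
    for m, s, feat in zip(mask, objectness, features):
        weight = m * s  # pixels outside the mask contribute nothing
        for k in range(dim):
            query[k] += weight * feat[k]
    return query


# Three pixels with 2-dimensional features; the third pixel is outside the mask.
mask = [1, 1, 0]
objectness = [0.5, 1.0, 0.9]
features = [[2.0, 0.0], [1.0, 3.0], [10.0, 10.0]]
q0 = content_query(mask, objectness, features)
```

Here the third pixel has a large feature vector but zero mask weight, so it does not influence the resulting query.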
4. **Mask Prediction Formula**:
\[
M_t(x)=
\begin{cases}
\sigma(P_4(x)\cdot f(Q^t_f)) & \text{if } x\in\phi(Q^t_{box},\epsilon_w,\epsilon_h)\\
0 & \text{otherwise}
\end{cases}
\]
where \(\sigma\) is the sigmoid activation function, \(f\) is a linear projection layer, and \(\phi(\cdot,\epsilon_w,\epsilon_h)\) is a dilation function.
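The box-restricted mask prediction above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: the box is taken as `(x0, y0, x1, y1)` in pixel coordinates, the dilation function is assumed to grow the box by fractions \(\epsilon_w\) and \(\epsilon_h\) of its width and height, and `query_embedding` stands for the already projected query \(f(Q_f)\). The names `dilate_box` and `predict_mask` are hypothetical.

```python
import math


def dilate_box(box, eps_w, eps_h):
    """Assumed form of phi: grow the box by a fraction of its size."""
    x0, y0, x1, y1 = box
    dw = eps_w * (x1 - x0)
    dh = eps_h * (y1 - y0)
    return (x0 - dw, y0 - dh, x1 + dw, y1 + dh)


def predict_mask(pixel_features, query_embedding, box, eps_w, eps_h):
    """Return sigmoid(P_4(x) . f(Q_f)) inside the dilated box, 0 elsewhere."""
    x0, y0, x1, y1 = dilate_box(box, eps_w, eps_h)
    output = []
    for y, row in enumerate(pixel_features):
        out_row = []
        for x, feat in enumerate(row):
            if x0 <= x <= x1 and y0 <= y <= y1:
                # Dot product between the pixel feature and the query embedding.
                logit = sum(a * b for a, b in zip(feat, query_embedding))
                out_row.append(1.0 / (1.0 + math.exp(-logit)))
            else:
                out_row.append(0.0)  # outside the box: mask is forced to zero
        output.append(out_row)
    return output


# A 2x4 feature map with 1-dimensional features; a zero embedding gives logit 0.
feats = [[[1.0]] * 4, [[1.0]] * 4]
mask_pred = predict_mask(feats, [0.0], box=(1, 0, 2, 0), eps_w=0.0, eps_h=0.0)
```

Because pixels outside the dilated box are set to zero regardless of their features, a query cannot claim a distant, similar-looking instance, which matches the summary's point about avoiding globally learned instance separation.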
5. **Loss Function Formula**:
\[
L = \sum_s L^s