Abstract: Open-vocabulary panoptic reconstruction offers comprehensive scene understanding, enabling advances in embodied robotics and photorealistic simulation. In this paper, we propose PanopticRecon++, an end-to-end method that formulates panoptic reconstruction through a novel cross-attention perspective. This perspective models the relationship between 3D instances (as queries) and the scene's 3D embedding field (as keys) through their attention map. Unlike existing methods that separate the optimization of queries and keys or overlook spatial proximity, PanopticRecon++ introduces learnable 3D Gaussians as instance queries. This formulation injects 3D spatial priors to preserve proximity while maintaining end-to-end optimizability. Moreover, this query formulation facilitates the alignment of 2D open-vocabulary instance IDs across frames by leveraging optimal linear assignment with instance masks rendered from the queries. Additionally, we ensure semantic-instance segmentation consistency by fusing query-based instance segmentation probabilities with semantic probabilities in a novel panoptic head supervised by a panoptic loss. During training, the number of instance query tokens dynamically adapts to match the number of objects. PanopticRecon++ shows competitive 3D and 2D segmentation and reconstruction performance on both simulated and real-world datasets, and demonstrates a use case as a robot simulator. Our project website is at: https://yuxuan1206.github.io/panopticrecon_pp/

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses three key problems in open-vocabulary panoptic reconstruction:

1. **Misalignment**: Across a data sequence, the 2D instance IDs predicted in different frames are not aligned, which makes it hard to keep instance identities consistent in multi-frame scenes.
2. **Ambiguity**: Because of the limited field of view (FoV), it is impossible to tell whether objects that never co-appear in the same image belong to the same instance. This uncertainty makes cross-frame instance matching difficult.
3. **Inconsistency**: Existing methods usually model semantic and instance labels in two independent branches, so the semantic and instance segmentation results can disagree. These methods also lack unified panoptic supervision, which hurts segmentation accuracy.

To solve these problems, the paper proposes PanopticRecon++, an end-to-end open-vocabulary panoptic reconstruction method based on the cross-attention mechanism. By introducing learnable 3D Gaussians as instance queries, PanopticRecon++ injects a 3D spatial prior that preserves spatial proximity while remaining end-to-end optimizable. The method also aligns 2D instance IDs across frames through optimal linear assignment on instance masks rendered from the queries, and it keeps semantic and instance segmentation consistent through a parameter-free panoptic head supervised by a panoptic loss.

### Formula representation

1. **Instance feature aggregation**:
\[ F_i=\sum_j A_{ij} V_j=\sum_j\frac{\exp(Q_i^T K_j)}{\sum_{j'}\exp(Q_i^T K_{j'})} V_j \]
where \( A_{ij} \) is the attention weight, i.e. the normalized similarity between query \( Q_i \) and key \( K_j \), and \( V_j \) is the semantic feature of point \( j \).

2. **Attention map combining feature similarity and spatial prior**:
\[ A_{ij}=\frac{\exp(S_{ij} G_{ij})}{\sum_{j'}\exp(S_{ij'} G_{ij'})} \]
where, for query \( i \) with feature \( f_q \) and position \( p_q \) and key \( j \) with feature \( f_k \) and position \( p_k \), the feature similarity is \( S_{ij}=S(f_q, f_k)=\sigma(f_q^T f_k) \) and the spatial prior is \( G_{ij}=G(p_q, p_k)=\frac{P(p_k|p_q,\Sigma)}{P(p_q|p_q,\Sigma)} \).

3. **3D Intersection over Min (IoM)**:
\[ \text{IoM}_{ij}=\frac{|M_i\cap M_j|}{|M_i|} \]
which is used to identify duplicate instance labels.

Through these improvements, PanopticRecon++ provides a more robust and consistent open-vocabulary panoptic reconstruction method and performs well on both simulated and real-world datasets.
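To make the second and third formulas concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that \( \sigma \) is the logistic sigmoid, that each query Gaussian has a diagonal covariance (given as per-axis variances), and that instance masks are stored as sets of 3D point indices. All function names (`spatial_prior`, `attention_row`, `iom`) are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(x):
    """Logistic sigmoid; assumed here to be the sigma in S(f_q, f_k)."""
    return 1.0 / (1.0 + math.exp(-x))


def spatial_prior(p_k, p_q, var):
    """G(p_q, p_k) = P(p_k | p_q, Sigma) / P(p_q | p_q, Sigma).

    Assumes a diagonal covariance with per-axis variances `var`.
    The Gaussian normalisation constants cancel in the ratio, so only
    exp(-0.5 * squared Mahalanobis distance) remains, a value in (0, 1].
    """
    d2 = sum((a - b) ** 2 / v for a, b, v in zip(p_k, p_q, var))
    return math.exp(-0.5 * d2)


def attention_row(f_q, p_q, var, key_feats, key_pos):
    """Attention weights of one instance query over all key points.

    Each logit is S * G, and the weights are a softmax over the keys,
    following the attention-map formula as stated in the summary.
    """
    logits = []
    for f_k, p_k in zip(key_feats, key_pos):
        s = sigmoid(sum(a * b for a, b in zip(f_q, f_k)))
        logits.append(s * spatial_prior(p_k, p_q, var))
    z = sum(math.exp(v) for v in logits)
    return [math.exp(v) / z for v in logits]


def iom(mask_i, mask_j):
    """Overlap ratio |M_i & M_j| / |M_i| for masks given as index sets."""
    if not mask_i:
        return 0.0
    return len(mask_i & mask_j) / len(mask_i)


# Two keys with identical features: one at the query centre, one far away.
weights = attention_row(
    f_q=[1.0, 0.0],
    p_q=[0.0, 0.0, 0.0],
    var=[1.0, 1.0, 1.0],
    key_feats=[[1.0, 0.0], [1.0, 0.0]],
    key_pos=[[0.0, 0.0, 0.0], [5.0, 0.0, 0.0]],
)
print(weights)
```

In this toy example the two keys differ only in position, so the larger weight of the first key comes entirely from the spatial prior. That is the effect the Gaussian queries are meant to add on top of feature similarity.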

Leverage Cross-Attention for End-to-End Open-Vocabulary Panoptic Reconstruction

PanopticRecon: Leverage Open-vocabulary Instance Segmentation for Zero-shot Panoptic Reconstruction

PanoRecon: Real-Time Panoptic 3D Reconstruction from Monocular Video

Can We PASS Beyond the Field of View? Panoramic Annular Semantic Segmentation for Real-World Surrounding Perception

EPRecon: An Efficient Framework for Real-Time Panoptic 3D Reconstruction from Monocular Video

PanoOcc: Unified Occupancy Representation for Camera-based 3D Panoptic Segmentation

PanoSSC: Exploring Monocular Panoptic 3D Scene Reconstruction for Autonomous Driving

PVO: Panoptic Visual Odometry.

DeepPanoContext: Panoramic 3D Scene Understanding with Holistic Scene Context Graph and Relation-based Optimization

Panoptic-PartFormer: Learning a Unified Model for Panoptic Part Segmentation

Panoptic Lifting for 3D Scene Understanding with Neural Fields

Towards Panoptic 3D Parsing for Single Image in the Wild

BUOL: A Bottom-Up Framework with Occupancy-aware Lifting for Panoptic 3D Scene Reconstruction From A Single Image

PanopticNeRF-360: Panoramic 3D-to-2D Label Transfer in Urban Scenes

Panoptic-FlashOcc: An Efficient Baseline to Marry Semantic Occupancy with Panoptic via Instance Center

3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language Distillation

Open Panoramic Segmentation

Panoptic 3D Scene Reconstruction From a Single RGB Image

OpenOcc: Open Vocabulary 3D Scene Reconstruction via Occupancy Representation

Panoptic-PolarNet: Proposal-free LiDAR Point Cloud Panoptic Segmentation