Leverage Cross-Attention for End-to-End Open-Vocabulary Panoptic Reconstruction

Xuan Yu, Yuxuan Xie, Yili Liu, Haojian Lu, Rong Xiong, Yiyi Liao, Yue Wang
2025-01-02
Abstract: Open-vocabulary panoptic reconstruction offers comprehensive scene understanding, enabling advances in embodied robotics and photorealistic simulation. In this paper, we propose PanopticRecon++, an end-to-end method that formulates panoptic reconstruction through a novel cross-attention perspective. This perspective models the relationship between 3D instances (as queries) and the scene's 3D embedding field (as keys) through their attention map. Unlike existing methods that separate the optimization of queries and keys or overlook spatial proximity, PanopticRecon++ introduces learnable 3D Gaussians as instance queries. This formulation injects 3D spatial priors to preserve proximity while maintaining end-to-end optimizability. Moreover, this query formulation facilitates the alignment of 2D open-vocabulary instance IDs across frames by leveraging optimal linear assignment with instance masks rendered from the queries. Additionally, we ensure semantic-instance segmentation consistency by fusing query-based instance segmentation probabilities with semantic probabilities in a novel panoptic head supervised by a panoptic loss. During training, the number of instance query tokens dynamically adapts to match the number of objects. PanopticRecon++ shows competitive 3D and 2D segmentation and reconstruction performance on both simulation and real-world datasets, and demonstrates a use case as a robot simulator. Our project website is at: https://yuxuan1206.github.io/panopticrecon_pp/
Computer Vision and Pattern Recognition, Robotics
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses three key problems in open-vocabulary panoptic reconstruction:

1. **Misalignment**: The 2D instance IDs are not aligned across the frames of a data sequence, which makes it difficult to maintain consistent instance identities in multi-frame scenes.
2. **Ambiguity**: Because of the limited field of view (FoV), it is impossible to determine whether objects that never co-appear in the same image belong to the same instance. This uncertainty makes cross-frame instance matching difficult.
3. **Inconsistency**: Existing methods usually model semantic and instance labels with two independent branches, so the semantic and instance segmentation results can disagree. These methods also lack unified panoptic supervision, which hurts segmentation accuracy.

To solve these problems, the paper proposes PanopticRecon++, an end-to-end open-vocabulary panoptic reconstruction method based on the cross-attention mechanism. It introduces learnable 3D Gaussians as instance queries, which inject a 3D spatial prior so that spatial proximity is preserved while the model remains end-to-end optimizable. The method also aligns 2D instance IDs across frames through optimal linear assignment with instance masks rendered from the queries, and keeps semantic and instance segmentation consistent through a parameter-free panoptic head.

### Formula representation

1. **Instance class feature aggregation**:
   \[ F_i=\sum_j A_{ij} V_j=\sum_j\frac{\exp(Q_i^T K_j)}{\sum_{j'}\exp(Q_i^T K_{j'})} V_j \]
   where \( A_{ij} \) is the attention weight between query \( Q_i \) and key \( K_j \) (their softmax-normalized similarity), and \( V_j \) is the semantic feature of point \( j \).
2. **Attention map combining feature similarity and spatial prior**:
   \[ A_{ij}=\frac{\exp(S(f_q, f_k)\, G(p_q, p_k))}{\sum_{k'}\exp(S(f_q, f_{k'})\, G(p_q, p_{k'}))} \]
   where \( f_q, p_q \) are the feature and position of query \( i \), \( f_k, p_k \) are those of key \( j \), and the sum in the denominator runs over all keys \( k' \). The feature similarity is \( S(f_q, f_k)=\sigma(f_q^T f_k) \) and the spatial prior is \( G(p_q, p_k)=\frac{P(p_k|p_q,\Sigma)}{P(p_q|p_q,\Sigma)} \), that is, the Gaussian density of the query evaluated at the key position and normalized by its peak value.
3. **3D Intersection over Min (IoM)**:
   \[ \text{IoM}_{ij}=\frac{|M_i\cap M_j|}{|M_i|} \]
   where \( M_i \) and \( M_j \) are the 3D masks of instances \( i \) and \( j \). It is used to identify duplicate instance labels.

Through these improvements, PanopticRecon++ provides a more robust, consistent, and efficient open-vocabulary panoptic reconstruction method and shows competitive performance on both simulated and real-world datasets.
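
The formulas above can be made concrete with a small sketch. The pure-Python code below is an illustration only, not the authors' implementation. It evaluates the attention weights of one query from the feature similarity \( S \) and spatial prior \( G \), aggregates value features with those weights, and computes the IoM ratio exactly as written in the summary. Several things are assumptions: \( \sigma \) is taken to be the logistic sigmoid, \( \Sigma \) is passed as an already inverted 3x3 covariance, masks are represented as sets of point indices, and all function names (`spatial_prior`, `attention_row`, `aggregate`, `iom`) are hypothetical. The sketch uses the fact that a Gaussian density divided by its peak value reduces to \( \exp(-\tfrac{1}{2} d^T \Sigma^{-1} d) \) with \( d = p_k - p_q \).

```python
import math


def sigmoid(x):
    # Assumption: sigma in S(f_q, f_k) is the logistic sigmoid.
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def spatial_prior(p_q, p_k, inv_cov):
    """G(p_q, p_k): Gaussian density at p_k divided by its peak at p_q.

    The normalization constants cancel, leaving exp(-0.5 * d^T Sigma^{-1} d).
    inv_cov is the inverse covariance as a 3x3 nested list (assumed input form).
    """
    d = [k - q for q, k in zip(p_q, p_k)]
    mahalanobis_sq = sum(
        d[r] * inv_cov[r][c] * d[c] for r in range(3) for c in range(3)
    )
    return math.exp(-0.5 * mahalanobis_sq)


def attention_row(f_q, p_q, inv_cov, key_feats, key_pos):
    """Attention weights of one query over all keys: softmax of S * G."""
    logits = [
        sigmoid(dot(f_q, f_k)) * spatial_prior(p_q, p_k, inv_cov)
        for f_k, p_k in zip(key_feats, key_pos)
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(weights, values):
    """F_i = sum_j A_ij * V_j for a single query i."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def iom(mask_i, mask_j):
    """IoM as written in the summary: |M_i intersect M_j| / |M_i|.

    Masks are sets of point indices (an assumed representation).
    """
    if not mask_i:
        return 0.0
    return len(mask_i & mask_j) / len(mask_i)


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    # Two keys with identical features; the second is 3 units away.
    weights = attention_row(
        f_q=[1.0, 0.0],
        p_q=[0.0, 0.0, 0.0],
        inv_cov=identity,
        key_feats=[[1.0, 0.0], [1.0, 0.0]],
        key_pos=[[0.0, 0.0, 0.0], [3.0, 0.0, 0.0]],
    )
    print("attention weights:", weights)
    print("aggregated feature:", aggregate(weights, [[1.0, 0.0], [0.0, 1.0]]))
    print("IoM:", iom({1, 2, 3, 4}, {3, 4, 5}))
```

In this toy setup the two keys have the same feature, so any difference in their weights comes only from the spatial prior: the key farther from the query's Gaussian center receives the smaller weight. A pair of instances whose IoM exceeds some threshold would be treated as duplicates; the threshold itself is not given in the summary and is left out here.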