Abstract: Category-level pose estimation is a challenging task with many potential applications in computer vision and robotics. Recently, deep-learning-based approaches have made great progress, but are typically hindered by the need for large datasets of either pose-labelled real images or carefully tuned photorealistic simulators. This can be avoided by using only geometry inputs such as depth images to reduce the domain gap, but these approaches suffer from a lack of semantic information, which can be vital in the pose estimation problem. To resolve this conflict, we propose to utilize both geometric and semantic features obtained from a pre-trained foundation model. Our approach projects 2D features from this foundation model into 3D for a single object model per category, and then performs matching against this for new single-view observations of unseen object instances with a trained matching network. This requires significantly less data to train than prior methods since the semantic features are robust to object texture and appearance. We demonstrate this with a rich evaluation, showing improved performance over prior methods with a fraction of the data required.
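The projection of 2D features into 3D mentioned in the abstract can be illustrated with a standard pinhole back-projection. This is a minimal sketch, not the authors' implementation: the function name `lift_features_to_3d`, the intrinsics parameters `fx, fy, cx, cy`, and the assumption that the feature map has already been resized to the depth image's resolution are all illustrative choices that do not come from the paper.

```python
import numpy as np

def lift_features_to_3d(depth, feats, fx, fy, cx, cy):
    """Back-project per-pixel features to 3D points in the camera frame.

    Hypothetical helper for illustration only.
    depth: (H, W) depth map, 0 where no measurement is available.
    feats: (H, W, C) per-pixel feature map (assumed already at image resolution).
    fx, fy, cx, cy: pinhole intrinsics (assumed known).
    Returns (N, 3) points and (N, C) features for the N valid pixels.
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]        # v: row index, u: column index
    valid = depth > 0                # keep only pixels with a depth reading
    z = depth[valid]
    x = (u[valid] - cx) * z / fx     # pinhole model: u = fx * x / z + cx
    y = (v[valid] - cy) * z / fy     # pinhole model: v = fy * y / z + cy
    points = np.stack([x, y, z], axis=1)
    return points, feats[valid]

# Toy usage: a 2x2 depth map with one missing pixel and 2-channel features.
depth = np.array([[1.0, 0.0], [1.0, 2.0]])
feats = np.arange(8, dtype=float).reshape(2, 2, 2)
points, point_feats = lift_features_to_3d(depth, feats, 1.0, 1.0, 0.0, 0.0)
```

Each returned point carries the feature vector of the pixel it came from, which is the kind of paired geometric and semantic representation that a matching step could then consume.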

What problem does this paper attempt to address?

This paper addresses category-level object pose estimation in computer vision and robotics. The authors point out that although current deep-learning-based methods have made significant progress on instance-level object pose estimation, category-level pose estimation remains challenging. The main problems are:

1. **High data requirements**: Most methods require a large number of pose-annotated real images or data generated by carefully tuned photorealistic simulators, both of which are often difficult to obtain in practice.
2. **Trade-off between geometric and semantic information**: Using only geometric information (such as depth maps) reduces the domain gap but lacks semantic information, which leaves the pose ambiguous; using only RGB images introduces large variations in texture and appearance, which increases the amount of training data needed.

To overcome these challenges, the authors propose a method that combines geometric and semantic features: 2D semantic features are extracted by a pre-trained foundation model (such as DINOv2) and projected into 3D space for category-level object pose estimation. In this way, the method improves the performance and robustness of pose estimation while requiring less training data.

### Main contributions:

1. **Novel geometric and semantic representations**: Combining geometric and semantic features greatly improves the performance of category-level object pose estimation.
2. **Robust Transformer matching network**: A Transformer matching network is proposed for dense correspondence matching between partial and complete information in 3D space.
3. **Rich experimental evaluations**: Comparative experiments against multiple methods demonstrate the advantages of this method in data efficiency and performance.

### Method overview:

1. **Feature embedding**: A pre-trained foundation model (such as DINOv2) extracts 2D semantic features, which are projected into 3D space to produce 3D semantic features for each category.
2. **Semantic feature wrapping**: 3D semantic features are generated through multi-view rendering and feature fusion.
3. **Transformer matching network**: A Transformer fuses the partial 3D semantic features with the complete 3D semantic features for accurate 3D matching.
4. **Symmetry disambiguation**: Pose ambiguity for symmetric objects is resolved by constraining the object's xz-plane to pass through the origin of the camera coordinate system.

### Experimental results:

- **NOCS REAL275 dataset**: On metrics such as 3D IoU and rotation and translation accuracy, the method, trained only on synthetic data, performs comparably to methods trained on real data and even exceeds them on some metrics.
- **Wild6D dataset**: On Wild6D, which contains more categories and instances, the method generalizes well, especially on challenging categories (such as cups and laptops).
- **SUN RGB-D dataset**: In indoor scenes, the method significantly outperforms the baseline methods on the zero-shot object pose estimation task.

In general, this paper proposes a method that combines geometric and semantic features, addresses the data requirements and ambiguity problems in category-level object pose estimation, and demonstrates superior performance on multiple datasets.
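Once dense 3D correspondences between the observed partial points and the category model are available, a pose can be recovered by a least-squares fit. The summary does not state which solver the paper uses, so the following is only a sketch under that assumption: it applies the well-known Umeyama closed-form similarity alignment, and the name `fit_similarity` is hypothetical.

```python
import numpy as np

def fit_similarity(src, dst):
    """Least-squares scale s, rotation R, translation t with dst ~= s * R @ src + t.

    Closed-form Umeyama alignment; an assumed stand-in, not the paper's solver.
    src, dst: (N, 3) arrays of matched 3D points (row i of src matches row i of dst).
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d           # centre both point sets
    cov = xd.T @ xs / len(src)                # 3x3 cross-covariance matrix
    U, S, Vt = np.linalg.svd(cov)
    d = np.ones(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        d[-1] = -1.0                          # flip one axis to avoid a reflection
    R = U @ np.diag(d) @ Vt
    var_s = (xs ** 2).sum() / len(src)        # variance of the source points
    s = (S * d).sum() / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t

# Toy usage: recover a known scale, rotation about z, and translation.
rng = np.random.default_rng(0)
model_pts = rng.normal(size=(50, 3))
c, sn = np.cos(0.3), np.sin(0.3)
R_true = np.array([[c, -sn, 0.0], [sn, c, 0.0], [0.0, 0.0, 1.0]])
observed_pts = 1.5 * model_pts @ R_true.T + np.array([0.1, -0.2, 0.3])
scale, R_est, t_est = fit_similarity(model_pts, observed_pts)
```

With noisy or partially wrong matches, a fit like this is typically wrapped in a robust scheme such as RANSAC; whether the paper does so is not stated in the summary.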

GS-Pose: Category-Level Object Pose Estimation via Geometric and Semantic Correspondence

Learning Stereopsis from Geometric Synthesis for 6D Object Pose Estimation

HS-Pose: Hybrid Scope Feature Extraction for Category-level Object Pose Estimation

GPV-Pose: Category-level Object Pose Estimation Via Geometry-guided Point-wise Voting

Zero-Shot 3D Pose Estimation of Unseen Object by Two-Step RGB-D Fusion

Synthetic Depth Image-based Category-Level Object Pose Estimation with Effective Pose Decoupling and Shape Optimization

Instance-Adaptive and Geometric-Aware Keypoint Learning for Category-Level 6D Object Pose Estimation

Self-Supervised Geometric Correspondence for Category-Level 6D Object Pose Estimation in the Wild

TG-Pose: Delving Into Topology and Geometry for Category-Level Object Pose Estimation

Learning Geometric Consistency and Discrepancy for Category-Level 6D Object Pose Estimation from Point Clouds

Generative Category-Level Shape and Pose Estimation with Semantic Primitives

SecondPose: SE(3)-Consistent Dual-Stream Feature Fusion for Category-Level Pose Estimation

DONet: Learning Category-Level 6D Object Pose and Size Estimation from Depth Observation

Geometric Pose Affordance: 3D Human Pose with Scene Constraints

Unsupervised Learning of Category-Level 3D Pose from Object-Centric Videos

Leveraging SE(3) Equivariance for Self-Supervised Category-Level Object Pose Estimation

You Only Look at One: Category-Level Object Representations for Pose Estimation From a Single Example

GS-Pose: Generalizable Segmentation-based 6D Object Pose Estimation with 3D Gaussian Splatting

SD-Pose: Semantic Decomposition for Cross-Domain 6D Object Pose Estimation

Learning Symmetry-Aware Geometry Correspondences for 6D Object Pose Estimation

Nothing But Geometric Constraints: A Model-Free Method for Articulated Object Pose Estimation