Abstract: Estimating 6D poses and reconstructing 3D shapes of objects in open-world scenes from RGB-depth image pairs is challenging. Many existing methods learn geometric features that correspond to specific templates while disregarding shape variations and pose differences among objects in the same category. As a result, they underperform on unseen object instances in complex environments. Other approaches aim for category-level estimation and reconstruction by leveraging normalized geometric structure priors, but reconstruction from a static prior struggles with substantial intra-class variation. To address these problems, we propose DTF-Net, a novel framework for pose estimation and shape reconstruction based on implicit neural fields of object categories. In DTF-Net, we design a deformable template field to represent the general category-wise shape latent features and the intra-category geometric deformation features. The field establishes continuous shape correspondences, deforming the category template into arbitrary observed instances to accomplish shape reconstruction. We introduce a pose regression module that shares the deformation features and template codes from the fields to accurately estimate the 6D pose of each object in the scene. We integrate a multi-modal representation extraction module to extract object features and semantic masks, enabling end-to-end inference. Moreover, during training, we apply a shape-invariant training strategy and a viewpoint sampling method to further enhance the model's ability to extract object pose features. Extensive experiments on the REAL275 and CAMERA25 datasets demonstrate the superiority of DTF-Net in both synthetic and real scenes. Furthermore, we show that DTF-Net effectively supports grasping tasks with a real robot arm.
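
The idea of deforming a shared category template into an observed instance can be illustrated with a toy sketch. This is not the DTF-Net implementation: the abstract describes learned neural fields, whereas the sketch below substitutes closed-form functions. The names `template_field`, `deformation`, and `instance_field`, the unit-sphere template, and the use of a per-axis scale as the instance code are all assumptions made for illustration only.

```python
import math

def template_field(p):
    """Toy category template (assumption): implicit value of a unit sphere.

    Negative inside, zero on the surface, positive outside.
    """
    return math.sqrt(sum(c * c for c in p)) - 1.0

def deformation(x, code):
    """Hypothetical deformation field.

    Returns the offset that carries an instance-space point x into
    template space. Here the instance code is a per-axis scale, so the
    offset is x / code - x; a learned network would replace this.
    """
    return tuple(xi / si - xi for xi, si in zip(x, code))

def instance_field(x, code):
    """Evaluate the instance shape by querying the template at the deformed point."""
    offset = deformation(x, code)
    warped = tuple(xi + di for xi, di in zip(x, offset))
    return template_field(warped)

# An instance whose code stretches the sphere into an ellipsoid
# with semi-axes 2.0, 1.0, and 0.5.
code = (2.0, 1.0, 0.5)
on_surface = instance_field((2.0, 0.0, 0.0), code)   # point on the ellipsoid
inside = instance_field((0.0, 0.0, 0.0), code)       # ellipsoid centre
outside = instance_field((3.0, 0.0, 0.0), code)      # beyond the long axis
```

Because every instance is expressed through the same template, each instance-space point is tied to a template-space point, which is the sense in which such a field gives continuous correspondences between shapes of one category. Note that the warped value is an implicit function whose zero level set is the ellipsoid; it is not an exact signed distance.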

Learning shared template representation with augmented feature for multi-object pose estimation

Supplementary Material: Quasi-Dense Similarity Learning for Multiple Object Tracking

Unseen Object Pose Estimation via Registration

Realtime and Robust Object Matching with a Large Number of Templates

Zero-Shot 3D Pose Estimation of Unseen Object by Two-Step RGB-D Fusion

Robust Object Recognition Via Weakly Supervised Metric and Template Learning

Realtime object matching with robust dominant orientation templates

Learning Mixed Templates for Object Recognition

GS-Pose: Category-Level Object Pose Estimation via Geometric and Semantic Correspondence

Learning accurate template matching with differentiable coarse-to-fine correspondence refinement

PoseMatcher: One-shot 6D Object Pose Estimation by Deep Feature Matching

Template NeRF: Towards Modeling Dense Shape Correspondences from Category-Specific Object Images

Deep Template Matching for Pedestrian Attribute Recognition with the Auxiliary Supervision of Attribute-wise Keypoints

AtptTrack: Asymmetric Transformer Tracker With Prior Templates

Learning a Proposal Classifier for Multiple Object Tracking

Multi-Person Articulated Tracking With Spatial and Temporal Embeddings

DTF-Net: Category-Level Pose Estimation and Shape Reconstruction via Deformable Template Field

Cross-Viewpoint Template Matching Based on Heterogeneous Feature Alignment and Pixel-Wise Consensus for Air- and Space-Based Platforms

Single-Shot and Multi-Shot Feature Learning for Multi-Object Tracking

Representation Alignment Contrastive Regularization for Multi-Object Tracking

Shape-Former: Bridging CNN and Transformer via ShapeConv for multimodal image matching