Abstract: 6DoF pose estimation has received much attention in recent years. A key challenge is estimating object pose when the target is weakly textured. In this work, we present the cross-modal Transformer (CMT-6D), a Transformer-based network for highly accurate, workpiece-level 6D object pose estimation from a single RGB-D image. Our main insight is to let the surface texture information of RGB images and the geometric feature information of point clouds complement each other through a cross-modal Transformer, enabling accurate pose estimation for weakly textured targets. Specifically, the framework consists of two parallel Transformer branches, named Point Transformer and Image Transformer. Each branch uses a pyramid-structured encoder and a multi-layer perceptron decoder; the Point Transformer extracts geometric features from point clouds, and the Image Transformer extracts texture features from RGB images. A cross-modal key-query strategy is then proposed to exchange information between the two parallel branches. In addition, at the output representation stage, we design a simple and effective 3D keypoint selection algorithm to address the problem that keypoints tend to fall in non-salient regions. Finally, to improve the accuracy of pose estimation while meeting real-time requirements, we propose a lightweight iterative pose network based on target feature regression to correct the error of the initial pose estimate. Extensive experiments on the LineMOD, Occlusion LineMOD, T-LESS, and YCB-Video datasets demonstrate the effectiveness and superiority of our method. Comparisons with state-of-the-art methods show that our method improves 6D pose estimation performance. Ablation studies and visualizations validate the design of CMT-6D.
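To make the cross-modal key-query idea concrete, the following is a minimal NumPy sketch of bidirectional cross-attention between two feature sets. It is an illustration under stated assumptions, not the CMT-6D implementation: the abstract does not specify the layer design, so the single-head scaled dot-product form, the residual connection, and all names (`cross_attention`, `cross_modal_exchange`, the `params` dictionary and its keys) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query_feats, context_feats, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention (assumed form).

    Queries come from one modality; keys and values come from the other.
    query_feats:   (n_query, d)
    context_feats: (n_context, d)
    Returns the attended features (n_query, d) and the attention weights.
    """
    q = query_feats @ w_q
    k = context_feats @ w_k
    v = context_feats @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights


def cross_modal_exchange(point_feats, image_feats, params):
    """Exchange information in both directions between the two branches.

    The residual addition is an assumption made for this sketch.
    """
    from_image, _ = cross_attention(
        point_feats, image_feats, *params["point_from_image"]
    )
    from_points, _ = cross_attention(
        image_feats, point_feats, *params["image_from_point"]
    )
    return point_feats + from_image, image_feats + from_points


# Toy usage with random features: 5 points, 7 pixels, feature size 8.
rng = np.random.default_rng(0)
d = 8
point_feats = rng.normal(size=(5, d))
image_feats = rng.normal(size=(7, d))
params = {
    "point_from_image": [rng.normal(size=(d, d)) * 0.1 for _ in range(3)],
    "image_from_point": [rng.normal(size=(d, d)) * 0.1 for _ in range(3)],
}
fused_points, fused_pixels = cross_modal_exchange(point_feats, image_feats, params)
```

In this sketch each branch keeps its own number of tokens after the exchange, so the fused point features can still be used per point and the fused image features per pixel.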

CMA: Cross-modal Attention for 6D Object Pose Estimation

MoreFusion: Multi-object Reasoning for 6D Pose Estimation from Volumetric Fusion

A modal fusion network with dual attention mechanism for 6D pose estimation

Recurrent Volume-based 3D Feature Fusion for Real-time Multi-view Object Pose Estimation

Recurrent Volume-Based 3-D Feature Fusion for Real-Time Multiview Object Pose Estimation

A Transformer-based multi-modal fusion network for 6D pose estimation

FEIF: Feature Excitation and Interactive Fusion for 6D Object Pose Estimation

MixedFusion: 6D Object Pose Estimation from Decoupled RGB-Depth Features

Zero-Shot 3D Pose Estimation of Unseen Object by Two-Step RGB-D Fusion

PA-Pose: Partial Point Cloud Fusion Based on Reliable Alignment for 6D Pose Tracking

Cross-Modal Attentional Context Learning for RGB-D Object Detection

Efficient Bi-manipulation using RGBD Multi-model Fusion based on Attention Mechanism

CMT-6D: a lightweight iterative 6DoF pose estimation network based on cross-modal Transformer

B2C-AFM: Bi-Directional Co-Temporal and Cross-Spatial Attention Fusion Model for Human Action Recognition

PAM: Point-wise Attention Module for 6D Object Pose Estimation

6-DoF grasp estimation method that fuses RGB-D data based on external attention

Attention Guided 6D Object Pose Estimation with Multi-constraints Voting Network

Towards Two-view 6D Object Pose Estimation: A Comparative Study on Fusion Strategy

Cross-Modality Attentive Feature Fusion for Object Detection in Multispectral Remote Sensing Imagery

DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion

MOTPose: Multi-object 6D Pose Estimation for Dynamic Video Sequences using Attention-based Temporal Fusion