CLIPose: Category-Level Object Pose Estimation with Pre-trained Vision-Language Knowledge

Xiao Lin, Minghao Zhu, Ronghao Dang, Guangliang Zhou, Shaolong Shu, Feng Lin, Chengju Liu, Qijun Chen
DOI: https://doi.org/10.1109/tcsvt.2024.3397997
IF: 5.859
2024-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Most existing category-level object pose estimation methods are devoted to learning object category information from the point cloud modality. However, the scale of 3D datasets is limited due to the high cost of 3D data collection and annotation. Consequently, the category features extracted from these limited point cloud samples may not be comprehensive. This motivates us to investigate whether we can draw on knowledge from other modalities to obtain category information. Inspired by this motivation, we propose CLIPose, a novel 6D pose framework that employs a pre-trained vision-language model to better learn object category information, fully leveraging the abundant semantic knowledge in the image and text modalities. To make the 3D encoder learn category-specific features more efficiently, we align the representations of the three modalities in feature space via multi-modal contrastive learning. In addition to exploiting the pre-trained knowledge of the CLIP model, we also expect it to be more sensitive to pose parameters. Therefore, we introduce a prompt tuning approach to fine-tune the image encoder while incorporating rotation and translation information into the text descriptions. CLIPose achieves state-of-the-art performance on two mainstream benchmark datasets, REAL275 and CAMERA25, and runs in real time during inference (40 FPS).
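The abstract does not state the form of the contrastive objective. As a rough illustration of what aligning three modalities in feature space can look like, the sketch below uses a symmetric InfoNCE loss, a common choice for CLIP-style training. The function names, the temperature value, and the decision to pair point-cloud features with image and text features separately are all assumptions made for illustration, not details taken from the paper.

```python
import math


def normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    # Assumes the vector is non-zero.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches; row i of `a` matches row i of `b`."""
    a = [normalize(v) for v in a]
    b = [normalize(v) for v in b]
    n = len(a)
    # Pairwise cosine similarities scaled by the temperature.
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy(rows):
        # Mean cross-entropy where the correct class of row i is index i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    columns = [list(c) for c in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(columns))


def tri_modal_loss(point, image, text, temperature=0.07):
    # Hypothetical pairing: pull 3D features toward both image and text features.
    return info_nce(point, image, temperature) + info_nce(point, text, temperature)
```

With matched embeddings across the three batches the loss is close to zero, and it grows when the image or text rows are paired with the wrong point-cloud rows, which is the behavior a contrastive alignment objective is meant to have.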