Open-Vocabulary Category-Level Object Pose and Size Estimation
Junhao Cai, Yisheng He, Weihao Yuan, Siyu Zhu, Zilong Dong, Liefeng Bo, Qifeng Chen
DOI: https://doi.org/10.1109/lra.2024.3430156
IF: 5.2
2024-01-01
IEEE Robotics and Automation Letters
Abstract: This paper studies a new open-set problem, the open-vocabulary category-level object pose and size estimation. Given human text descriptions of arbitrary novel object categories, the robot agent seeks to predict the position, orientation, and size of the target object in the observed scene image. To enable such generalizability, we first introduce OO3D-9D, a large-scale photorealistic dataset for this task. Derived from OmniObject3D, OO3D-9D is the largest and most diverse dataset in the field of category-level object pose and size estimation. It includes additional annotations for the symmetry axis of each category, which help resolve symmetric ambiguity. Apart from the large-scale dataset, we find another key to enabling such generalizability is leveraging the strong prior knowledge in pre-trained visual-language foundation models. We then propose a framework built on pre-trained DinoV2 and text-to-image stable diffusion models to infer the normalized object coordinate space (NOCS) maps of the target instances. This framework fully leverages the visual semantic prior from DinoV2 and the aligned visual and language knowledge within the text-to-image diffusion model, which enables generalization to various text descriptions of novel categories. Comprehensive quantitative and qualitative experiments demonstrate that the proposed open-vocabulary method, trained on our large-scale synthesized data, significantly outperforms the baseline and can effectively generalize to real-world images of unseen categories. The project page is at https://ov9d.github.io.
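
The abstract says the framework infers NOCS maps but does not say how position, orientation, and size are then obtained from them. As an illustration only, the sketch below shows one common way NOCS-based pipelines do this: a least-squares similarity alignment (Umeyama) between predicted NOCS coordinates and the corresponding back-projected camera-frame points. This is an assumption, not the authors' confirmed procedure. The function names, the input layout (two N x 3 arrays of matched points), and the bounding-box size convention are all hypothetical.

```python
import numpy as np


def umeyama_similarity(src, dst):
    """Least-squares scale s, rotation R, translation t with dst ≈ s * R @ src + t."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = src.shape[0]

    # Center both point sets.
    mu_src = src.mean(axis=0)
    mu_dst = dst.mean(axis=0)
    src_c = src - mu_src
    dst_c = dst - mu_dst

    # Cross-covariance between target and source (3 x 3).
    cov = dst_c.T @ src_c / n
    U, D, Vt = np.linalg.svd(cov)

    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0

    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / n
    scale = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - scale * (R @ mu_src)
    return scale, R, t


def pose_and_size_from_nocs(nocs_points, camera_points):
    """Hypothetical helper: recover pose and a metric size from matched points.

    nocs_points:   (N, 3) predicted NOCS coordinates of object pixels.
    camera_points: (N, 3) the same pixels back-projected with depth.
    """
    scale, R, t = umeyama_similarity(nocs_points, camera_points)
    # Assumed convention: metric size is the NOCS-space extent of the given
    # points times the recovered scale. With partial views this can
    # underestimate the true extent.
    nocs_points = np.asarray(nocs_points, dtype=float)
    extent = nocs_points.max(axis=0) - nocs_points.min(axis=0)
    return scale, R, t, scale * extent
```

In practice such an alignment is usually wrapped in an outlier-rejection loop such as RANSAC, because predicted NOCS maps and depth measurements are noisy. The sketch omits that step.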