From Words to Poses: Enhancing Novel Object Pose Estimation with Vision Language Models

Tessa Pulli, Stefan Thalhammer, Simon Schwaiger, Markus Vincze
DOI: https://doi.org/10.48550/arXiv.2409.05413
2024-09-09
Abstract: Robots are increasingly envisioned to interact in real-world scenarios, where they must continuously adapt to new situations. To detect and grasp novel objects, zero-shot pose estimators determine poses without prior knowledge. Recently, vision language models (VLMs) have shown considerable advances in robotics applications by establishing an understanding between language input and image input. In our work, we take advantage of VLMs' zero-shot capabilities and translate this ability to 6D object pose estimation. We propose a novel framework for promptable zero-shot 6D object pose estimation using language embeddings. The idea is to derive a coarse location of an object based on the relevancy map of a language-embedded NeRF reconstruction and to compute the pose estimate with a point cloud registration method. Additionally, we provide an analysis of LERF's suitability for open-set object pose estimation. We examine hyperparameters, such as activation thresholds for relevancy maps, and investigate the zero-shot capabilities at the instance and category level. Furthermore, we plan to conduct robotic grasping experiments in a real-world setting.
Computer Vision and Pattern Recognition, Robotics
What problem does this paper attempt to address?
This paper addresses 6D pose estimation of previously unseen objects in robotics. The authors propose a zero-shot 6D object pose estimation framework based on vision language models (VLMs). The framework exploits the ability of VLMs to interpret new scenes without object-specific training: it locates the target object from a text prompt and then estimates the object's 6D pose with a point cloud registration method. The main contributions of the paper are:

1. **A zero-shot object pose estimation framework based on language embeddings**: The framework identifies an object and estimates its pose from a natural language description alone, without prior training on that specific object.
2. **An analysis of the zero-shot capabilities of LERF (Language Embedded Radiance Fields)**: The paper studies how suitable LERF is for open-vocabulary object pose estimation and examines the key requirements for improving that suitability.
3. **A method that combines NeRF (Neural Radiance Fields) and LERF**: The relevancy map generated by LERF gives the coarse location of the target object, and a point cloud registration method such as TEASER++ then estimates the object's 6D pose.

The paper also discusses future research directions, such as the potential of zero-shot VLMs in industrial environments and ways to overcome the assumption of prior object knowledge that limits current methods. The aim is to help robots adapt to complex tasks in unknown environments, particularly domestic and industrial ones.
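
The two-stage idea described above (threshold a relevancy map to get a coarse object segment, then register a model to that segment) can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration and is not the authors' implementation: the relevancy scores are synthetic rather than produced by LERF, the helper names `select_by_relevancy` and `kabsch` are hypothetical, and the registration step is a plain Kabsch least-squares fit with known point correspondences, standing in for a robust method such as TEASER++.

```python
import numpy as np


def select_by_relevancy(points, relevancy, threshold=0.5):
    """Keep the points whose relevancy score is at or above the threshold."""
    mask = relevancy >= threshold
    return points[mask]


def kabsch(source, target):
    """Least-squares rigid transform (R, t) such that target ~ source @ R.T + t.

    Assumes row i of `source` corresponds to row i of `target`.
    """
    src_centroid = source.mean(axis=0)
    tgt_centroid = target.mean(axis=0)
    # Cross-covariance of the centered point sets.
    H = (source - src_centroid).T @ (target - tgt_centroid)
    U, _, Vt = np.linalg.svd(H)
    # Correct for a possible reflection so that det(R) = +1.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = tgt_centroid - R @ src_centroid
    return R, t


# Synthetic scene: an object model placed at a known pose, plus clutter.
rng = np.random.default_rng(0)
model = rng.uniform(-0.1, 0.1, size=(50, 3))

angle = np.deg2rad(30.0)
R_true = np.array([
    [np.cos(angle), -np.sin(angle), 0.0],
    [np.sin(angle), np.cos(angle), 0.0],
    [0.0, 0.0, 1.0],
])
t_true = np.array([0.4, -0.2, 0.7])

object_in_scene = model @ R_true.T + t_true
clutter = rng.uniform(-1.0, 1.0, size=(200, 3))
scene = np.vstack([object_in_scene, clutter])

# Assumed relevancy scores: high on the object, low on the clutter.
relevancy = np.concatenate([np.full(50, 0.9), np.full(200, 0.1)])

# Stage 1: coarse localisation by thresholding the relevancy scores.
segment = select_by_relevancy(scene, relevancy, threshold=0.5)

# Stage 2: pose estimate by rigid registration of the model to the segment.
R_est, t_est = kabsch(model, segment)
```

In this toy setup the threshold cleanly separates object from clutter, so the recovered pose matches the ground truth. With real relevancy maps the segment would contain outliers and no known correspondences, which is why a robust registration method is needed in practice and why the activation threshold is a hyperparameter worth examining.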