From Words to Poses: Enhancing Novel Object Pose Estimation with Vision Language Models

Tessa Pulli, Stefan Thalhammer, Simon Schwaiger, Markus Vincze

DOI: https://doi.org/10.48550/arXiv.2409.05413

2024-09-09

Abstract: Robots are increasingly envisioned to interact in real-world scenarios, where they must continuously adapt to new situations. To detect and grasp novel objects, zero-shot pose estimators determine poses without prior knowledge. Recently, vision language models (VLMs) have shown considerable advances in robotics applications by establishing an understanding between language input and image input. In our work, we take advantage of VLMs' zero-shot capabilities and translate this ability to 6D object pose estimation. We propose a novel framework for promptable zero-shot 6D object pose estimation using language embeddings. The idea is to derive a coarse location of an object based on the relevancy map of a language-embedded NeRF reconstruction and to compute the pose estimate with a point cloud registration method. Additionally, we provide an analysis of LERF's suitability for open-set object pose estimation. We examine hyperparameters, such as activation thresholds for relevancy maps, and investigate the zero-shot capabilities on an instance- and category-level. Furthermore, we plan to conduct robotic grasping experiments in a real-world setting.

Computer Vision and Pattern Recognition, Robotics

What problem does this paper attempt to address?

This paper addresses 6D pose estimation of previously unseen objects in robotics. The authors propose a zero-shot 6D object pose estimation framework based on vision language models (VLMs). The framework exploits the ability of VLMs to interpret new scenes without object-specific training: it locates the target object from a text prompt and then estimates the object's 6D pose with a point cloud registration method.

The main contributions of the paper are:

1. **Introducing a zero-shot object pose estimation framework based on language embeddings**: The framework identifies an object and estimates its pose from a natural language description alone, without prior training on that specific object.
2. **Analyzing the zero-shot capabilities of LERF (Language Embedded Radiance Fields)**: The paper studies how suitable LERF is for open-vocabulary object pose estimation and explores the key requirements for improving its applicability to this task.
3. **Proposing a method that combines NeRF (Neural Radiance Fields) and LERF**: The coarse location of the target object is obtained from the relevancy map generated by LERF, and the 6D pose is then estimated with a point cloud registration method such as TEASER++.

The paper also discusses future research directions, such as exploring the potential of zero-shot VLMs in industrial environments and overcoming the assumption of prior object knowledge that limits current methods. The aim is to enable robots to adapt better to complex tasks in unknown environments, especially in domestic and industrial settings.
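The localize-then-register idea summarized above can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the relevancy scores are synthetic stand-ins for a LERF relevancy map, the `0.5` threshold is an arbitrary illustrative value, and the registration step uses the classical Kabsch (SVD) alignment with known point correspondences in place of TEASER++, which is designed to be robust to outliers and imperfect correspondences. The function names `crop_by_relevancy`, `kabsch`, and `demo` are hypothetical.

```python
import numpy as np


def crop_by_relevancy(points, relevancy, threshold=0.5):
    """Keep the 3D points whose relevancy score exceeds the threshold.

    `relevancy` is assumed to hold one score per point (synthetic here).
    """
    mask = relevancy > threshold
    return points[mask]


def kabsch(source, target):
    """Least-squares rigid transform (R, t) such that R @ source_i + t ~ target_i.

    Assumes row i of `source` corresponds to row i of `target`.
    This is a simple stand-in for a robust solver, not TEASER++.
    """
    src_c = source.mean(axis=0)
    tgt_c = target.mean(axis=0)
    # Cross-covariance of the centered point sets.
    H = (source - src_c).T @ (target - tgt_c)
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection so that det(R) = +1.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = tgt_c - R @ src_c
    return R, t


def demo():
    """Synthetic scene: a transformed object model plus background clutter."""
    rng = np.random.default_rng(0)
    model = rng.uniform(-0.1, 0.1, size=(50, 3))  # object model points

    # Ground-truth pose: 30 degree rotation about z plus a translation.
    a = np.deg2rad(30.0)
    R_true = np.array([[np.cos(a), -np.sin(a), 0.0],
                       [np.sin(a), np.cos(a), 0.0],
                       [0.0, 0.0, 1.0]])
    t_true = np.array([0.4, -0.2, 0.7])

    object_in_scene = model @ R_true.T + t_true
    clutter = rng.uniform(-1.0, 1.0, size=(200, 3))
    scene = np.vstack([object_in_scene, clutter])

    # Synthetic relevancy: high on the prompted object, low elsewhere.
    relevancy = np.concatenate([np.full(50, 0.9), np.full(200, 0.1)])

    # Step 1: coarse localization by thresholding the relevancy scores.
    segment = crop_by_relevancy(scene, relevancy, threshold=0.5)

    # Step 2: register the object model to the cropped segment.
    R, t = kabsch(model, segment)
    return R, t, R_true, t_true


if __name__ == "__main__":
    R, t, R_true, t_true = demo()
    print("rotation error:", np.abs(R - R_true).max())
    print("translation error:", np.abs(t - t_true).max())
```

In this toy setting the crop removes all clutter and the correspondences are exact, so the recovered pose matches the ground truth up to floating-point error. With a real relevancy map the segment would typically contain outliers and unordered points, which is the situation a robust registration method is meant to handle.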

ZS6D: Zero-shot 6D Object Pose Estimation using Vision Transformers

Dream2Real: Zero-Shot 3D Object Rearrangement with Vision-Language Models

Expanding Frozen Vision-Language Models without Retraining: Towards Improved Robot Perception

Diffusion Features for Zero-Shot 6DoF Object Pose Estimation

Vision-Based Categorical Object Pose Estimation and Manipulation

VLN-Game: Vision-Language Equilibrium Search for Zero-Shot Semantic Navigation

VLPose: Bridging the Domain Gap in Pose Estimation with Language-Vision Tuning

High-resolution open-vocabulary object 6D pose estimation

VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation

Zero123-6D: Zero-shot Novel View Synthesis for RGB Category-level 6D Pose Estimation

LAMP: Leveraging Language Prompts for Multi-person Pose Estimation

Open-vocabulary object 6D pose estimation

VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

Task-oriented Robotic Manipulation with Vision Language Models

Reflectance Estimation for Proximity Sensing by Vision-Language Models: Utilizing Distributional Semantics for Low-Level Cognition in Robotics

Structured Spatial Reasoning with Open Vocabulary Object Detectors

LanPose: Language-Instructed 6D Object Pose Estimation for Robotic Assembly

Physically Grounded Vision-Language Models for Robotic Manipulation

HiFi-CS: Towards Open Vocabulary Visual Grounding For Robotic Grasping Using Vision-Language Models

Leveraging Large (Visual) Language Models for Robot 3D Scene Understanding