Combining VLM and LLM for Enhanced Semantic Object Perception in Robotic Handover Tasks

Qifeng Zhang, Christian Limberg, Jiayang Huang, Qiang Li, Syed Muhammad Nashit Arshad
DOI: https://doi.org/10.1109/WRCSARA64167.2024.10685688
2024-08-23
Abstract: We combine a Large Language Model (LLM) and a Vision Language Model (VLM) to perform a robot-to-human handover task using semantic object knowledge. Current object perception systems for this task often work with a fixed set of objects and primarily consider geometric properties, neglecting semantic knowledge about where an object should or should not be grasped. By applying the LLM and VLM in a zero-shot fashion, we demonstrate that our approach can identify optimal and semantically correct handover parts for both the robot and the human. We validate our approach quantitatively across several object categories.
Computer Science, Engineering