LaMI: Large Language Models for Multi-Modal Human-Robot Interaction

Chao Wang, Stephan Hasler, Daniel Tanneberg, Felix Ocker, Frank Joublin, Antonello Ceravola, Joerg Deigmoeller, Michael Gienger
DOI: https://doi.org/10.1145/3613905.3651029
2024-04-12
Abstract: This paper presents an innovative large language model (LLM)-based robotic system for enhancing multi-modal human-robot interaction (HRI). Traditional HRI systems relied on complex designs for intent estimation, reasoning, and behavior generation, which were resource-intensive. In contrast, our system empowers researchers and practitioners to regulate robot behavior through three key aspects: providing high-level linguistic guidance, creating "atomic actions" and expressions the robot can use, and offering a set of examples. Implemented on a physical robot, it demonstrates proficiency in adapting to multi-modal inputs and determining the appropriate manner of action to assist humans with its arms, following researchers' defined guidelines. Simultaneously, it coordinates the robot's lid, neck, and ear movements with speech output to produce dynamic, multi-modal expressions. This showcases the system's potential to revolutionize HRI by shifting from conventional, manual state-and-flow design methods to an intuitive, guidance-based, and example-driven approach. Supplementary material can be found at
Robotics, Human-Computer Interaction
What problem does this paper attempt to address?
The paper addresses the cost of designing multimodal Human-Robot Interaction (HRI) systems. Traditional HRI systems rely on complex designs for intent estimation, reasoning, and behavior generation, which are resource-intensive. The paper proposes a Large Language Model (LLM)-driven robotic system in which researchers and practitioners regulate robot behavior through three key aspects:

1. **High-level Language Guidance**: Researchers and practitioners describe the desired behavior in natural language.
2. **Atomic Actions and Expressions**: They define the "atomic" actions and expressions that the robot can use.
3. **Example Set**: They provide a set of examples that illustrate the intended behavior.

Implemented on a physical robot, the system adapts to multimodal inputs and determines the appropriate manner of action to assist humans with its arms, following the guidelines the researchers define. It also coordinates the movements of the robot's lid, neck, and ears with speech output to produce dynamic multimodal expressions. The authors present this as a shift from conventional, manual state-and-flow design to an intuitive, guidance-based, and example-driven approach.
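To make the three aspects concrete, the sketch below shows one way guidance, atomic actions, and examples could be combined into a single prompt for an LLM. It is an illustration only, not the authors' implementation: the `AtomicAction` class, the `build_prompt` function, the action names, the section headings, and the sample guidance text are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AtomicAction:
    """Hypothetical description of one capability exposed to the LLM."""
    name: str
    description: str


def build_prompt(guidance: str,
                 actions: list[AtomicAction],
                 examples: list[tuple[str, str]],
                 observation: str) -> str:
    """Join guidance, atomic actions, and examples into one prompt string."""
    action_lines = [f"- {a.name}: {a.description}" for a in actions]
    example_lines = [f"Input: {inp}\nOutput: {out}" for inp, out in examples]
    sections = [
        "## Guidance", guidance,
        "## Available actions", "\n".join(action_lines),
        "## Examples", "\n\n".join(example_lines),
        "## Current observation", observation,
    ]
    return "\n".join(sections)


# All names and texts below are assumptions made for illustration.
ACTIONS = [
    AtomicAction("speak", "Say a sentence to a person."),
    AtomicAction("hand_over", "Pass an object to a person with the arm."),
    AtomicAction("move_ears", "Move the ears to express attention."),
]

prompt = build_prompt(
    guidance="Help people only when they cannot manage alone.",
    actions=ACTIONS,
    examples=[("Person asks for a cup that is out of reach.", "hand_over(cup)")],
    observation="Person A looks at the bottle and says: 'Could you pass it?'",
)
print(prompt)
```

In this sketch, changing the robot's behavior means editing the guidance text, the action list, or the examples rather than rewriting a state machine, which is the kind of shift the summary describes.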