NMM-HRI: Natural Multi-modal Human-Robot Interaction with Voice and Deictic Posture via Large Language Model

Yuzhi Lai, Shenghai Yuan, Youssef Nassar, Mingyu Fan, Atmaraaj Gopal, Arihiro Yorita, Naoyuki Kubota, Matthias Rätsch
2025-01-01
Abstract: Translating human intent into robot commands is crucial for the future of service robots in an aging society. Existing Human-Robot Interaction (HRI) systems relying on gestures or verbal commands are impractical for the elderly due to difficulties with complex syntax or sign language. To address the challenge, this paper introduces a multi-modal interaction framework that combines voice and deictic posture information to create a more natural HRI system. The visual cues are first processed by the object detection model to gain a global understanding of the environment, and then bounding boxes are estimated based on depth information. By using a large language model (LLM) with voice-to-text commands and temporally aligned selected bounding boxes, robot action sequences can be generated, while key control syntax constraints are applied to avoid potential LLM hallucination issues. The system is evaluated on real-world tasks with varying levels of complexity using a Universal Robots UR3e manipulator. Our method demonstrates significantly better performance in HRI in terms of accuracy and robustness. To benefit the research community and the general public, we will make our code and design open-source.
Robotics
What problem does this paper attempt to address?
The key problem this paper attempts to solve is: **how to design a more natural, intuitive, and easy-to-use multi-modal human-robot interaction (HRI) system for the elderly, so that they can interact with service robots easily and reliably**.

### Specific problems and challenges include:

1. **Limitations of existing HRI systems**: Current HRI systems usually rely on gestures or voice commands, which are impractical for the elderly. For example, complex grammatical structures and sign languages are difficult for them to master.
2. **Challenges in applying LLMs to HRI**: Although large language models (LLMs) perform well at enhancing communication between humans and robots, applying them directly to HRI raises several problems:
   - They require users to input detailed, structured text commands, which may be too complicated for the elderly.
   - They lack integrated perception and struggle to understand environmental context or specific actions.
   - They are prone to "hallucinations", that is, generating inaccurate or unsafe responses, which may have harmful consequences in control systems.

### Solutions proposed in the paper:

To address these challenges, the authors propose NMM-HRI, an HRI framework based on natural multi-modal fusion. It combines voice and deictic posture information, so that users can compose a series of actions in simple, intuitive language and identify objects or locations by pointing. Specifically:

- **Visual cue processing**: An object detection model first provides a global understanding of the environment, and bounding boxes are then estimated based on depth information.
- **Application of LLM**: A large language model processes voice-to-text commands together with the temporally aligned selected bounding boxes to generate robot action sequences.
- **Control grammar constraints**: Key control grammar constraints are applied to avoid potential LLM hallucination problems.
- **Real-world evaluation**: The system was evaluated on real-world tasks using a Universal Robots UR3e manipulator, where it showed significant advantages in accuracy and robustness.

### Main contributions:

1. Proposes NMM-HRI, a parallel multi-modal HRI method that can efficiently construct complex temporal control sequences.
2. Mitigates LLM hallucination in HRI settings by structuring output response tokens to ensure safety.
3. Demonstrates better performance than existing HRI methods in benchmark tests, especially in reducing the command syntax users must memorize and in achieving fast input speed.

In conclusion, this paper aims to improve the efficiency and reliability of interaction between the elderly and service robots by introducing the NMM-HRI framework, so as to better meet the needs of an aging society.
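
One way to picture the control grammar constraints is as a validation step between the LLM and the robot controller. The sketch below is illustrative only and is not the authors' implementation: the action names, the JSON plan format, the box identifiers, and the `validate_plan` helper are all assumptions introduced here. It rejects any LLM output that uses an action outside a fixed vocabulary or a target that was not among the bounding boxes selected by pointing.

```python
from __future__ import annotations

import json

# Hypothetical action vocabulary; the paper's actual control syntax is not given here.
ALLOWED_ACTIONS = {"pick", "place", "move_to"}


def validate_plan(llm_output: str, selected_box_ids: set[str]) -> list[dict]:
    """Parse an LLM-generated plan and reject anything outside the constraints.

    Assumed format: a JSON list of steps, each {"action": str, "target": str},
    where "target" must be the id of a bounding box the user pointed at.
    """
    try:
        plan = json.loads(llm_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"output is not valid JSON: {exc}") from exc

    if not isinstance(plan, list) or not plan:
        raise ValueError("plan must be a non-empty list of steps")

    for i, step in enumerate(plan):
        if not isinstance(step, dict) or set(step) != {"action", "target"}:
            raise ValueError(f"step {i}: expected exactly the keys 'action' and 'target'")
        action, target = step["action"], step["target"]
        if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
            raise ValueError(f"step {i}: action {action!r} is not in the allowed vocabulary")
        if not isinstance(target, str) or target not in selected_box_ids:
            raise ValueError(f"step {i}: target {target!r} is not a selected bounding box")
    return plan


if __name__ == "__main__":
    boxes = {"box_0", "box_1"}  # ids of boxes selected by deictic posture (assumed)
    good = '[{"action": "pick", "target": "box_0"}, {"action": "place", "target": "box_1"}]'
    print(validate_plan(good, boxes))

    bad = '[{"action": "throw", "target": "box_0"}]'  # hallucinated action
    try:
        validate_plan(bad, boxes)
    except ValueError as err:
        print("rejected:", err)
```

Under these assumptions, a rejected plan never reaches the manipulator; the caller could re-prompt the LLM or ask the user to repeat the command.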