Abstract: Translating human intent into robot commands is crucial for the future of service robots in an aging society. Existing Human-Robot Interaction (HRI) systems relying on gestures or verbal commands are impractical for the elderly due to difficulties with complex syntax or sign language. To address this challenge, this paper introduces a multi-modal interaction framework that combines voice and deictic posture information to create a more natural HRI system. The visual cues are first processed by the object detection model to gain a global understanding of the environment, and then bounding boxes are estimated based on depth information. By using a large language model (LLM) with voice-to-text commands and temporally aligned selected bounding boxes, robot action sequences can be generated, while key control syntax constraints are applied to avoid potential LLM hallucination issues. The system is evaluated on real-world tasks with varying levels of complexity using a Universal Robots UR3e manipulator. Our method demonstrates significantly better performance in HRI in terms of accuracy and robustness. To benefit the research community and the general public, we will make our code and design open-source.

What problem does this paper attempt to address?

The key problem this paper attempts to solve is: **how to design a more natural, intuitive, and easy-to-use multi-modal human-robot interaction (HRI) system for the elderly, so that they can interact with service robots easily and reliably**.

### Specific problems and challenges include:

1. **Limitations of existing HRI systems**: Current HRI systems usually rely on gestures or voice commands, but these methods are not practical for the elderly. For example, complex grammatical structures or sign languages are difficult for the elderly to master.
2. **Challenges in applying LLMs to HRI**: Although large language models (LLMs) perform well in enhancing communication between humans and robots, applying them directly to HRI raises several problems:
   - They require users to input detailed, structured text commands, which may be too complicated for the elderly.
   - They lack integrated perception, so they struggle to understand environmental context or specific actions.
   - They are prone to "hallucinations", that is, generating inaccurate or unsafe responses, which may lead to harmful consequences in control systems.

### Solutions proposed in the paper:

To address these challenges, the authors propose an HRI framework based on natural multi-modal fusion (NMM-HRI). It combines voice and deictic posture information, so that users can compose a series of actions in simple, intuitive language and identify objects or locations by pointing. Specifically:

- **Visual cue processing**: An object detection model first provides a global understanding of the environment, and bounding boxes are then estimated based on depth information.
- **Application of LLM**: A large language model processes voice-to-text commands together with the temporally aligned selected bounding boxes to generate robot action sequences.
- **Control grammar constraints**: Key control grammar constraints are applied to avoid potential LLM hallucination problems.
- **Real-world evaluation**: The system was evaluated with a Universal Robots UR3e manipulator on real-world tasks of varying complexity, where the paper reports significantly better accuracy and robustness.

### Main contributions:

1. Proposed NMM-HRI, a parallel multi-modal HRI method that can efficiently construct complex temporal control sequences.
2. Alleviated the LLM hallucination problem in HRI settings by structuring the output response tokens to ensure safety.
3. Demonstrated better performance than existing HRI methods in benchmark tests, especially in reducing the grammar tokens users must memorize and in achieving fast input speed.

In conclusion, this paper aims to improve the efficiency and reliability of interaction between the elderly and service robots by introducing the NMM-HRI framework, so as to better meet the needs of an aging society.
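
The idea of constraining LLM output with a control grammar can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not give the actual control syntax, so the action names (`pick`, `place`, `move_to`, and so on), the `name(arg)` command format, and the `box_N` target labels are all assumptions made for illustration. The sketch rejects any LLM response line that falls outside a small whitelisted grammar or that refers to a target the user did not select.

```python
import re

# Hypothetical control vocabulary: action name -> number of arguments.
# The paper's real syntax is not given in this summary.
ALLOWED_ACTIONS = {
    "pick": 1,
    "place": 1,
    "move_to": 1,
    "open_gripper": 0,
    "close_gripper": 0,
}

# One command per line, written as name(arg1, arg2, ...).
COMMAND_RE = re.compile(r"^(\w+)\((.*)\)$")


def parse_action_sequence(llm_output, known_targets):
    """Parse LLM text into (action, args) tuples.

    Raises ValueError on the first line that is not a well-formed,
    whitelisted command or that names an unknown target.
    """
    sequence = []
    for raw in llm_output.strip().splitlines():
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines between commands
        match = COMMAND_RE.match(line)
        if match is None:
            raise ValueError(f"not a command: {line!r}")
        action, arg_text = match.group(1), match.group(2)
        args = [a.strip() for a in arg_text.split(",") if a.strip()]
        if action not in ALLOWED_ACTIONS:
            raise ValueError(f"unknown action: {action!r}")
        if len(args) != ALLOWED_ACTIONS[action]:
            raise ValueError(f"wrong argument count for {action!r}")
        for arg in args:
            # Targets must be boxes the user actually selected by pointing.
            if arg not in known_targets:
                raise ValueError(f"unknown target: {arg!r}")
        sequence.append((action, tuple(args)))
    return sequence


if __name__ == "__main__":
    selected = {"box_0", "box_1"}  # hypothetical labels for selected boxes
    reply = "pick(box_0)\nplace(box_1)"
    print(parse_action_sequence(reply, selected))
```

Rejecting the whole response on the first invalid line, rather than skipping that line, is a deliberately conservative choice for this sketch: a partially executed sequence could leave the manipulator in an unintended state.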

NMM-HRI: Natural Multi-modal Human-Robot Interaction with Voice and Deictic Posture via Large Language Model

LaMI: Large Language Models for Multi-Modal Human-Robot Interaction

Design of Kinect-Based Human Robot Interaction Systems for A Robocup Middle Size League Soccer Robot

Multimodal Human-Autonomous Agents Interaction Using Pre-Trained Language and Visual Foundation Models

Large Language Models as Zero-Shot Human Models for Human-Robot Interaction

A Multimodal Emotional Communication Based Humans-Robots Interaction System

MHRC: Closed-loop Decentralized Multi-Heterogeneous Robot Collaboration with Large Language Models

Enhancing the LLM-Based Robot Manipulation Through Human-Robot Collaboration

A Novel Gesture Recognition System for Intelligent Interaction with a Nursing-Care Assistant Robot

Understanding Large-Language Model (LLM)-powered Human-Robot Interaction

Real-Time Multi-modal Human-Robot Collaboration Using Gestures and Speech

VoicePilot: Harnessing LLMs as Speech Interfaces for Physically Assistive Robots

Learning Multimodal Latent Dynamics for Human-Robot Interaction

Enhancing Human–Robot Collaboration through a Multi-Module Interaction Framework with Sensor Fusion: Object Recognition, Verbal Communication, User of Interest Detection, Gesture and Gaze Recognition

Implementation of Engagement Detection for Human–Robot Interaction in Complex Environments

Human–robot interaction via voice-controllable intelligent user interface

Language and Sketching: An LLM-driven Interactive Multimodal Multitask Robot Navigation Framework

Recent advancements in multimodal human-robot interaction

Multimodal Reinforcement Learning for Robots Collaborating with Humans

Probabilistic Multimodal Modeling for Human-Robot Interaction Tasks