Coherence-Driven Multimodal Safety Dialogue with Active Learning for Embodied Agents

Sabit Hassan, Hye-Young Chung, Xiang Zhi Tan, Malihe Alikhani

2024-10-18

Abstract: When assisting people in daily tasks, robots need to accurately interpret visual cues and respond effectively in diverse safety-critical situations, such as sharp objects on the floor. In this context, we present M-CoDAL, a multimodal-dialogue system specifically designed for embodied agents to better understand and communicate in safety-critical situations. The system leverages discourse coherence relations to enhance its contextual understanding and communication abilities. To train this system, we introduce a novel clustering-based active learning mechanism that utilizes an external Large Language Model (LLM) to identify informative instances. Our approach is evaluated using a newly created multimodal dataset comprising 1K safety violations extracted from 2K Reddit images. These violations are annotated using a Large Multimodal Model (LMM) and verified by human annotators. Results with this dataset demonstrate that our approach improves resolution of safety situations, user sentiment, as well as safety of the conversation. Next, we deploy our dialogue system on a Hello Robot Stretch robot and conduct a within-subject user study with real-world participants. In the study, participants role-play two safety scenarios with different levels of severity with the robot and receive interventions from our model and a baseline system powered by OpenAI's ChatGPT. The study results corroborate and extend the findings from automated evaluation, showing that our proposed system is more persuasive and competent in a real-world embodied agent setting.

Robotics, Computation and Language

What problem does this paper attempt to address?

This paper addresses how robots can accurately interpret visual cues and respond effectively in safety-critical situations while assisting people with daily tasks. It proposes M-CoDAL, a multimodal dialogue system for embodied agents that uses discourse coherence relations to improve contextual understanding and communication in such situations. To train the system, the authors introduce a clustering-based active learning mechanism that uses an external large language model (LLM) to identify informative instances. The study also builds a new multimodal dataset of 1K safety violations extracted from 2K Reddit images, annotated by a large multimodal model (LMM) and verified by human annotators. Automated evaluation on this dataset shows improvements in the resolution of safety situations, user sentiment, and the safety of the conversation. Finally, the authors deploy the dialogue system on a Hello Robot Stretch robot and run a within-subject user study in which participants role-play two safety scenarios of differing severity; compared with a baseline powered by OpenAI's ChatGPT, the proposed system is found to be more persuasive and competent.
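The summary above names a clustering-based active learning mechanism but gives no algorithmic detail, so the following is only a generic sketch of that family of methods, not the authors' procedure. It assumes each unlabeled instance already has an embedding vector, groups the embeddings with plain k-means, and then picks the highest-scoring instance from each cluster for annotation. The `score` callable is a hypothetical stand-in for whatever informativeness judgment an external LLM would provide; the function names and parameters are likewise illustrative.

```python
import random
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: Sequence[Point], k: int, iterations: int = 20, seed: int = 0) -> List[int]:
    """Plain k-means; returns one cluster index per point."""
    rng = random.Random(seed)
    centroids = rng.sample(list(points), k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels


def select_for_annotation(
    points: Sequence[Point],
    score: Callable[[int], float],  # hypothetical: stand-in for an LLM-based judgment
    k: int,
    per_cluster: int = 1,
) -> List[int]:
    """Return indices of the top-scoring instances in each cluster."""
    labels = kmeans(points, k)
    selected: List[int] = []
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        members.sort(key=score, reverse=True)
        selected.extend(members[:per_cluster])
    return sorted(selected)


# Toy data: two well-separated groups of 2-D "embeddings" with made-up scores.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
scores = [0.1, 0.9, 0.3, 0.2, 0.4, 0.8]
chosen = select_for_annotation(points, lambda i: scores[i], k=2)
print(chosen)
```

Sampling from every cluster, rather than taking the globally top-scoring instances, is one common way to keep the annotated set diverse; whether M-CoDAL balances diversity and informativeness in this way is not stated in the text above.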

Multimodal Reinforcement Learning for Robots Collaborating with Humans

Multimodal Situational Safety

Human-Robot Dialogue Annotation for Multi-Modal Common Ground

Chat with the Environment: Interactive Multimodal Perception Using Large Language Models

Building Cooperative Embodied Agents Modularly with Large Language Models

SafeEmbodAI: a Safety Framework for Mobile Robots in Embodied AI Systems

CoCA: Regaining Safety-awareness of Multimodal Large Language Models with Constitutional Calibration

CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference

Conversational Language Models for Human-in-the-Loop Multi-Robot Coordination

When Robots Get Chatty: Grounding Multimodal Human-Robot Conversation and Collaboration

Multimodal Human-Autonomous Agents Interaction Using Pre-Trained Language and Visual Foundation Models

TalkWithMachines: Enhancing Human-Robot Interaction for Interpretable Industrial Robotics Through Large/Vision Language Models

Multimodal Activation: Awakening Dialog Robots Without Wake Words

Using Multimodal Large Language Models (MLLMs) for Automated Detection of Traffic Safety-Critical Events

ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation

Simulating User Agents for Embodied Conversational-AI

Multi-modal open-domain dialogue

LaMI: Large Language Models for Multi-Modal Human-Robot Interaction

BadRobot: Manipulating Embodied LLMs in the Physical World

Athena: Safe Autonomous Agents with Verbal Contrastive Learning