Abstract: Embodied agents designed to assist users with tasks must engage in natural language interactions, interpret instructions, execute actions, and communicate effectively to resolve issues. However, collecting large-scale, diverse datasets of situated human-robot dialogues to train and evaluate such agents is expensive, labor-intensive, and time-consuming. To address this challenge, we propose building a large language model (LLM)-based user agent that can simulate user behavior during interactions with an embodied agent in a virtual environment. Given a user goal (e.g., make breakfast), at each time step, the user agent may "observe" the robot's actions or "speak" to either intervene with the robot or answer questions. Such a user agent assists in improving the scalability and efficiency of embodied dialogue dataset generation and is critical for enhancing and evaluating the robot's interaction and task completion ability, as well as for research in reinforcement learning using AI feedback. We evaluate our user agent's ability to generate human-like behaviors by comparing its simulated dialogues with the TEACh dataset. We perform three experiments: zero-shot prompting to predict dialogue acts, few-shot prompting, and fine-tuning on the TEACh training subset. Results show the LLM-based user agent achieves an F-measure of 42% with zero-shot prompting and 43.4% with few-shot prompting in mimicking human speaking behavior. Through fine-tuning, performance in deciding when to speak remained stable, while deciding what to say improved from 51.1% to 62.5%. These findings showcase the feasibility of the proposed approach for assessing and enhancing the effectiveness of robot task completion through natural language communication.

What problem does this paper attempt to address?

The paper asks how an LLM-based user agent can simulate user behavior so that datasets of interactions with embodied agents can be generated more efficiently. Specifically, it targets the following problems:

1. **High cost and inefficiency of dataset collection**: Collecting large-scale, diverse datasets of situated human-robot dialogues for training and evaluating embodied agents is expensive, labor-intensive, and time-consuming.
2. **Improving the task-completion and interaction ability of embodied agents**: A simulated user helps improve and evaluate how well an embodied agent interacts during task completion, including how it interprets user instructions and communicates to resolve issues.
3. **Enabling future research**: The approach offers a feasible way to assess and enhance the effectiveness of robot task completion through natural-language communication, and it supports research on reinforcement learning using AI feedback.

To address these problems, the authors propose an LLM-based user agent that simulates user behavior in a virtual environment while interacting with an embodied agent. Given a user goal, at each time step the user agent can either "observe" the robot's actions or "speak" to intervene in the robot's behavior or answer the robot's questions. In this way, embodied-dialogue datasets can be generated more efficiently.

### Specific implementation methods

- **Zero-shot and few-shot prompting**: Prompt the LLM with no examples (zero-shot) or with a small number of examples (few-shot) to predict user behavior.
- **Fine-tuning**: Fine-tune the LLM on the TEACh training subset to improve its accuracy in predicting user behavior.
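
The per-step "observe" or "speak" decision and the difference between zero-shot and few-shot prompting can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the prompt wording, the `OBSERVE`/`SPEAK` reply format, the function names, and the `llm` callable, which stands in for whatever model is queried.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical action labels; the source only names "observe" and "speak".
OBSERVE, SPEAK = "OBSERVE", "SPEAK"


def build_prompt(goal: str, history: List[str],
                 examples: Optional[List[Tuple[str, str]]] = None) -> str:
    """Assemble a zero-shot prompt, or a few-shot one when examples are given."""
    parts = [
        "You are a user supervising a household robot.",
        f"Goal: {goal}",
        f"Reply with {OBSERVE}, or with {SPEAK}: <utterance>.",
    ]
    # Few-shot: prepend (history, reply) demonstrations. Zero-shot: none.
    for context, reply in examples or []:
        parts.append(f"Example history:\n{context}\nExample reply: {reply}")
    parts.append("History:\n" + "\n".join(history))
    parts.append("Reply:")
    return "\n\n".join(parts)


def parse_reply(reply: str) -> Tuple[str, Optional[str]]:
    """Map raw model text to (action, utterance); default to observing."""
    text = reply.strip()
    if text.upper().startswith(SPEAK):
        utterance = text[len(SPEAK):].lstrip(": ").strip()
        return SPEAK, utterance or None
    return OBSERVE, None


def user_agent_step(llm: Callable[[str], str], goal: str, history: List[str],
                    examples: Optional[List[Tuple[str, str]]] = None
                    ) -> Tuple[str, Optional[str]]:
    """One time step: build the prompt, query the model, parse its decision."""
    return parse_reply(llm(build_prompt(goal, history, examples)))
```

Passing `examples=None` gives the zero-shot variant, and passing a short list of demonstrations gives the few-shot variant. A fine-tuned model would plug in through the same `llm` callable.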
### Experimental results

The experiments compare the user agent's simulated dialogues with the TEACh dataset and indicate that an LLM-based user agent can feasibly imitate human speaking behavior. Specifically:

- The F-measure under zero-shot prompting is 42%.
- The F-measure under few-shot prompting is 43.4%.
- After fine-tuning, performance in deciding when to speak remained stable, while performance in deciding what to say improved from 51.1% to 62.5%.

These results support the feasibility of the approach for assessing and enhancing the effectiveness of robot task completion through natural-language communication.
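
As a reminder of what an F-measure reports, the sketch below computes it for per-step speak/observe decisions. It assumes a binary setup with "speak" as the positive class; the source does not say how the scores were aggregated, so this is one plausible reading rather than the paper's evaluation code, and the function name and labels are hypothetical.

```python
from typing import Sequence


def f_measure(gold: Sequence[str], pred: Sequence[str],
              positive: str = "speak") -> float:
    """Harmonic mean of precision and recall for the `positive` label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    if tp == 0:
        return 0.0  # no correct positive predictions, so F is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Human decisions (gold) versus simulated decisions (pred) at four time steps.
gold = ["speak", "observe", "speak", "observe"]
pred = ["speak", "speak", "observe", "observe"]
score = f_measure(gold, pred)
```

In this toy case the agent speaks once when the human did, once when the human did not, and misses one human utterance, so precision and recall are both 0.5.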

Simulating User Agents for Embodied Conversational-AI

Dialogue Learning with Human-in-the-Loop

Learning through Dialogue Interactions by Asking Questions

TEACh: Task-Driven Embodied Agents That Chat

Large Language Models as User-Agents for Evaluating Task-Oriented-Dialogue Systems

More than Chit-Chat: Developing Robots for Small-Talk Interactions

Real or Robotic? Assessing Whether LLMs Accurately Simulate Qualities of Human Responses in Dialogue

A User Simulator for Task-Completion Dialogues

LLM Roleplay: Simulating Human-Chatbot Interaction

User Simulation with Large Language Models for Evaluating Task-Oriented Dialogue

Synthetic Dialogue Dataset Generation using LLM Agents

Reliable LLM-based User Simulator for Task-Oriented Dialogue Systems

Continual Skill and Task Learning via Dialogue

Multimodal Human-Autonomous Agents Interaction Using Pre-Trained Language and Visual Foundation Models

Open-Ended Instructable Embodied Agents with Memory-Augmented Large Language Models

DialSim: A Real-Time Simulator for Evaluating Long-Term Multi-Party Dialogue Understanding of Conversational Agents

Bootstrapping LLM-based Task-Oriented Dialogue Agents via Self-Talk

Hello Again! LLM-powered Personalized Agent for Long-term Dialogue

LLM-Mediated Domain-Specific Voice Agents: The Case of TextileBot

Understanding Large-Language Model (LLM)-powered Human-Robot Interaction