Simulating User Agents for Embodied Conversational-AI

Daniel Philipov, Vardhan Dongre, Gokhan Tur, Dilek Hakkani-Tür
2024-10-31
Abstract: Embodied agents designed to assist users with tasks must engage in natural language interactions, interpret instructions, execute actions, and communicate effectively to resolve issues. However, collecting large-scale, diverse datasets of situated human-robot dialogues to train and evaluate such agents is expensive, labor-intensive, and time-consuming. To address this challenge, we propose building a large language model (LLM)-based user agent that can simulate user behavior during interactions with an embodied agent in a virtual environment. Given a user goal (e.g., make breakfast), at each time step, the user agent may "observe" the robot actions or "speak" to either intervene with the robot or answer questions. Such a user agent assists in improving the scalability and efficiency of embodied dialogues dataset generation and is critical for enhancing and evaluating the robot's interaction and task completion ability, as well as for research in reinforcement learning using AI feedback. We evaluate our user agent's ability to generate human-like behaviors by comparing its simulated dialogues with the TEACh dataset. We perform three experiments: zero-shot prompting to predict dialogue acts, few-shot prompting, and fine-tuning on the TEACh training subset. Results show the LLM-based user agent achieves an F-measure of 42% with zero-shot prompting and 43.4% with few-shot prompting in mimicking human speaking behavior. Through fine-tuning, performance in deciding when to speak remained stable, while deciding what to say improved from 51.1% to 62.5%. These findings showcase the feasibility of the proposed approach for assessing and enhancing the effectiveness of robot task completion through natural language communication.
Computation and Language, Artificial Intelligence, Robotics
What problem does this paper attempt to address?
This paper asks how to simulate user behavior with a user agent built on a large language model (LLM), so that datasets of interactions with embodied agents can be generated more efficiently and at higher quality. Specifically, it targets the following problems:

1. **High cost and inefficiency of dataset collection**: Collecting large-scale, diverse datasets of situated human-robot dialogues for training and evaluating embodied agents is expensive, labor-intensive, and time-consuming.
2. **Improving the task-completion ability and interaction quality of embodied agents**: A simulated user helps improve and evaluate how well an embodied agent interacts while completing a task, that is, whether it interprets user instructions correctly and communicates effectively.
3. **Enabling future research**: The approach offers a feasible way to assess and enhance the effectiveness of robot task completion through natural-language communication, for example in reinforcement learning using AI feedback.

To address these problems, the authors propose an LLM-based user agent that simulates user behavior in a virtual environment and interacts with an embodied agent. Given a user goal, at each time step the user agent can either "observe" the robot's actions or "speak", whether to intervene in the robot's behavior or to answer its questions. In this way, embodied-dialogue datasets can be generated more efficiently and can support future embodied-AI research.

### Specific implementation methods

- **Zero-shot prompting**: Prompt the LLM to predict the user's dialogue acts without any worked examples.
- **Few-shot prompting**: Add a small number of examples to the prompt to guide the LLM's predictions.
- **Fine-tuning**: Fine-tune the LLM on the TEACh training subset to improve its accuracy in predicting user behavior.
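The difference between zero-shot and few-shot prompting for the observe-or-speak decision can be illustrated with a minimal sketch. The prompt wording, the `Turn` type, and the helper names `build_prompt` and `parse_decision` are all hypothetical and are not taken from the paper. The sketch only builds the prompt and parses a reply, so it assumes some separate LLM call that returns text.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical labels for the two per-step choices described in the summary.
ACTIONS = ("OBSERVE", "SPEAK")


@dataclass
class Turn:
    speaker: str  # "robot" or "user"
    content: str  # an action description or an utterance


def format_history(history: List[Turn]) -> str:
    """Render the interaction so far as one line per turn."""
    return "\n".join(f"{t.speaker}: {t.content}" for t in history)


def build_prompt(
    goal: str,
    history: List[Turn],
    examples: Optional[List[Tuple[List[Turn], str]]] = None,
) -> str:
    """Build a zero-shot prompt (no examples) or a few-shot prompt (with examples)."""
    parts = [
        "You are simulating a human user supervising a household robot.",
        f"User goal: {goal}",
        "At each time step, reply OBSERVE to stay silent or SPEAK to talk to the robot.",
    ]
    # Few-shot: prepend labeled demonstrations; zero-shot: this loop is skipped.
    for ex_history, ex_decision in examples or []:
        parts.append(
            "Example:\n" + format_history(ex_history) + f"\nDecision: {ex_decision}"
        )
    parts.append("Current interaction:\n" + format_history(history) + "\nDecision:")
    return "\n\n".join(parts)


def parse_decision(reply: str) -> str:
    """Map a free-text model reply to one of ACTIONS, defaulting to OBSERVE."""
    token = reply.strip().upper()
    for action in ACTIONS:
        if token.startswith(action):
            return action
    return "OBSERVE"
```

Calling `build_prompt("make breakfast", history)` gives the zero-shot variant, while passing a list of `(history, decision)` pairs as `examples` gives the few-shot variant. Defaulting to `OBSERVE` on an unparseable reply is a design choice of this sketch, made so that a malformed reply never produces a spurious utterance.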
### Experimental results

The experiments, which compare the simulated dialogues with the TEACh dataset, indicate that an LLM-based user agent can feasibly imitate human speaking behavior:

- With zero-shot prompting, the F-measure reaches 42%.
- With few-shot prompting, the F-measure reaches 43.4%.
- With fine-tuning, performance in deciding when to speak remained stable, while performance in deciding what to say improved from 51.1% to 62.5%.

These results demonstrate the potential of this method for assessing and enhancing the effectiveness of embodied agents in completing tasks through natural-language communication.
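For readers unfamiliar with the metric, the F-measure is the harmonic mean of precision and recall. The sketch below computes it for the per-step decision, treating "SPEAK" as the positive class. This is an assumption made for illustration: the summary does not state which class or averaging scheme the authors used, and the label strings and the function name `f_measure` are hypothetical.

```python
from typing import Sequence


def f_measure(
    gold: Sequence[str], predicted: Sequence[str], positive: str = "SPEAK"
) -> float:
    """Binary F1 score: harmonic mean of precision and recall for `positive`."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    pairs = list(zip(gold, predicted))
    # True positives: the human spoke and the simulated user also spoke.
    tp = sum(g == positive and p == positive for g, p in pairs)
    # False positives: the simulated user spoke where the human stayed silent.
    fp = sum(g != positive and p == positive for g, p in pairs)
    # False negatives: the simulated user stayed silent where the human spoke.
    fn = sum(g == positive and p != positive for g, p in pairs)

    if tp == 0:
        return 0.0  # Precision and recall are both zero or undefined.

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For example, with gold decisions `["SPEAK", "OBSERVE", "SPEAK", "OBSERVE"]` and predictions `["SPEAK", "SPEAK", "OBSERVE", "OBSERVE"]`, precision and recall are both 0.5, so the F-measure is 0.5.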