Abstract: In the real world, autonomous driving agents navigate in highly dynamic environments full of unexpected situations where pre-trained models are unreliable. In these situations, what is immediately available to vehicles is often only human operators. Empowering autonomous driving agents with the ability to navigate in a continuous and dynamic environment and to communicate with humans through sensorimotor-grounded dialogue becomes critical. To this end, we introduce Dialogue On the ROad To Handle Irregular Events (DOROTHIE), a novel interactive simulation platform that enables the creation of unexpected situations on the fly to support empirical studies on situated communication with autonomous driving agents. Based on this platform, we created the Situated Dialogue Navigation (SDN), a navigation benchmark of 183 trials with a total of 8415 utterances, around 18.7 hours of control streams, and 2.9 hours of trimmed audio. SDN is developed to evaluate the agent's ability to predict dialogue moves from humans as well as generate its own dialogue moves and physical navigation actions. We further developed a transformer-based baseline model for these SDN tasks. Our empirical results indicate that language-guided navigation in a highly dynamic environment is an extremely difficult task for end-to-end models. These results will provide insight towards future work on robust autonomous driving agents. The DOROTHIE platform, SDN benchmark, and code for the baseline model are available at https://github.com/sled-group/DOROTHIE.

What problem does this paper attempt to address?

This paper addresses how autonomous driving agents can collaborate with humans through dialogue to handle unexpected situations and complete navigation tasks in highly dynamic, unpredictable environments. Specifically, it studies how an autonomous vehicle (AV) that encounters a situation its pre-trained models cannot handle reliably can communicate with a human operator in natural language to adjust its goals, paths, and trajectories.

### Problem Background

In the real world, autonomous vehicles must navigate dynamic environments full of uncertainty and unexpected situations. Examples include bad weather, changes in lighting conditions, and the sudden appearance of obstacles, all of which make it difficult for pre-trained models to make reliable decisions. In these moments, the only help immediately available is usually a human operator. It is therefore crucial to give autonomous driving agents the ability to interact with humans through dialogue.

### Main Contributions of the Paper

1. **DOROTHIE Platform**: A new high-fidelity interactive simulation platform, Dialogue On the ROad To Handle Irregular Events (DOROTHIE), for creating unexpected situations on the fly and studying dialogue interactions in them.
2. **SDN Benchmark Dataset**: The Situated Dialogue Navigation (SDN) benchmark, which contains 183 trials with a total of 8,415 utterances, approximately 18.7 hours of control streams, and 2.9 hours of trimmed audio. It is designed to evaluate an agent's ability to predict human dialogue moves and to generate its own dialogue moves and physical navigation actions in a continuous, dynamic environment.
3. **Baseline Model**: A Transformer-based baseline model, the Temporally-Ordered Task-Oriented Transformer (TOTO), for predicting dialogue moves, generating dialogue responses, and generating navigation actions.

### Main Tasks

The paper defines three key tasks:

1. **Understanding of Dialogue (UfN)**: Predict human dialogue moves and their semantic slots based on the dialogue history and environmental information.
2. **Response to Dialogue (RfN)**: Generate appropriate dialogue moves and their semantic slots based on the dialogue history and environmental information.
3. **Navigation Based on Dialogue (NfD)**: Generate navigation actions and their parameters based on the dialogue history.

### Data Collection and Analysis

Using the DOROTHIE platform, the researchers collected human-machine dialogue data and annotated it at multiple time-synchronized levels: the first-person view of the environment, speech input, discrete actions, continuous trajectories, and control signals. They also annotated the dialogue structure and analyzed the dialogue moves.

### Conclusion

The results show that language-guided navigation in highly dynamic environments is an extremely difficult task for end-to-end models. These results provide insight for future work on more robust autonomous driving agents.

### Example of Formulas

Some technical details can be expressed as formulas. For example, the input to the TOTO model may be described as:

\[ \text{Input} = [\text{Text Encoding}; \text{Speech Encoding}; \text{Vision Encoding}] + \text{Temporal Encoding} \]

where \(\text{Text Encoding}\), \(\text{Speech Encoding}\), and \(\text{Vision Encoding}\) are the encodings of the text, speech, and vision inputs respectively, and \(\text{Temporal Encoding}\) is the time encoding.
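The input formula for the baseline model can be made concrete with a small sketch. The code below is not the TOTO implementation; it only illustrates the formula's shape, concatenating per-modality encodings and adding a time encoding to each position. The names `Token`, `temporal_encoding`, and `build_input` are hypothetical, the sinusoidal form of the time encoding is an assumption, and all modality encodings are assumed to be already projected to a shared width.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    """One encoded input element (hypothetical structure for illustration)."""
    modality: str          # "text", "speech", or "vision"
    timestamp: float       # seconds since the start of the trial
    encoding: List[float]  # modality encoding, assumed projected to a shared width


def temporal_encoding(timestamp: float, dim: int) -> List[float]:
    """Sinusoidal encoding of a timestamp (an assumed choice, not from the paper)."""
    enc = []
    for i in range(dim):
        # Each pair of dimensions shares one frequency: sin on even, cos on odd.
        freq = 1.0 / (10000 ** (2 * (i // 2) / dim))
        angle = timestamp * freq
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc


def build_input(text: List[Token], speech: List[Token],
                vision: List[Token]) -> List[List[float]]:
    """Input = [Text; Speech; Vision] + Temporal Encoding, row by row."""
    tokens = list(text) + list(speech) + list(vision)  # sequence concatenation
    if not tokens:
        return []
    dim = len(tokens[0].encoding)
    if any(len(t.encoding) != dim for t in tokens):
        raise ValueError("all encodings must share the same width")
    return [
        [x + p for x, p in zip(t.encoding, temporal_encoding(t.timestamp, dim))]
        for t in tokens
    ]


if __name__ == "__main__":
    rows = build_input(
        [Token("text", 0.0, [0.0] * 4)],
        [Token("speech", 1.2, [0.1] * 4)],
        [Token("vision", 2.5, [0.5] * 4)],
    )
    print(len(rows), len(rows[0]))  # number of positions, shared width
```

Adding the time encoding rather than concatenating it keeps the width of every position unchanged, which is the usual reason for this design in Transformer inputs. Whether the actual model uses this exact form is not stated in the summary above.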

DOROTHIE: Spoken Dialogue for Handling Unexpected Situations in Interactive Autonomous Driving Agents
