DOROTHIE: Spoken Dialogue for Handling Unexpected Situations in Interactive Autonomous Driving Agents

Ziqiao Ma, Ben VanDerPloeg, Cristian-Paul Bara, Huang Yidong, Eui-In Kim, Felix Gervits, Matthew Marge, Joyce Chai
DOI: https://doi.org/10.48550/arXiv.2210.12511
2022-10-23
Abstract: In the real world, autonomous driving agents navigate in highly dynamic environments full of unexpected situations where pre-trained models are unreliable. In these situations, what is immediately available to vehicles is often only human operators. Empowering autonomous driving agents with the ability to navigate in a continuous and dynamic environment and to communicate with humans through sensorimotor-grounded dialogue becomes critical. To this end, we introduce Dialogue On the ROad To Handle Irregular Events (DOROTHIE), a novel interactive simulation platform that enables the creation of unexpected situations on the fly to support empirical studies on situated communication with autonomous driving agents. Based on this platform, we created the Situated Dialogue Navigation (SDN), a navigation benchmark of 183 trials with a total of 8415 utterances, around 18.7 hours of control streams, and 2.9 hours of trimmed audio. SDN is developed to evaluate the agent's ability to predict dialogue moves from humans as well as generate its own dialogue moves and physical navigation actions. We further developed a transformer-based baseline model for these SDN tasks. Our empirical results indicate that language-guided navigation in a highly dynamic environment is an extremely difficult task for end-to-end models. These results will provide insight towards future work on robust autonomous driving agents. The DOROTHIE platform, SDN benchmark, and code for the baseline model are available at https://github.com/sled-group/DOROTHIE.
Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition, Robotics
What problem does this paper attempt to address?
The problem this paper addresses is how autonomous driving agents can collaborate with humans through dialogue to handle unexpected situations and complete navigation tasks in highly dynamic, unpredictable environments. Specifically, it focuses on how an autonomous vehicle (AV) that encounters a situation its pre-trained models cannot reliably handle can communicate with a human operator in natural language to adjust its goals, paths, and trajectories.

### Problem Background

In the real world, autonomous vehicles must navigate dynamic environments full of uncertainty and unexpected situations. These situations include bad weather, changes in lighting conditions, and the appearance of obstacles, all of which make it difficult for pre-trained models to make reliable decisions. In such cases, the only help that is immediately available is usually a human operator. It is therefore crucial to give autonomous driving agents the ability to interact with humans through dialogue.

### Main Contributions of the Paper

1. **DOROTHIE Platform**: A new high-fidelity simulation platform, Dialogue On the ROad To Handle Irregular Events (DOROTHIE), for creating unexpected situations and studying dialogue interactions in them.
2. **SDN Benchmark Dataset**: The Situated Dialogue Navigation (SDN) benchmark, which contains 183 trials with a total of 8,415 utterances, approximately 18.7 hours of control streams, and 2.9 hours of trimmed audio. The dataset is designed to evaluate an agent's ability to predict human dialogue moves and to generate its own dialogue moves and physical navigation actions in a continuous, dynamic environment.
3. **Baseline Model**: A Transformer-based baseline model, the Temporally-Ordered Task-Oriented Transformer (TOTO), for predicting dialogue moves, generating dialogue responses, and generating navigation actions.

### Main Tasks

The paper defines three key tasks:

1. **Understanding of Dialogue (UfN)**: Predict human dialogue moves and their semantic slots from the dialogue history and environmental information.
2. **Response to Dialogue (RfN)**: Generate appropriate dialogue moves and their semantic slots from the dialogue history and environmental information.
3. **Navigation Based on Dialogue (NfD)**: Generate navigation actions and their parameters from the dialogue history.

### Data Collection and Analysis

Using the DOROTHIE platform, the researchers collected human-machine dialogue data and annotated it at multiple, time-synchronized levels: the first-person view of the environment, speech input, discrete actions, continuous trajectories, and control signals. The dialogue structure was also annotated, and the dialogue moves were analyzed.

### Conclusion

The results show that language-guided navigation in a highly dynamic environment is an extremely difficult task for end-to-end models. These results provide insight for future work on more robust autonomous driving agents.

### Example of Formulas

Some technical details can be expressed as formulas. For example, the input to the TOTO model may be described schematically as:

\[ \text{Input} = [\text{Text Encoding}; \text{Speech Encoding}; \text{Vision Encoding}] + \text{Temporal Encoding} \]

where \(\text{Text Encoding}\), \(\text{Speech Encoding}\), and \(\text{Vision Encoding}\) are the encodings of the text, speech, and vision inputs respectively, and \(\text{Temporal Encoding}\) is the time encoding added to the concatenated sequence.
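
The schematic input formula for TOTO given in this summary can be illustrated with a small sketch. It is not the authors' implementation: the function name `build_input`, the toy two-dimensional vectors, and the reading of `[...; ...; ...]` as concatenation along the sequence axis followed by element-wise addition of a per-position temporal encoding are all assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def build_input(
    text: List[Vector],
    speech: List[Vector],
    vision: List[Vector],
    temporal: List[Vector],
) -> List[Vector]:
    """Hypothetical helper: concatenate modality encodings, then add time encodings.

    Assumes every vector has the same dimension and that `temporal`
    supplies exactly one vector per position of the concatenated sequence.
    """
    # [Text; Speech; Vision]: concatenate along the sequence axis.
    tokens = text + speech + vision
    if len(tokens) != len(temporal):
        raise ValueError("need one temporal encoding per token")

    combined = []
    for token, time_vec in zip(tokens, temporal):
        if len(token) != len(time_vec):
            raise ValueError("token and temporal encoding dimensions differ")
        # "+ Temporal Encoding": element-wise addition at each position.
        combined.append([a + b for a, b in zip(token, time_vec)])
    return combined


if __name__ == "__main__":
    # Toy values chosen only to make the arithmetic easy to follow.
    result = build_input(
        text=[[1.0, 2.0]],
        speech=[[3.0, 4.0]],
        vision=[[5.0, 6.0]],
        temporal=[[0.5, 0.5], [1.0, 1.0], [1.5, 1.5]],
    )
    print(result)
```

In a real model the encodings would be learned, high-dimensional tensors produced by modality-specific encoders; the sketch only shows the shape bookkeeping implied by the formula.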