Language Prompt for Autonomous Driving

Dongming Wu, Wencheng Han, Tiancai Wang, Yingfei Liu, Xiangyu Zhang, Jianbing Shen
2023-09-08
Abstract: A new trend in the computer vision community is to capture objects of interest following flexible human commands represented by a natural language prompt. However, the progress of using language prompts in driving scenarios is stuck in a bottleneck due to the scarcity of paired prompt-instance data. To address this challenge, we propose the first object-centric language prompt set for driving scenes within 3D, multi-view, and multi-frame space, named NuPrompt. It expands the nuScenes dataset by constructing a total of 35,367 language descriptions, each referring to an average of 5.3 object tracks. Based on the object-text pairs from the new benchmark, we formulate a new prompt-based driving task, i.e., employing a language prompt to predict the described object trajectory across views and frames. Furthermore, we provide a simple end-to-end baseline model based on Transformer, named PromptTrack. Experiments show that our PromptTrack achieves impressive performance on NuPrompt. We hope this work can provide more new insights for the autonomous driving community. Dataset and code will be made public at https://github.com/wudongming97/Prompt4Driving.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses object detection and tracking driven by natural language prompts in autonomous driving scenarios. The computer vision community has made progress in capturing objects of interest from flexible human instructions given in natural language, but applying such prompts to driving scenes has been held back by a scarcity of paired prompt-instance data. To address this, the paper proposes NuPrompt, described as the first object-centric language prompt set for driving scenes in 3D, multi-view, and multi-frame space. NuPrompt extends the nuScenes dataset with 35,367 language descriptions, each referring to an average of 5.3 object tracks. Using these object-text pairs, the authors define a new prompt-based driving task: given a language prompt, predict the trajectories of the described objects across views and frames. The paper also provides PromptTrack, a simple Transformer-based end-to-end baseline for this task. According to the authors, experiments show that PromptTrack achieves impressive performance on NuPrompt by integrating cross-modal features to predict the objects a prompt refers to. The authors hope the work will offer new insights for the autonomous driving community.
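To make the prompt-to-track pairing concrete, here is a minimal Python sketch of the idea that one language prompt refers to several object tracks. Everything in it is an assumption for illustration: the `PromptAnnotation` schema, the helper names, and the toy data are hypothetical and do not reflect NuPrompt's actual file format or the PromptTrack implementation.

```python
from dataclasses import dataclass, field


@dataclass
class PromptAnnotation:
    """One language prompt and the IDs of the object tracks it refers to.

    Hypothetical schema, not the real NuPrompt annotation format.
    """
    prompt: str
    track_ids: set = field(default_factory=set)


def select_described_tracks(tracks, annotation):
    """Keep only the tracks whose ID is referenced by the prompt.

    `tracks` maps a track ID to a list of (frame_index, (x, y, z)) samples.
    """
    return {tid: traj for tid, traj in tracks.items()
            if tid in annotation.track_ids}


def mean_tracks_per_prompt(annotations):
    """Average number of object tracks referenced per prompt."""
    if not annotations:
        return 0.0
    return sum(len(a.track_ids) for a in annotations) / len(annotations)


# Toy data, invented for illustration only.
tracks = {
    1: [(0, (10.0, 2.0, 0.5)), (1, (10.5, 2.0, 0.5))],
    2: [(0, (-3.0, 7.0, 0.4)), (1, (-3.0, 7.2, 0.4))],
    3: [(1, (25.0, -1.0, 0.6))],
}
annotations = [
    PromptAnnotation("the cars in the left lane", {1, 3}),
    PromptAnnotation("the pedestrian crossing the road", {2}),
]

selected = select_described_tracks(tracks, annotations[0])
print(sorted(selected))                     # [1, 3]
print(mean_tracks_per_prompt(annotations))  # 1.5
```

In this toy setup the ground-truth lookup is trivial; in the task the paper defines, a model would have to infer the described tracks from the prompt and multi-view sensor input. The average computed here is the same kind of statistic as the 5.3 tracks per description reported for NuPrompt.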