ST-P3: End-to-end Vision-based Autonomous Driving via Spatial-Temporal Feature Learning

Shengchao Hu, Li Chen, Penghao Wu, Hongyang Li, Junchi Yan, Dacheng Tao
2022-07-18
Abstract: Many existing autonomous driving paradigms involve a multi-stage discrete pipeline of tasks. To better predict the control signals and enhance user safety, an end-to-end approach that benefits from joint spatial-temporal feature learning is desirable. While there are some pioneering works on LiDAR-based input or implicit design, in this paper we formulate the problem in an interpretable vision-based setting. In particular, we propose a spatial-temporal feature learning scheme, called ST-P3, that yields a set of more representative features for perception, prediction and planning tasks simultaneously. Specifically, an egocentric-aligned accumulation technique is proposed to preserve geometry information in 3D space before the bird's eye view transformation for perception; a dual pathway modeling is devised to take past motion variations into account for future prediction; a temporal-based refinement unit is introduced to compensate for recognizing vision-based elements for planning. To the best of our knowledge, we are the first to systematically investigate each part of an interpretable end-to-end vision-based autonomous driving system. We benchmark our approach against previous state-of-the-art methods on both the open-loop nuScenes dataset and the closed-loop CARLA simulation. The results show the effectiveness of our method. Source code, model and protocol details are made publicly available at https://github.com/OpenPerceptionX/ST-P3.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper attempts to address the inefficiencies and discontinuities in information transmission present in existing multi-stage discrete task pipelines (such as perception, prediction, and planning) in the field of autonomous driving. To better predict control signals and enhance user safety, the researchers aim to build an end-to-end system that optimizes the performance of each task through joint spatial-temporal feature learning. Specifically, the paper proposes a vision-based end-to-end autonomous driving framework **ST-P3** (Spatial-Temporal Feature Learning for Perception, Prediction, and Planning), which aims to:

1. **Improve feature representation**: Generate more representative features for perception, prediction, and planning tasks through spatial-temporal feature learning.
2. **Enhance system interpretability**: Design each module to have higher interpretability and safety.
3. **Reduce dependency on external high-precision maps**: Achieve high-performance autonomous driving systems without relying on high-precision maps.

### Main Contributions

1. **Propose three innovative methods**:
   - **Egocentric Aligned Accumulation**: Retain geometric information when converting features from perspective view to bird's-eye view (BEV).
   - **Dual Pathway Modelling**: Enhance the accuracy of future predictions by incorporating historical information.
   - **Prior-Knowledge Refinement**: Optimize trajectory planning using features from early network stages.
2. **Systematically analyze each part of the end-to-end system**: Provide the first detailed analysis and comparison of a vision-based autonomous driving system, complementing existing LiDAR-based research.
3. **Achieve state-of-the-art performance on multiple benchmark datasets**: Validate on the nuScenes dataset and CARLA simulator, and publicly release the code and protocols.

### Method Overview

1. **Perception Module**:
   - **Egocentric Aligned Accumulation**: Convert multi-view image features to the 3D space of the current ego-vehicle coordinate system and perform cumulative fusion to retain geometric information.
2. **Prediction Module**:
   - **Dual Pathway Modelling**: Combine historical features and future uncertainty distributions to recursively predict future states using a mixture of Gaussian models.
3. **Planning Module**:
   - **Prior-Knowledge Refinement**: Optimize trajectory selection using front-view camera features and high-level commands (such as go straight, turn left, turn right), considering the impact of traffic lights.

### Experimental Results

The paper evaluates the performance of ST-P3 in both open-loop (nuScenes dataset) and closed-loop simulation (CARLA simulator) environments, showing that the method outperforms existing methods across multiple tasks.

In summary, the paper constructs an interpretable and high-performance end-to-end vision-based autonomous driving system through innovative spatial-temporal feature learning methods, providing new insights for the development of autonomous driving technology.
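
The egocentric aligned accumulation step in the perception module can be illustrated with a small sketch. This is not the authors' implementation. It is a simplified 2D version that assumes the features are already BEV grids, that ego-motion between frames is known as a planar pose `(dx, dy, yaw)`, and that past frames are fused by a discounted sum. The function names `warp_to_current` and `accumulate`, the discount factor `alpha`, the nearest-neighbour sampling, and the grid convention (column index along x, row index along y, origin at the grid centre) are all assumptions made for illustration.

```python
import numpy as np


def warp_to_current(feat, dx, dy, yaw, cell=1.0):
    """Resample a past BEV grid of shape (C, H, W) into the current ego frame.

    Assumption: (dx, dy, yaw) is the pose of the past ego frame expressed in
    the current ego frame, so a past point maps as p_cur = R(yaw) @ p_past + t.
    """
    C, H, W = feat.shape
    # Metric coordinates of every target cell centre, origin at grid centre.
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    x_cur = (xs - (W - 1) / 2) * cell
    y_cur = (ys - (H - 1) / 2) * cell

    # Inverse rigid transform: p_past = R(yaw)^T @ (p_cur - t).
    c, s = np.cos(yaw), np.sin(yaw)
    x_rel, y_rel = x_cur - dx, y_cur - dy
    x_past = c * x_rel + s * y_rel
    y_past = -s * x_rel + c * y_rel

    # Back to integer grid indices (nearest neighbour, for simplicity).
    j = np.rint(x_past / cell + (W - 1) / 2).astype(int)
    i = np.rint(y_past / cell + (H - 1) / 2).astype(int)
    valid = (i >= 0) & (i < H) & (j >= 0) & (j < W)

    # Cells whose source falls outside the past grid stay zero.
    out = np.zeros_like(feat)
    out[:, valid] = feat[:, i[valid], j[valid]]
    return out


def accumulate(feats, poses, alpha=0.5):
    """Fuse aligned past grids into the current one with a discounted sum.

    feats: list of (C, H, W) arrays ordered oldest to newest.
    poses: one (dx, dy, yaw) per past frame, relative to the newest frame.
    alpha: hypothetical discount applied once per step of age.
    """
    acc = feats[-1].astype(float).copy()
    n = len(feats)
    for k, (f, pose) in enumerate(zip(feats[:-1], poses)):
        age = n - 1 - k
        acc += alpha ** age * warp_to_current(f.astype(float), *pose)
    return acc
```

For example, a feature located at the centre of a past grid whose pose is one cell behind the current frame along x (`dx=1.0, dy=0.0, yaw=0.0`) ends up one column to the right of the centre after warping, and contributes with weight `alpha` to the accumulated grid. A learned system would more plausibly use differentiable bilinear sampling in place of the nearest-neighbour lookup used here.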