Abstract: Many existing autonomous driving paradigms involve a multi-stage discrete pipeline of tasks. To better predict the control signals and enhance user safety, an end-to-end approach that benefits from joint spatial-temporal feature learning is desirable. While there are some pioneering works on LiDAR-based input or implicit design, in this paper we formulate the problem in an interpretable vision-based setting. In particular, we propose a spatial-temporal feature learning scheme, called ST-P3, that yields a set of more representative features for perception, prediction and planning tasks simultaneously. Specifically, an egocentric-aligned accumulation technique is proposed to preserve geometry information in 3D space before the bird's eye view transformation for perception; a dual pathway modeling is devised to take past motion variations into account for future prediction; a temporal-based refinement unit is introduced to compensate for recognizing vision-based elements for planning. To the best of our knowledge, we are the first to systematically investigate each part of an interpretable end-to-end vision-based autonomous driving system. We benchmark our approach against previous state-of-the-art methods on both the open-loop nuScenes dataset and the closed-loop CARLA simulation. The results show the effectiveness of our method. Source code, model and protocol details are made publicly available at https://github.com/OpenPerceptionX/ST-P3.
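To make the idea of egocentric alignment more concrete, the following is a minimal sketch and not the authors' implementation. It assumes each frame comes with a hypothetical ego pose `(x, y, yaw)` in a shared world frame, treats features as scalar values attached to 2D points, and sums them into a dictionary-based BEV grid. The names `to_current_ego` and `accumulate` are illustrative only; the paper works with dense multi-camera features in 3D space rather than sparse 2D points.

```python
import math

def to_current_ego(point, past_pose, cur_pose):
    """Map a 2D point from a past ego frame into the current ego frame.

    Poses are hypothetical (x, y, yaw) tuples expressed in a shared world frame.
    """
    px, py = point
    x0, y0, yaw0 = past_pose
    # Past ego frame -> world frame: rotate by yaw0, then translate.
    wx = x0 + px * math.cos(yaw0) - py * math.sin(yaw0)
    wy = y0 + px * math.sin(yaw0) + py * math.cos(yaw0)
    # World frame -> current ego frame: translate, then rotate by -yaw1.
    x1, y1, yaw1 = cur_pose
    dx, dy = wx - x1, wy - y1
    cx = dx * math.cos(yaw1) + dy * math.sin(yaw1)
    cy = -dx * math.sin(yaw1) + dy * math.cos(yaw1)
    return cx, cy

def accumulate(frames, cur_pose, cell=1.0):
    """Sum feature values from all frames into one BEV grid (dict of cells).

    frames: list of (pose, [(point, value), ...]) pairs, one per time step.
    """
    grid = {}
    for pose, points in frames:
        for point, value in points:
            cx, cy = to_current_ego(point, pose, cur_pose)
            key = (math.floor(cx / cell), math.floor(cy / cell))
            grid[key] = grid.get(key, 0.0) + value
    return grid

# A static object seen from two ego positions 2 m apart lands in the same cell.
frames = [
    ((0.0, 0.0, 0.0), [((5.5, 0.5), 1.0)]),  # past frame
    ((2.0, 0.0, 0.0), [((3.5, 0.5), 1.0)]),  # current frame
]
print(accumulate(frames, cur_pose=(2.0, 0.0, 0.0)))  # {(3, 0): 2.0}
```

Aligning before accumulating is what keeps a static object in a single cell; summing the raw per-frame coordinates would smear it along the direction of ego motion.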

What problem does this paper attempt to address?

The paper attempts to address the inefficiencies and discontinuities in information transmission present in existing multi-stage discrete task pipelines (such as perception, prediction, and planning) in the field of autonomous driving. To better predict control signals and enhance user safety, the researchers aim to build an end-to-end system that optimizes the performance of each task through joint spatial-temporal feature learning. Specifically, the paper proposes a vision-based end-to-end autonomous driving framework **ST-P3** (Spatial-Temporal Feature Learning for Perception, Prediction, and Planning), which aims to:

1. **Improve feature representation**: Generate more representative features for perception, prediction, and planning tasks through spatial-temporal feature learning.
2. **Enhance system interpretability**: Design each module to have higher interpretability and safety.
3. **Reduce dependency on external high-precision maps**: Achieve high-performance autonomous driving systems without relying on high-precision maps.

### Main Contributions

1. **Propose three innovative methods**:
   - **Egocentric Aligned Accumulation**: Retain geometric information when converting features from perspective view to bird's-eye view (BEV).
   - **Dual Pathway Modelling**: Enhance the accuracy of future predictions by incorporating historical information.
   - **Prior-Knowledge Refinement**: Optimize trajectory planning using features from early network stages.
2. **Systematically analyze each part of the end-to-end system**: Provide the first detailed analysis and comparison of a vision-based autonomous driving system, complementing existing LiDAR-based research.
3. **Achieve state-of-the-art performance on multiple benchmark datasets**: Validate on the nuScenes dataset and CARLA simulator, and publicly release the code and protocols.

### Method Overview

1. **Perception Module**:
   - **Egocentric Aligned Accumulation**: Convert multi-view image features to the 3D space of the current ego-vehicle coordinate system and perform cumulative fusion to retain geometric information.
2. **Prediction Module**:
   - **Dual Pathway Modelling**: Combine historical features and future uncertainty distributions to recursively predict future states using a mixture of Gaussian models.
3. **Planning Module**:
   - **Prior-Knowledge Refinement**: Optimize trajectory selection using front-view camera features and high-level commands (such as go straight, turn left, turn right), considering the impact of traffic lights.

### Experimental Results

The paper evaluates the performance of ST-P3 in both open-loop (nuScenes dataset) and closed-loop simulation (CARLA simulator) environments, showing that the method outperforms existing methods across multiple tasks.

In summary, the paper constructs an interpretable and high-performance end-to-end vision-based autonomous driving system through innovative spatial-temporal feature learning methods, providing new insights for the development of autonomous driving technology.
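The planning step described above, choosing a trajectory under a high-level command, can be illustrated with a toy cost-based selector. This is a sketch under stated assumptions, not the method from the paper: the hand-written cost terms, the weights `100.0` and `0.1`, the lateral targets of ±3 m, and the function names `trajectory_cost` and `select_trajectory` are all hypothetical, and the front-view refinement and traffic-light handling are omitted entirely.

```python
import math

def trajectory_cost(traj, occupancy, command, cell=1.0):
    """Score one candidate trajectory; lower is better.

    traj: list of (x, y) waypoints in the ego frame (x forward, y left).
    occupancy: set of (ix, iy) BEV cells assumed to be occupied.
    command: 'left', 'straight' or 'right'.
    """
    # Count waypoints that fall into an occupied BEV cell.
    collisions = sum(
        (math.floor(x / cell), math.floor(y / cell)) in occupancy
        for x, y in traj
    )
    # Penalise end points whose lateral offset disagrees with the command.
    target_y = {"left": 3.0, "straight": 0.0, "right": -3.0}[command]
    end_x, end_y = traj[-1]
    command_penalty = abs(end_y - target_y)
    # Reward forward progress slightly so that shorter paths are not favoured.
    return 100.0 * collisions + command_penalty - 0.1 * end_x

def select_trajectory(candidates, occupancy, command):
    """Return the candidate with the lowest cost."""
    return min(candidates, key=lambda t: trajectory_cost(t, occupancy, command))

straight = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
swerve = [(1.0, 0.0), (2.0, 1.5), (3.0, 0.5)]
left = [(1.0, 0.0), (2.0, 1.5), (3.0, 3.0)]
candidates = [straight, swerve, left]

# With an obstacle in cell (2, 0), the straight path collides and the swerve wins.
best = select_trajectory(candidates, occupancy={(2, 0)}, command="straight")
print(best is swerve)  # True
```

In this sketch the collision weight dominates the other terms, so the command only decides among collision-free candidates.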

ST-P3: End-to-end Vision-based Autonomous Driving via Spatial-Temporal Feature Learning

Trajectory-guided Control Prediction for End-to-end Autonomous Driving: A Simple yet Strong Baseline

End-to-end Autonomous Driving Perception with Sequential Latent Representation Learning

TBP-Former: Learning Temporal Bird's-Eye-View Pyramid for Joint Perception and Prediction in Vision-Centric Autonomous Driving

Learning End-to-End Autonomous Steering Model from Spatial and Temporal Visual Cues

SparseDrive: End-to-End Autonomous Driving via Sparse Scene Representation

Deep Steering: Learning End-to-End Driving Model from Spatial and Temporal Visual Cues

DeepGoal: Learning to drive with driving intention from human control demonstration

Enhancing scene understanding based on deep learning for end-to-end autonomous driving

Probabilistic End-to-End Vehicle Navigation in Complex Dynamic Environments with Multimodal Sensor Fusion

End-to-End Autonomous Driving without Costly Modularization and 3D Manual Annotation

PilotAttnNet: Multi-modal Attention Network for End-to-End Steering Control

LiDAR-as-Camera for End-to-End Driving

DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving

DeepDriving: Learning Affordance for Direct Perception in Autonomous Driving

VisionPAD: A Vision-Centric Pre-training Paradigm for Autonomous Driving

HE-Drive: Human-Like End-to-End Driving with Vision Language Models

Project AutoVision: Localization and 3D Scene Perception for an Autonomous Vehicle with a Multi-Camera System

BEVerse: Unified Perception and Prediction in Birds-Eye-View for Vision-Centric Autonomous Driving