Abstract: An accurate model of the environment and the dynamic agents acting in it offers great potential for improving motion planning. We present MILE: a Model-based Imitation LEarning approach to jointly learn a model of the world and a policy for autonomous driving. Our method leverages 3D geometry as an inductive bias and learns a highly compact latent space directly from high-resolution videos of expert demonstrations. Our model is trained on an offline corpus of urban driving data, without any online interaction with the environment. MILE improves upon prior state-of-the-art by 31% in driving score on the CARLA simulator when deployed in a completely new town and new weather conditions. Our model can predict diverse and plausible states and actions, which can be interpretably decoded to bird's-eye view semantic segmentation. Further, we demonstrate that it can execute complex driving manoeuvres from plans entirely predicted in imagination. Our approach is the first camera-only method that models static scene, dynamic scene, and ego-behaviour in an urban driving environment. The code and model weights are available at https://github.com/wayveai/mile.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper addresses motion planning for autonomous vehicles in urban driving. Specifically, the authors propose a model-based imitation learning method (MILE) that jointly learns a world model and a driving policy. The main objectives of the paper include:

1. **Improving Driving Performance**: MILE improves the driving score by 31% over the prior state of the art on the CARLA simulator when deployed in a new town and new weather conditions.
2. **Handling Complex Visual Inputs**: MILE processes high-resolution images of urban driving scenes without any online environment interaction or reward signals.
3. **Multimodal Future Prediction**: MILE can predict diverse and plausible future states and actions, decoding them into bird's-eye view semantic segmentation for interpretability and visualization.
4. **Complex Driving Maneuvers**: MILE can execute complex driving maneuvers, such as navigating roundabouts or avoiding motorcyclists, from plans predicted entirely in imagination.

### Main Contributions

1. **Novel Model Architecture**: MILE leverages 3D geometry as an inductive bias and scales to the visual complexity of autonomous driving in urban environments. The method is trained using only offline expert driving data, without online interaction with the environment or access to reward signals, showing potential for practical applications.
2. **New Performance Standards**: MILE outperforms other methods on the CARLA simulator, including those requiring LiDAR input.
3. **Multimodal Future Prediction**: MILE can predict diverse and plausible future states and actions, and can execute complex driving maneuvers from fully predicted plans.

### Related Work

- **Imitation Learning**: Early autonomous driving methods primarily used modular frameworks, with each module addressing a specific task. Recently, end-to-end self-driving systems have shown potential to improve driving performance by predicting driving commands from high-dimensional observations. MILE further develops this by using 3D geometry and offline data for training.
- **3D Scene Representation**: Successful autonomous driving planning requires understanding and reasoning about 3D scenes, which is challenging with monocular cameras. MILE addresses this by lifting image features to 3D and pooling them into bird's-eye view representations.
- **World Models**: Model-based methods have been explored mainly in reinforcement learning settings with great success. MILE learns policies directly from offline datasets by learning the latent dynamics of the world from image observations, without needing access to reward functions.
- **Trajectory Prediction**: The goal of trajectory prediction is to estimate the future trajectories of dynamic agents using past physical states and scene context. MILE models not only dynamic scenes but also static scenes and ego-behaviour, without accessing ground-truth physical states or offline HD maps.

### Method Overview

The core of MILE is its model-based imitation learning architecture, which jointly controls the autonomous vehicle and models the world and its dynamics. Specific steps include:

1. **Generative Model**: Defines a generative model by introducing latent variables to model temporal dynamics.
2. **Variational Inference**: Introduces variational distributions to infer latent variables, maximizing a lower bound on the marginal likelihood of observed data.
3. **Inference Network**: Parameterizes the variational distribution to estimate the posterior distribution.
4. **Generative Network**: Parameterizes the generative model to estimate the prior distribution and the distribution of observations, bird's-eye view segmentation, and actions.
5. **Imagination of Future States and Actions**: Infers actions through the learned policy, predicts future deterministic states, and samples from the prior distribution to generate future sequences in the latent space.

### Experimental Results

- **Driving Performance**: On the CARLA simulator, in a new town and new weather conditions, MILE outperforms the prior state of the art by 31% in driving score.
- **Ablation Studies**: Validates the importance of 3D geometry and probabilistic modeling by comparing the impact of different design decisions.
- **Fully Recurrent Inference in Closed-Loop Driving**: Demonstrates MILE's performance in closed-loop driving, showing that the fully recurrent method is not only comparable in performance to the reset-state method but also more computationally efficient.
- **Long-Horizon, Multimodal Future Prediction**: MILE can predict diverse and plausible future states and actions, decoding them into bird's-eye view semantic segmentation for interpretability and visualization.
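
The imagination step listed in the method overview (act with the learned policy, advance the deterministic state, sample the next stochastic state from the prior) can be illustrated with a small rollout loop. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the learned networks are replaced by random linear maps, and every name in it (`make_toy_model`, `imagine`, the weight keys, and the state dimensions) is hypothetical. A diagonal Gaussian prior is also assumed for illustration.

```python
import numpy as np


def make_toy_model(h_dim=8, s_dim=4, a_dim=2, seed=0):
    """Random linear maps standing in for the learned networks (hypothetical)."""
    rng = np.random.default_rng(seed)
    scale = 0.1
    return {
        # deterministic transition: (h, s) -> next h
        "W_h": scale * rng.standard_normal((h_dim, h_dim + s_dim)),
        # prior over the next stochastic state: (h, a) -> mean and log-std
        "W_mu": scale * rng.standard_normal((s_dim, h_dim + a_dim)),
        "W_logsig": scale * rng.standard_normal((s_dim, h_dim + a_dim)),
        # policy: (h, s) -> action
        "W_pi": scale * rng.standard_normal((a_dim, h_dim + s_dim)),
    }


def imagine(params, h, s, horizon, rng):
    """Roll the latent state forward without using any observations."""
    actions, states = [], []
    for _ in range(horizon):
        hs = np.concatenate([h, s])
        # 1) infer the action from the current latent state via the policy
        a = np.tanh(params["W_pi"] @ hs)
        # 2) predict the next deterministic state
        h = np.tanh(params["W_h"] @ hs)
        # 3) sample the next stochastic state from the (assumed Gaussian) prior
        ha = np.concatenate([h, a])
        mu = params["W_mu"] @ ha
        sigma = np.exp(params["W_logsig"] @ ha)
        s = mu + sigma * rng.standard_normal(mu.shape)
        actions.append(a)
        states.append((h, s))
    return np.stack(actions), states


# Usage: two rollouts from the same start state with different noise seeds.
params = make_toy_model()
h0, s0 = np.zeros(8), np.ones(4)
plan_a, _ = imagine(params, h0, s0, horizon=5, rng=np.random.default_rng(1))
plan_b, _ = imagine(params, h0, s0, horizon=5, rng=np.random.default_rng(2))
print(plan_a.shape, plan_b.shape)
```

Because the next stochastic state is sampled rather than predicted as a point estimate, rollouts that start from the same latent state diverge after the first step, which is the mechanism behind the diverse futures described in the summary. In the full model, each imagined state would additionally be decoded to bird's-eye view segmentation; that decoder is omitted here.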

Model-Based Imitation Learning for Urban Driving

Imitation Learning of Hierarchical Driving Model: from Continuous Intention to Continuous Trajectory

Hierarchical Model-Based Imitation Learning for Planning in Autonomous Driving

Deep Imitation Learning for Autonomous Driving in Generic Urban Scenarios with Enhanced Safety

Hybrid Imitation-Learning Motion Planner for Urban Driving

CCIL: Context-conditioned imitation learning for urban driving

Yaw-Guided Imitation Learning for Autonomous Driving in Urban Environments

Safe Imitation Learning on Real-Life Highway Data for Human-like Autonomous Driving

Iterative Imitation Policy Improvement for Interactive Autonomous Driving

LASIL: Learner-Aware Supervised Imitation Learning For Long-term Microscopic Traffic Simulation

Learning Hierarchical Behavior and Motion Planning for Autonomous Driving.

MRIC: Model-Based Reinforcement-Imitation Learning with Mixture-of-Codebooks for Autonomous Driving Simulation

Dynamic Conditional Imitation Learning for Autonomous Driving

Guided Policy Search Model-based Reinforcement Learning for Urban Autonomous Driving

Conditional Affordance Learning for Driving in Urban Environments

MPC-based Imitation Learning for Safe and Human-like Autonomous Driving

Imitation Is Not Enough: Robustifying Imitation with Reinforcement Learning for Challenging Driving Scenarios

Policy-Based Reinforcement Learning for Training Autonomous Driving Agents in Urban Areas With Affordance Learning

End-to-End Learning of Driving Models with Surround-View Cameras and Route Planners

Evaluation of MPC-based Imitation Learning for Human-like Autonomous Driving

Learning to Drive from a World on Rails