Abstract: The advent of foundation models has revolutionized the fields of natural language processing and computer vision, paving the way for their application in autonomous driving (AD). This survey presents a comprehensive review of more than 40 research papers, demonstrating the role of foundation models in enhancing AD. Large language models contribute to planning and simulation in AD, particularly through their proficiency in reasoning, code generation and translation. In parallel, vision foundation models are increasingly adapted for critical tasks such as 3D object detection and tracking, as well as creating realistic driving scenarios for simulation and testing. Multi-modal foundation models, integrating diverse inputs, exhibit exceptional visual understanding and spatial reasoning, crucial for end-to-end AD. This survey not only provides a structured taxonomy, categorizing foundation models based on their modalities and functionalities within the AD domain but also delves into the methods employed in current research. It identifies the gaps between existing foundation models and cutting-edge AD approaches, thereby charting future research directions and proposing a roadmap for bridging these gaps.

What problem does this paper attempt to address?

This paper asks how foundation models can improve key autonomous driving tasks such as perception, prediction, and planning. Specifically, it explores the following aspects:

1. **Limitations of traditional models**:
   - Traditional autonomous driving models usually rely on supervised learning and require large amounts of manually labeled data, which often lacks diversity and limits the models' ability to generalize.
   - Complex heuristic, rule-based planning systems require a great deal of engineering effort and debugging and are difficult to adapt to complex driving scenarios.
2. **Advantages of foundation models**:
   - Pre-training on large-scale web data gives foundation models stronger generalization and reasoning abilities.
   - Vision foundation models can be applied to tasks such as 3D object detection and tracking, and can generate realistic driving scenarios for simulation and testing.
   - Multimodal foundation models integrate multiple kinds of input, provide strong visual understanding and spatial reasoning, and are suitable for end-to-end autonomous driving.
3. **Specific application scenarios**:
   - **Large language models (LLMs)**:
     - In planning and simulation, LLMs can support more intelligent driving decisions through abilities such as reasoning, code generation, and translation.
     - LLMs can understand natural language instructions and execute user commands, improving the user experience.
     - LLMs can also be used in simulation and testing to generate diverse traffic scenarios.
   - **Vision foundation models**:
     - Mainly applied to 3D perception tasks, such as 3D object detection, segmentation, and tracking, and to video generation.
     - Generate realistic virtual driving scenarios for the simulation and testing of autonomous driving.
   - **Multimodal foundation models**:
     - Integrate information from different modalities, such as images, text, and audio, and perform more complex tasks, such as generating text from images and analyzing and reasoning about visual inputs.
4. **Deficiencies of existing research and future directions**:
   - The paper points out that, although foundation models have made progress in autonomous driving, challenges remain: hallucination, latency and efficiency, dependence on the perception system, and the gap between simulated and real environments.
   - It proposes future research directions, including improving model generalization, increasing inference speed, working more closely with the perception system, and narrowing the gap between simulation and reality.

In summary, this paper reviews existing research, proposes a systematic taxonomy, analyzes how foundation models are currently applied in autonomous driving, and identifies open problems and future research directions, with the aim of advancing the field.

A Survey for Foundation Models in Autonomous Driving

Prospective Role of Foundation Models in Advancing Autonomous Vehicles

Forging Vision Foundation Models for Autonomous Driving: Challenges, Methodologies, and Opportunities

Applications of Large Scale Foundation Models for Autonomous Driving

Delving into Multi-modal Multi-task Foundation Models for Road Scene Understanding: From Learning Paradigm Perspectives

Robot Learning in the Era of Foundation Models: A Survey

Vision-and-Language Navigation Today and Tomorrow: A Survey in the Era of Foundation Models

Foundation models in robotics: Applications, challenges, and the future

LLM4Drive: A Survey of Large Language Models for Autonomous Driving

A Survey on Robotics with Foundation Models: toward Embodied AI

Vision Language Models in Autonomous Driving: A Survey and Outlook

AI Foundation Models in Remote Sensing: A Survey

Training and Serving System of Foundation Models: A Comprehensive Survey

Foundation Models for Remote Sensing and Earth Observation: A Survey

Foundation Models for Decision Making: Problems, Methods, and Opportunities

A Survey on Large Language Model-empowered Autonomous Driving

Multimodal Foundation Models: From Specialists to General-Purpose Assistants

A Survey of Resource-efficient LLM and Multimodal Foundation Models

Resource-efficient Algorithms and Systems of Foundation Models: A Survey