Abstract: The pursuit of autonomous driving technology hinges on the sophisticated integration of perception, decision-making, and control systems. Traditional approaches, both data-driven and rule-based, have been hindered by their inability to grasp the nuances of complex driving environments and the intentions of other road users. This has been a significant bottleneck, particularly in developing the common-sense reasoning and nuanced scene understanding necessary for safe and reliable autonomous driving. The advent of Visual Language Models (VLMs) represents a novel frontier in realizing fully autonomous vehicle driving. This report provides an exhaustive evaluation of the latest state-of-the-art VLM, GPT-4V(ision), and its application in autonomous driving scenarios. We explore the model's abilities to understand and reason about driving scenes, make decisions, and ultimately act in the capacity of a driver. Our comprehensive tests span from basic scene recognition to complex causal reasoning and real-time decision-making under varying conditions. Our findings reveal that GPT-4V demonstrates superior performance in scene understanding and causal reasoning compared to existing autonomous systems. It shows the potential to handle out-of-distribution scenarios, recognize intentions, and make informed decisions in real driving contexts. However, challenges remain, particularly in direction discernment, traffic light recognition, vision grounding, and spatial reasoning tasks. These limitations underscore the need for further research and development. The project is available on GitHub: https://github.com/PJLab-ADG/GPT4V-AD-Exploration

Parallel Vision for Long-Tail Regularization: Initial Results from IVFC Autonomous Driving Testing

Unifying Terrain Awareness Through Real-Time Semantic Segmentation

The parallel implementation of 3-D vision processing and understanding for ALV

DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

Visual Point Cloud Forecasting enables Scalable Autonomous Driving

VLP: Vision Language Planning for Autonomous Driving

On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving

Forging Vision Foundation Models for Autonomous Driving: Challenges, Methodologies, and Opportunities

On-Board Vision-Language Models for Personalized Autonomous Vehicle Motion Control: System Design and Real-World Validation

V2X-VLM: End-to-End V2X Cooperative Autonomous Driving Through Large Vision-Language Models

Vision-Driven 2D Supervised Fine-Tuning Framework for Bird's Eye View Perception

Fine-Grained Vehicle Perception via 3D Part-Guided Visual Data Augmentation

VLM-Auto: VLM-based Autonomous Driving Assistant with Human-like Behavior and Understanding for Complex Road Scenes

LiDAR-based End-to-end Temporal Perception for Vehicle-Infrastructure Cooperation

Dynamically Expanding Capacity of Autonomous Driving with Near-Miss Focused Training Framework

VI-eye: semantic-based 3D point cloud registration for infrastructure-assisted autonomous driving

Automated Evaluation of Large Vision-Language Models on Self-driving Corner Cases

Real-to-Virtual Domain Unification for End-to-End Autonomous Driving

Significant Obstacle Location with Ultra-Wide FOV LWIR Stereo Vision System

Regulating Intermediate 3D Features for Vision-Centric Autonomous Driving

LF-VIO: A Visual-Inertial-Odometry Framework for Large Field-of-View Cameras with Negative Plane