GPT-4V as Traffic Assistant: An In-depth Look at Vision Language Model on Complex Traffic Events

Xingcheng Zhou, Alois C. Knoll
2024-02-07
Abstract: The recognition and understanding of traffic incidents, particularly traffic accidents, is a topic of paramount importance in intelligent transportation systems and intelligent vehicles. This area has continually attracted extensive attention from both academia and industry. Identifying and comprehending complex traffic events is highly challenging, primarily due to the intricate nature of traffic environments, diverse observational perspectives, and the multifaceted causes of accidents. These factors have persistently impeded the development of effective solutions. The advent of large vision-language models (VLMs), such as GPT-4V, has introduced innovative approaches to addressing this issue. In this paper, we explore the ability of GPT-4V with a set of representative traffic incident videos and delve into the model's capacity to understand these complex traffic situations. We observe that GPT-4V demonstrates remarkable cognitive, reasoning, and decision-making ability in certain classic traffic events. At the same time, we identify certain limitations of GPT-4V that constrain its understanding in more intricate scenarios. These limitations merit further exploration and resolution.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the problem of recognizing and understanding complex traffic events, particularly traffic accidents. Specifically, it explores how GPT-4V, a large vision-language model (VLM), handles a range of typical traffic events. The authors evaluate GPT-4V on a series of representative traffic event videos in the following aspects:

1. **Identifying Traffic Events**: Whether GPT-4V can correctly identify the traffic events in the videos.
2. **Describing Traffic Events**: Whether GPT-4V can accurately describe the details of the events, including the type of accident, vehicle information, etc.
3. **Causal Reasoning**: Whether GPT-4V can reason plausibly about the causes of the accidents.
4. **Decision-Making Ability**: Whether GPT-4V can propose reasonable emergency response measures.

Through the analysis of successful and failed cases, the paper shows that GPT-4V performs well on some classic traffic events, and it also points out limitations in handling more complex scenarios. These limitations include insufficient spatial reasoning ability, difficulty in recognizing small objects, and poor performance in nighttime or long-distance scenes. By analyzing these cases in detail, the authors hope to offer a reference for future research on improving the capability of vision-language models in traffic event identification and understanding.
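The paper's analysis is qualitative and case-based. As a rough illustration only, the four evaluation aspects could be tabulated per video as in the sketch below. Everything in it is an assumption made for illustration and not the authors' method: the aspect keys, the `CaseReview` class, the `tally` helper, and the clip identifiers are hypothetical, and the sample judgments are placeholders, not results from the paper.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical keys for the four aspects listed in the summary.
ASPECTS = ("identification", "description", "causal_reasoning", "decision_making")


@dataclass
class CaseReview:
    """A human reviewer's judgment of one model response to one video."""
    video_id: str            # hypothetical clip identifier
    passed: Dict[str, bool]  # aspect name -> judged acceptable or not


def tally(reviews: List[CaseReview]) -> Dict[str, int]:
    """Count, for each aspect, how many reviewed cases were judged acceptable."""
    counts = Counter()
    for review in reviews:
        for aspect in ASPECTS:
            # A missing aspect is treated as not passed.
            if review.passed.get(aspect, False):
                counts[aspect] += 1
    return {aspect: counts[aspect] for aspect in ASPECTS}


# Placeholder judgments for illustration only; these are not the paper's data.
reviews = [
    CaseReview("clip_a", {a: True for a in ASPECTS}),
    CaseReview("clip_b", {"identification": True, "description": False,
                          "causal_reasoning": False, "decision_making": True}),
]

print(tally(reviews))
```

Keeping the per-case judgments separate from the tally makes it possible to inspect individual failures, such as nighttime or long-distance scenes, which is closer to how the paper discusses its cases than a single aggregate score would be.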