Benchmarking the Capabilities of Large Language Models in Transportation System Engineering: Accuracy, Consistency, and Reasoning Behaviors

Usman Syed, Ethan Light, Xingang Guo, Huan Zhang, Lianhui Qin, Yanfeng Ouyang, Bin Hu

2024-08-16

Abstract: In this paper, we explore the capabilities of state-of-the-art large language models (LLMs) such as GPT-4, GPT-4o, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, Llama 3, and Llama 3.1 in solving selected undergraduate-level transportation engineering problems. We introduce TransportBench, a benchmark dataset that includes a sample of transportation engineering problems on a wide range of subjects in the context of planning, design, management, and control of transportation systems. This dataset is used by human experts to evaluate the capabilities of various commercial and open-source LLMs, especially their accuracy, consistency, and reasoning behaviors, in solving transportation engineering problems. Our comprehensive analysis uncovers the unique strengths and limitations of each LLM; for example, it shows the impressive accuracy and some unexpected inconsistent behaviors of Claude 3.5 Sonnet in solving TransportBench problems. Our study marks a thrilling first step toward harnessing artificial general intelligence for complex transportation challenges.

Artificial Intelligence, Computation and Language, Machine Learning

What problem does this paper attempt to address?

This paper evaluates how well current state-of-the-art large language models (LLMs) solve undergraduate-level problems in transportation systems engineering. The authors introduce a benchmark dataset called TransportBench, which covers a wide range of topics in transportation system planning, design, management, and control. Using this dataset, human experts assess the accuracy, consistency, and reasoning behaviors of various commercial and open-source LLMs. The main objective is to understand the unique strengths and limitations of each model. For example, the analysis shows that Claude 3.5 Sonnet achieves impressive accuracy but also exhibits some unexpected inconsistencies when solving TransportBench problems. The authors present the study as a first step toward leveraging artificial general intelligence for complex transportation challenges.

Capabilities of Large Language Models in Control Engineering: A Benchmark Study on GPT-4, Claude 3 Opus, and Gemini 1.0 Ultra

TransportationGames: Benchmarking Transportation Knowledge of (Multimodal) Large Language Models

Beyond Words: Evaluating Large Language Models in Transportation Planning

Large Language Models for Intelligent Transportation: A Review of the State of the Art and Challenges

CityBench: Evaluating the Capabilities of Large Language Models for Urban Tasks

NLPBench: Evaluating Large Language Models on Solving NLP Problems

Testing Large Language Models on Driving Theory Knowledge and Skills for Connected Autonomous Vehicles

Assessing Large Language Models in Mechanical Engineering Education: A Study on Mechanics-Focused Conceptual Understanding

Evaluating Large Language Models with Grid-Based Game Competitions: An Extensible LLM Benchmark and Leaderboard

SimulBench: Evaluating Language Models with Creative Simulation Tasks

LLM4DS: Evaluating Large Language Models for Data Science Code Generation

OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

Large Language Models for Autonomous Driving (LLM4AD): Concept, Benchmark, Simulation, and Real-Vehicle Experiment

MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration

Evaluating Large Language Models on Graphs: Performance Insights and Comparative Analysis

Putting GPT-4o to the Sword: A Comprehensive Evaluation of Language, Vision, Speech, and Multimodal Proficiency

Large Language Models as Automated Aligners for benchmarking Vision-Language Models

How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments

GameTraversalBenchmark: Evaluating Planning Abilities Of Large Language Models Through Traversing 2D Game Maps

STBench: Assessing the Ability of Large Language Models in Spatio-Temporal Analysis