Abstract: We introduce SimBench, a benchmark designed to evaluate the proficiency of student large language models (S-LLMs) in generating digital twins (DTs) that can be used in simulators for virtual testing. Given a collection of S-LLMs, this benchmark enables the ranking of the S-LLMs based on their ability to produce high-quality DTs. We demonstrate this by comparing over 20 open- and closed-source S-LLMs. Using multi-turn interactions, SimBench employs a rule-based judge LLM (J-LLM) that leverages both predefined rules and human-in-the-loop guidance to assign scores for the DTs generated by the S-LLM, thus providing a consistent and expert-inspired evaluation protocol. The J-LLM is specific to a simulator, and herein the proposed benchmarking approach is demonstrated in conjunction with the Chrono multi-physics simulator. Chrono provided the backdrop used to assess an S-LLM in relation to the latter's ability to create digital twins for multibody dynamics, finite element analysis, vehicle dynamics, robotic dynamics, and sensor simulations. The proposed benchmarking principle is broadly applicable and enables the assessment of an S-LLM's ability to generate digital twins for other simulation packages. All code and data are available at https://github.com/uwsbel/SimBench.

What problem does this paper attempt to address?

This paper addresses the lack of a dedicated way to evaluate how well student large language models (S-LLMs) generate digital twins (DTs). It proposes SimBench, a benchmark that assesses, through multi-turn interactions, whether S-LLMs can produce high-quality DTs for use in simulators. The problem is described in detail below.

### 1. **Research Background and Motivation**

As digital twin technology develops, non-expert users want to generate models of complex physical systems by interacting with S-LLMs. For example, a user may request a model of the VIPER lunar rover that includes terrain, sensors, and autonomous navigation. However, no specialized evaluation protocol currently measures how well S-LLMs generate such models.

### 2. **Deficiencies of Existing Methods**

Similarity-based metrics such as CodeBLEU and ROUGE-L transfer poorly to practical simulation applications because they are too coarse to capture the nuances of simulation tasks. Basic pass/fail benchmarking such as pass@k, in turn, is too strict: even a minor defect in the generated DT results in a zero score.

### 3. **Proposal of SimBench**

To address these problems, the authors propose SimBench, a rule-based, multi-turn interaction benchmark. SimBench evaluates S-LLMs in the following ways:

- **Multi-turn Interaction**: SimBench gradually increases task complexity over several turns in order to evaluate the ability of S-LLMs comprehensively.
- **Rule-based Scoring Mechanism**: SimBench introduces a rule-based judge LLM (J-LLM), which combines predefined rules with guidance from human experts to score the DTs generated by S-LLMs.
- **Simulator-specific Evaluation**: The J-LLM is specific to one simulator. This paper uses the Chrono multi-physics simulator, which covers multibody dynamics, finite element analysis, vehicle dynamics, robotic dynamics, and sensor simulation.

### 4. **Specific Objectives**

The main objectives of SimBench are:

- **Establish Evaluation Criteria**: Provide a reliable criterion for judging the ability of S-LLMs to generate DTs.
- **Improve Generation Quality**: Through evaluation and feedback, help improve the ability of S-LLMs to generate high-quality DTs.
- **Generality**: Although SimBench is demonstrated with the Chrono simulator, its method can be extended to other simulators, such as OpenFOAM or PyBullet.

### 5. **Contributions**

The main contributions of the paper include:

- **First Proposed Evaluation Framework**: SimBench is described as the first benchmark specifically designed to evaluate the ability of S-LLMs to generate DTs.
- **High-Quality Dataset**: A high-quality dataset is constructed, containing 102 demonstration tasks that cover 34 different physical systems, with three tasks of different complexity for each system.
- **General Evaluation Method**: The SimBench method can be applied to other simulators; only the dataset needs to be adapted to the target simulator.

### Summary

The paper asks how to evaluate effectively the ability of S-LLMs to generate digital twins, and proposes the SimBench benchmark for this purpose. Through multi-turn interaction and a rule-based scoring mechanism, SimBench can evaluate the performance of S-LLMs on complex simulation tasks more accurately than similarity-based or pass/fail metrics, thus supporting the development and application of digital twin technology.
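
The contrast drawn above between all-or-nothing scoring and rule-based partial credit across several turns can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the rubric items, their weights, and the helper names are hypothetical and are not taken from SimBench, and the sets of satisfied rules stand in for judgments that, in the paper, a J-LLM would produce.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Rule:
    """One hypothetical rubric item (illustrative, not a SimBench rule)."""
    name: str
    weight: float


# Hypothetical rubric; names and weights are assumptions for this sketch.
RUBRIC = [
    Rule("system_initialized", 3.0),
    Rule("bodies_and_joints_defined", 4.0),
    Rule("visualization_configured", 1.0),
    Rule("simulation_loop_correct", 2.0),
]


def rubric_score(satisfied: set[str]) -> float:
    """Weighted fraction of rubric items satisfied, in [0, 1]."""
    total = sum(rule.weight for rule in RUBRIC)
    earned = sum(rule.weight for rule in RUBRIC if rule.name in satisfied)
    return earned / total


def pass_fail_score(satisfied: set[str]) -> float:
    """All-or-nothing score: 1.0 only if every rubric item is satisfied."""
    return 1.0 if all(rule.name in satisfied for rule in RUBRIC) else 0.0


def multi_turn_score(turns: list[set[str]]) -> float:
    """Average the rubric score over the turns of one interaction."""
    return mean(rubric_score(turn) for turn in turns)


# Three turns of increasing difficulty; each set lists the rules a
# (stand-in) judge considered satisfied by the generated DT script.
ALL_RULES = {rule.name for rule in RUBRIC}
turns = [
    ALL_RULES,
    ALL_RULES - {"visualization_configured"},
    ALL_RULES - {"visualization_configured", "simulation_loop_correct"},
]

if __name__ == "__main__":
    for index, turn in enumerate(turns, start=1):
        print(index, rubric_score(turn), pass_fail_score(turn))
    print("multi-turn rubric score:", multi_turn_score(turns))
```

In this toy setup a single missing rubric item drops the pass/fail score for that turn to zero, while the weighted rubric still credits the parts of the DT that are correct. That is the motivation the summary gives for preferring rule-based scoring to pass@k-style metrics.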

SimBench: A Rule-Based Multi-Turn Interaction Benchmark for Evaluating an LLM's Ability to Generate Digital Twins

SimulBench: Evaluating Language Models with Creative Simulation Tasks

LLM experiments with simulation: Large Language Model Multi-Agent System for Simulation Model Parametrization in Digital Twins

BeSimulator: A Large Language Model Powered Text-based Behavior Simulator

MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark

CS-Bench: A Comprehensive Benchmark for Large Language Models towards Computer Science Mastery

AgentBench: Evaluating LLMs as Agents

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data

How Far Are LLMs from Believable AI? A Benchmark for Evaluating the Believability of Human Behavior Simulation

SensorBench: Benchmarking LLMs in Coding-Based Sensor Processing

JudgeBench: A Benchmark for Evaluating LLM-based Judges

GenSim: Generating Robotic Simulation Tasks via Large Language Models

A User-Centric Multi-Intent Benchmark for Evaluating Large Language Models

SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

CoderUJB: An Executable and Unified Java Benchmark for Practical Programming Scenarios

A User-Centric Benchmark for Evaluating Large Language Models.

CEBench: A Benchmarking Toolkit for the Cost-Effectiveness of LLM Pipelines

AQA-Bench: An Interactive Benchmark for Evaluating LLMs' Sequential Reasoning Ability