SimBench: A Rule-Based Multi-Turn Interaction Benchmark for Evaluating an LLM's Ability to Generate Digital Twins

Jingquan Wang, Harry Zhang, Huzaifa Mustafa Unjhawala, Peter Negrut, Shu Wang, Khailanii Slaton, Radu Serban, Jin-Long Wu, Dan Negrut
2024-08-22
Abstract: We introduce SimBench, a benchmark designed to evaluate the proficiency of student large language models (S-LLMs) in generating digital twins (DTs) that can be used in simulators for virtual testing. Given a collection of S-LLMs, this benchmark enables the ranking of the S-LLMs based on their ability to produce high-quality DTs. We demonstrate this by comparing over 20 open- and closed-source S-LLMs. Using multi-turn interactions, SimBench employs a rule-based judge LLM (J-LLM) that leverages both predefined rules and human-in-the-loop guidance to assign scores for the DTs generated by the S-LLM, thus providing a consistent and expert-inspired evaluation protocol. The J-LLM is specific to a simulator, and herein the proposed benchmarking approach is demonstrated in conjunction with the Chrono multi-physics simulator. Chrono provided the backdrop used to assess an S-LLM in relation to the latter's ability to create digital twins for multibody dynamics, finite element analysis, vehicle dynamics, robotic dynamics, and sensor simulations. The proposed benchmarking principle is broadly applicable and enables the assessment of an S-LLM's ability to generate digital twins for other simulation packages. All code and data are available at https://github.com/uwsbel/SimBench.
Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses how to evaluate the ability of student large language models (S-LLMs) to generate digital twins (DTs). It proposes SimBench, a benchmark that uses multi-turn interactions to assess whether an S-LLM can produce high-quality DTs for use in a simulator. The problem is described in detail below.

### 1. **Research Background and Motivation**

As digital twin technology matures, non-expert users want to generate models of complex physical systems by interacting with an S-LLM. For example, a user may request a model of the VIPER lunar rover that includes terrain, sensors, and autonomous navigation. However, no dedicated evaluation protocol currently measures how well S-LLMs generate such complex models.

### 2. **Deficiencies of Existing Methods**

Similarity-based metrics such as CodeBLEU and ROUGE-L transfer poorly to real simulation applications because they are too coarse to capture the nuances of simulation tasks. Pass/fail metrics such as pass@k are too strict in the opposite direction: even a minor defect in the generated DT results in a score of zero.

### 3. **Proposal of SimBench**

To address these problems, the authors propose SimBench, a rule-based multi-turn interaction benchmark. SimBench evaluates S-LLMs in the following ways:

- **Multi-turn Interaction**: SimBench increases task complexity gradually over several turns in order to evaluate the S-LLM's ability comprehensively.
- **Rule-based Scoring Mechanism**: SimBench introduces a rule-based judge LLM (J-LLM), which combines predefined rules with feedback from human experts to score the DTs generated by S-LLMs.
- **Simulator-specific Evaluation**: The J-LLM is specific to one simulator. This paper uses the Chrono multi-physics simulator, which covers multibody dynamics, finite element analysis, vehicle dynamics, robotic dynamics, and sensor simulation.

### 4. **Specific Objectives**

The main objectives of SimBench are:

- **Establish Evaluation Criteria**: Provide a reliable criterion for judging the ability of S-LLMs to generate DTs.
- **Improve Generation Quality**: Use evaluation and feedback to help improve the ability of S-LLMs to generate high-quality DTs.
- **Generality**: Although SimBench is demonstrated with the Chrono simulator, its method can be extended to other simulators, such as OpenFOAM or PyBullet.

### 5. **Contributions**

The main contributions of the paper include:

- **First Proposed Evaluation Framework**: SimBench is the first benchmark specifically designed to evaluate the ability of S-LLMs to generate DTs.
- **High-Quality Dataset**: A high-quality dataset is constructed, containing 102 demonstration tasks that cover 34 different physical systems, with three tasks of different complexity for each system.
- **General Evaluation Method**: The SimBench method can be applied to other simulators; only the dataset needs to be adapted to the target simulator.

### Summary

This paper addresses how to evaluate effectively the ability of S-LLMs to generate digital twins, and for this purpose it proposes the SimBench benchmark. By combining multi-turn interaction with a rule-based scoring mechanism, SimBench can evaluate the performance of S-LLMs on complex simulation tasks more accurately than similarity-based or pass/fail metrics, thereby supporting the development and application of digital twin technology.
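
To make the contrast in Section 2 concrete, the sketch below compares an all-or-nothing score (in the spirit of a pass/fail metric) with a weighted, rule-based score averaged over several turns. It is a minimal illustration, not the SimBench implementation: the rule names, the weights, the boolean checks, and the averaging across turns are all assumptions made for this example, whereas in SimBench the scores are assigned by a judge LLM.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One rubric item. Names and weights below are hypothetical."""
    name: str
    weight: float


# Hypothetical rubric for a generated digital twin (not taken from the paper).
RUBRIC = [
    Rule("script_runs", 40.0),
    Rule("bodies_and_joints_defined", 30.0),
    Rule("sensors_attached", 20.0),
    Rule("visualization_configured", 10.0),
]


def rule_based_score(checks: dict[str, bool]) -> float:
    """Sum the weights of satisfied rules; a missing rule counts as failed."""
    return sum(r.weight for r in RUBRIC if checks.get(r.name, False))


def all_or_nothing_score(checks: dict[str, bool]) -> float:
    """Full marks only if every rule is satisfied, otherwise zero."""
    total = sum(r.weight for r in RUBRIC)
    return total if all(checks.get(r.name, False) for r in RUBRIC) else 0.0


def multi_turn_score(turns: list[dict[str, bool]]) -> float:
    """Average the rule-based score over turns (an assumed aggregation)."""
    if not turns:
        raise ValueError("at least one turn is required")
    return sum(rule_based_score(t) for t in turns) / len(turns)


# Three turns of increasing complexity for one hypothetical physical system.
turns = [
    {"script_runs": True, "bodies_and_joints_defined": True,
     "sensors_attached": True, "visualization_configured": True},
    {"script_runs": True, "bodies_and_joints_defined": True,
     "sensors_attached": False, "visualization_configured": True},
    {"script_runs": True, "bodies_and_joints_defined": True,
     "sensors_attached": False, "visualization_configured": False},
]

if __name__ == "__main__":
    for i, t in enumerate(turns, start=1):
        print(f"turn {i}: rule-based={rule_based_score(t):.1f}, "
              f"all-or-nothing={all_or_nothing_score(t):.1f}")
    print(f"multi-turn average: {multi_turn_score(turns):.2f}")
```

In this toy example the second and third turns each contain a defect, so the all-or-nothing score drops to zero for both, while the rule-based score still records how much of each digital twin is correct.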