Abstract: AI for science (AI4S) is an emerging research field that aims to enhance the accuracy and speed of scientific computing tasks using machine learning methods. Traditional AI benchmarking methods struggle to adapt to the unique challenges posed by AI4S because they assume data in training, testing, and future real-world queries are independent and identically distributed, while AI4S workloads anticipate out-of-distribution problem instances. This paper investigates the need for a novel approach to effectively benchmark AI for science, using the machine learning force field (MLFF) as a case study. MLFF is a method to accelerate molecular dynamics (MD) simulation with low computational cost and high accuracy. We identify various missed opportunities in scientifically meaningful benchmarking and propose solutions to evaluate MLFF models, specifically in the aspects of sample efficiency, time domain sensitivity, and cross-dataset generalization capabilities. By setting up the problem instantiation similar to the actual scientific applications, more meaningful performance metrics from the benchmark can be achieved. This suite of metrics has demonstrated a better ability to assess a model's performance in real-world scientific applications, in contrast to traditional AI benchmarking methodologies. This work is a component of the SAIBench project, an AI4S benchmarking suite. The project homepage is https://www.computercouncil.org/SAIBench.

What problem does this paper attempt to address?

### Problem Addressed by the Paper

The paper explores the application of artificial intelligence to scientific research (AI4S), specifically machine learning force field (MLFF) methods for molecular dynamics (MD) simulations. Its main objective is to assess whether existing benchmarking methods are suitable for measuring the performance of AI in scientific research tasks, and to propose a new set of evaluation criteria for MLFF.

#### Core Issues

- **Are traditional AI benchmarking methods applicable to AI4S?** Traditional AI benchmarking assumes that training data, test data, and future real-world application data are independent and identically distributed. In AI4S, and especially in MD simulations, this assumption no longer holds, because out-of-distribution data may be encountered.
- **How to effectively evaluate MLFF models?** MLFF models are used to accelerate MD simulations, but existing evaluation methods may not fully reflect their performance in actual scientific applications.

### Solution

The paper addresses the above issues through the following points:

1. **Sample Efficiency Evaluation:** Examines the performance of MLFF models in data-sparse situations, in contrast with traditional AI tasks (such as large-scale language models or image recognition), where data is usually abundant.
2. **Time Domain Sensitivity Evaluation:** Uses the temporal structure of the data generated by MD simulations to assess the model's sensitivity to time series.
3. **Cross-Dataset Generalization Ability Evaluation:** Unlike traditional AI benchmarking, which treats different datasets as independent entities, the paper proposes a generalized testing method across datasets to evaluate the model's performance on unseen data.

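
The first two evaluation points can be illustrated with a minimal sketch. It assumes an MD trajectory is stored as time-ordered frames indexed `0..n_frames-1`. The function names (`random_split`, `temporal_split`, `sample_efficiency_subsets`) are hypothetical and are not taken from the paper or from SAIBench. The sketch only shows how an IID-style shuffled split differs from a time-ordered split, and how nested training subsets for a learning curve might be drawn.

```python
import numpy as np

def random_split(n_frames, test_fraction, seed=0):
    """IID-style split: frames are shuffled before being divided."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_frames)
    n_test = int(round(n_frames * test_fraction))
    # Sort only for readability; membership is what matters.
    return np.sort(order[n_test:]), np.sort(order[:n_test])

def temporal_split(n_frames, test_fraction):
    """Time-ordered split: train on early frames, test on later frames."""
    n_test = int(round(n_frames * test_fraction))
    cut = n_frames - n_test
    return np.arange(cut), np.arange(cut, n_frames)

def sample_efficiency_subsets(train_idx, sizes, seed=0):
    """Nested training subsets of increasing size for a learning curve.

    Sizes larger than the available training set are skipped.
    """
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(train_idx)
    return {k: np.sort(shuffled[:k]) for k in sizes if k <= len(shuffled)}
```

Under these assumptions, a model would be trained once per subset returned by `sample_efficiency_subsets` and scored on the same held-out frames, and the gap between its error under `random_split` and under `temporal_split` gives a rough indication of time-domain sensitivity.
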
Through these evaluation methods, the paper aims to reveal the limitations of existing evaluation frameworks and propose improvements to better assess the performance of MLFF models in real-world scientific application scenarios. Additionally, the paper discovers an interesting phenomenon: there is a correlation between the model's test results and a similarity measure known as Smooth Overlap of Atomic Positions (SOAP), which can help improve the simulation process.
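
The SOAP observation can be sketched as follows, assuming SOAP descriptor vectors have already been computed by some external tool and are available as plain arrays. The helper names are hypothetical, and this is not the paper's analysis code. The similarity used here is the normalized dot-product form commonly used for SOAP kernels, raised to a power `zeta`. How a per-structure similarity to the training set is aggregated (the maximum is used here) is an assumption of this sketch.

```python
import numpy as np

def normalized_kernel(p_a, p_b, zeta=1):
    """Normalized dot-product similarity between two descriptor vectors.

    Assumes non-negative descriptors (as for SOAP power spectra), so the
    cosine lies in [0, 1] and raising it to `zeta` is well defined.
    """
    p_a = np.asarray(p_a, dtype=float)
    p_b = np.asarray(p_b, dtype=float)
    cos = p_a @ p_b / (np.linalg.norm(p_a) * np.linalg.norm(p_b))
    return float(cos ** zeta)

def max_similarity_to_training(test_desc, train_descs, zeta=1):
    """Similarity of one test structure to its closest training structure."""
    return max(normalized_kernel(test_desc, t, zeta) for t in train_descs)

def pearson(x, y):
    """Pearson correlation coefficient between two 1-D sequences."""
    return float(np.corrcoef(x, y)[0, 1])
```

With per-structure test errors in hand, `pearson(similarities, errors)` would quantify how strongly error tracks similarity to the training data. A clearly negative value would be consistent with the correlation the summary describes.
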

Does AI for science need another ImageNet Or totally different benchmarks? A case study of machine learning force fields
