Does AI for science need another ImageNet Or totally different benchmarks? A case study of machine learning force fields

Yatao Li, Wanling Gao, Lei Wang, Lixin Sun, Zun Wang, Jianfeng Zhan
2023-08-11
Abstract: AI for science (AI4S) is an emerging research field that aims to enhance the accuracy and speed of scientific computing tasks using machine learning methods. Traditional AI benchmarking methods struggle to adapt to the unique challenges posed by AI4S because they assume data in training, testing, and future real-world queries are independent and identically distributed, while AI4S workloads anticipate out-of-distribution problem instances. This paper investigates the need for a novel approach to effectively benchmark AI for science, using the machine learning force field (MLFF) as a case study. MLFF is a method to accelerate molecular dynamics (MD) simulation with low computational cost and high accuracy. We identify various missed opportunities in scientifically meaningful benchmarking and propose solutions to evaluate MLFF models, specifically in the aspects of sample efficiency, time domain sensitivity, and cross-dataset generalization capabilities. By setting up the problem instantiation similar to the actual scientific applications, more meaningful performance metrics from the benchmark can be achieved. This suite of metrics has demonstrated a better ability to assess a model's performance in real-world scientific applications, in contrast to traditional AI benchmarking methodologies. This work is a component of the SAIBench project, an AI4S benchmarking suite. The project homepage is https://www.computercouncil.org/SAIBench.
Machine Learning, Computational Physics
What problem does this paper attempt to address?
### Problem Addressed by the Paper

The paper explores the application of artificial intelligence to scientific research (AI4S), specifically machine learning force field (MLFF) methods for molecular dynamics (MD) simulations. Its main objective is to assess whether existing benchmarking methods are suitable for measuring the performance of AI in scientific research tasks, and to propose a new set of evaluation criteria for MLFF.

#### Core Issues

- **Are traditional AI benchmarking methods applicable to AI4S?** Traditional AI benchmarking assumes that training data, testing data, and future real-world queries are independent and identically distributed. In AI4S, and especially in MD simulations, this assumption no longer holds, because out-of-distribution data may be encountered.
- **How to effectively evaluate MLFF models?** MLFF models are used to accelerate MD simulations, but existing evaluation methods may not fully reflect their performance in actual scientific applications.

### Solution

The paper addresses these issues through three evaluations:

1. **Sample Efficiency Evaluation:** Examines the performance of MLFF models when data is sparse, in contrast with traditional AI tasks (such as large-scale language models or image recognition), where data is usually abundant.
2. **Time Domain Sensitivity Evaluation:** Uses the temporal character of the data generated by MD simulations to assess the model's sensitivity to time series.
3. **Cross-Dataset Generalization Ability Evaluation:** Unlike traditional AI benchmarking, which treats different datasets as independent entities, the paper proposes a cross-dataset generalization test to evaluate the model's performance on unseen data.

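As a rough illustration, the three evaluations can be thought of as different ways of splitting data before training and testing. The sketch below is not the paper's implementation: the function names, the 80/20 split, the subset sizes, and the dataset labels are all hypothetical, and reading "time domain sensitivity" as a train-on-early, test-on-late split is only one plausible interpretation of the summary.

```python
import numpy as np

def random_split(n_frames, test_fraction, rng):
    """Conventional i.i.d. baseline: shuffle frame indices, then cut."""
    idx = rng.permutation(n_frames)
    n_test = int(round(n_frames * test_fraction))
    return np.sort(idx[n_test:]), np.sort(idx[:n_test])

def time_ordered_split(n_frames, test_fraction):
    """Assumed time-domain protocol: train on early frames, test on later ones."""
    n_test = int(round(n_frames * test_fraction))
    idx = np.arange(n_frames)
    return idx[: n_frames - n_test], idx[n_frames - n_test:]

def sample_efficiency_subsets(train_idx, sizes, rng):
    """Nested training subsets of increasing size, for a learning curve."""
    order = rng.permutation(train_idx)
    return {size: np.sort(order[:size]) for size in sizes if size <= len(order)}

def cross_dataset_pairs(dataset_names):
    """All (train_on, test_on) pairs that use two different datasets."""
    return [(a, b) for a in dataset_names for b in dataset_names if a != b]

# Hypothetical usage on a trajectory of 1000 frames.
rng = np.random.default_rng(0)
train_random, test_random = random_split(1000, 0.2, rng)
train_time, test_time = time_ordered_split(1000, 0.2)
subsets = sample_efficiency_subsets(train_time, [50, 100, 200], rng)
pairs = cross_dataset_pairs(["dataset_A", "dataset_B", "dataset_C"])
```

Comparing a model's error on `test_random` with its error on `test_time` would indicate how much the i.i.d. assumption flatters the result; the nested subsets keep learning-curve points comparable because each smaller training set is contained in the larger ones.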
Through these evaluation methods, the paper aims to reveal the limitations of existing evaluation frameworks and propose improvements to better assess the performance of MLFF models in real-world scientific application scenarios. Additionally, the paper discovers an interesting phenomenon: there is a correlation between the model's test results and a similarity measure known as Smooth Overlap of Atomic Positions (SOAP), which can help improve the simulation process.
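To make the SOAP observation concrete, the following sketch shows how such a correlation could be checked once descriptor vectors are available. It assumes SOAP descriptors have already been computed by some external tool and uses a normalized dot-product kernel raised to a power `zeta`, a common form for SOAP similarity; the function names, the choice of `zeta`, and the random placeholder descriptors are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def normalized_kernel(desc_a, desc_b, zeta=2):
    """Normalized dot-product kernel between two sets of descriptor vectors."""
    a = desc_a / np.linalg.norm(desc_a, axis=1, keepdims=True)
    b = desc_b / np.linalg.norm(desc_b, axis=1, keepdims=True)
    return (a @ b.T) ** zeta

def max_similarity_to_training(test_desc, train_desc, zeta=2):
    """For each test configuration, similarity to its closest training configuration."""
    return normalized_kernel(test_desc, train_desc, zeta).max(axis=1)

def pearson(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])

# Placeholder descriptors only: real SOAP vectors would come from an external tool.
rng = np.random.default_rng(0)
train_desc = rng.random((200, 16))
test_desc = rng.random((50, 16))
similarity = max_similarity_to_training(test_desc, train_desc)

# per_config_error would hold the model's measured error on each test
# configuration; it is left to the reader because no real results are assumed here.
# correlation = pearson(similarity, per_config_error)
```

A strongly negative correlation would mean that configurations unlike anything in the training set tend to have larger errors, which is the kind of signal that could flag unreliable regions during a simulation.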