Probing out-of-distribution generalization in machine learning for materials

Kangming Li, Andre Niyongabo Rubungo, Xiangyun Lei, Daniel Persaud, Kamal Choudhary, Brian DeCost, Adji Bousso Dieng, Jason Hattrick-Simpers
2024-06-11
Abstract: Scientific machine learning (ML) endeavors to develop generalizable models with broad applicability. However, the assessment of generalizability is often based on heuristics. Here, we demonstrate in the materials science setting that heuristics-based evaluations lead to substantially biased conclusions about ML generalizability and the benefits of neural scaling. We evaluate generalization performance in over 700 out-of-distribution tasks that feature new chemistry or structural symmetry not present in the training data. Surprisingly, good performance is found in most tasks and across various ML models, including simple boosted trees. Analysis of the materials representation space reveals that most tasks contain test data that lie in regions well covered by training data, while poorly performing tasks contain mainly test data outside the training domain. For the latter case, increasing training set size or training time has marginal or even adverse effects on the generalization performance, contrary to what the neural scaling paradigm assumes. Our findings show that most heuristically defined out-of-distribution tests are not genuinely difficult and evaluate only the ability to interpolate. Evaluating on such tasks rather than the truly challenging ones can lead to an overestimation of generalizability and the benefits of scaling.
Materials Science
What problem does this paper attempt to address?
The paper examines how well machine learning (ML) models in materials science generalize to out-of-distribution (OOD) tasks, and whether the usual way of evaluating that generalization can be trusted. Current evaluations typically rely on simple heuristics to define what counts as OOD. These heuristics can be subjective, vary from study to study, and give a misleading picture of generalization ability.

The paper systematically analyzes over 700 OOD tasks in which the test data feature chemistry or structural symmetry absent from the training data. Across a range of ML models, including simple boosted trees, performance is good on most of these tasks. In the tasks where models perform poorly, the test data lie mainly outside the training domain. For these challenging tasks, increasing the training set size or training time has marginal or even adverse effects on generalization, contrary to what the neural scaling paradigm assumes.

In summary, most heuristically defined OOD tests are not genuinely difficult: they assess interpolation rather than extrapolation. Evaluating on such tasks instead of the truly challenging ones can overestimate both model generalizability and the benefits of scaling. By analyzing the materials representation space, the paper further illustrates what separates well-performing from poorly performing tasks and proposes a way to distinguish data that are statistically out-of-distribution from data that are representationally out-of-distribution. These findings suggest that OOD tasks chosen by heuristics may give a biased view of model generalization.
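To make the distinction between a heuristic OOD split and actual coverage in representation space concrete, here is a minimal Python sketch. It is not the paper's method. It assumes that each material is already encoded as a numeric feature vector, that a heuristic OOD task is built by holding out one group label (for example an element or a symmetry class), and that "well covered" can be approximated by comparing each test point's nearest-training-neighbor distance with typical nearest-neighbor distances inside the training set. The function names, the Euclidean metric, and the 0.95 quantile threshold are all illustrative assumptions.

```python
import numpy as np

def leave_one_group_out(groups, held_out):
    """Heuristic OOD split: test = samples whose group label equals held_out."""
    groups = np.asarray(groups)
    test_mask = groups == held_out
    return ~test_mask, test_mask

def nn_distance(X_ref, X_query, exclude_self=False):
    """Euclidean distance from each query point to its nearest reference point."""
    d = np.linalg.norm(X_query[:, None, :] - X_ref[None, :, :], axis=-1)
    if exclude_self:
        # Only valid when X_query is X_ref: ignore each point's zero distance to itself.
        np.fill_diagonal(d, np.inf)
    return d.min(axis=1)

def coverage_fraction(X_train, X_test, quantile=0.95):
    """Fraction of test points that sit about as close to the training data
    as training points sit to each other (hypothetical coverage proxy)."""
    train_nn = nn_distance(X_train, X_train, exclude_self=True)
    threshold = np.quantile(train_nn, quantile)
    return float(np.mean(nn_distance(X_train, X_test) <= threshold))

# Synthetic illustration (made-up data, not results from the paper):
# groups A and B overlap in feature space, group C is far from both.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(200, 2)),   # group A
    rng.normal(0.0, 1.0, size=(200, 2)),   # group B
    rng.normal(10.0, 1.0, size=(200, 2)),  # group C
])
groups = np.array(["A"] * 200 + ["B"] * 200 + ["C"] * 200)

for held_out in ("B", "C"):
    train_mask, test_mask = leave_one_group_out(groups, held_out)
    frac = coverage_fraction(X[train_mask], X[test_mask])
    print(f"hold out {held_out}: covered fraction = {frac:.2f}")
```

In this toy setup both held-out groups are OOD by the labeling heuristic, but only group C falls outside the region spanned by the training features. A check of this kind, run before interpreting OOD scores, gives a rough indication of whether a task probes interpolation or extrapolation.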