Is machine learning good or bad for the natural sciences?

David W. Hogg, Soledad Villar
2024-06-01
Abstract: Machine learning (ML) methods are having a huge impact across all of the sciences. However, ML has a strong ontology - in which only the data exist - and a strong epistemology - in which a model is considered good if it performs well on held-out training data. These philosophies are in strong conflict with both standard practices and key philosophies in the natural sciences. Here we identify some locations for ML in the natural sciences at which the ontology and epistemology are valuable. For example, when an expressive machine learning model is used in a causal inference to represent the effects of confounders, such as foregrounds, backgrounds, or instrument calibration parameters, the model capacity and loose philosophy of ML can make the results more trustworthy. We also show that there are contexts in which the introduction of ML introduces strong, unwanted statistical biases. For one, when ML models are used to emulate physical (or first-principles) simulations, they amplify confirmation biases. For another, when expressive regressions are used to label datasets, those labels cannot be used in downstream joint or ensemble analyses without taking on uncontrolled biases. The question in the title is being asked of all of the natural sciences; that is, we are calling on the scientific communities to take a step back and consider the role and value of ML in their fields; the (partial) answers we give here come from the particular perspective of physics.
Machine Learning; Instrumentation and Methods for Astrophysics; Data Analysis, Statistics and Probability
What problem does this paper attempt to address?
This paper asks whether machine learning (ML) is beneficial or harmful for the natural sciences. The authors point out that although ML is widely used across scientific fields, its ontology (in which only the data exist) and its epistemology (in which a model is judged good if it performs well on held-out data) conflict with both the standard practices and the key philosophies of the natural sciences. They identify settings in which ML is valuable, especially causal inference, where an expressive model that represents the effects of confounders (such as foregrounds, backgrounds, or instrument calibration parameters) can make the results more trustworthy. However, ML can also introduce strong, unwanted statistical biases: ML models used to emulate physical (first-principles) simulations amplify confirmation biases, and labels produced by expressive regressions cannot be used in downstream joint or ensemble analyses without taking on uncontrolled biases. The authors emphasize that ML has a safe and necessary place in certain operational aspects of scientific projects, but that its role and value in understanding natural phenomena remain unclear. They call on the scientific communities to step back and consider the role and value of ML in their fields, and note that their (partial) answers come from the particular perspective of physics. Overall, the paper concludes that ML brings both benefits and potential problems to the natural sciences and should be used with caution.
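The point about regression-produced labels can be made concrete with a toy simulation. The sketch below is an illustration under stated assumptions, not the authors' analysis: it uses an ordinary least-squares fit as a deliberately simple stand-in for an "expressive regression", Gaussian labels and noise, and a hypothetical function name `simulate_label_bias`. A regressor trained on a population with mean label 0 is applied to a population with mean label 1; because the features are noisy, each predicted label is pulled toward the training mean, so the mean and variance of the predicted labels (an ensemble analysis) differ from those of the true labels.

```python
import numpy as np


def simulate_label_bias(n_train=20000, n_test=20000, noise_sigma=1.0,
                        test_mean=1.0, seed=0):
    """Toy model (an assumption for illustration): labels from a noisy regression.

    Returns the mean and variance of the true test labels and of the
    regression-predicted test labels, plus the fitted slope.
    """
    rng = np.random.default_rng(seed)

    # Training population: true labels ~ N(0, 1), seen through noisy features.
    y_train = rng.normal(0.0, 1.0, n_train)
    x_train = y_train + rng.normal(0.0, noise_sigma, n_train)

    # Least-squares fit of label on feature (linear stand-in for an ML regressor).
    slope, intercept = np.polyfit(x_train, y_train, 1)

    # Test population: same noise model, but the true labels have a shifted mean.
    y_test = rng.normal(test_mean, 1.0, n_test)
    x_test = y_test + rng.normal(0.0, noise_sigma, n_test)
    y_label = slope * x_test + intercept  # labels a downstream analysis would use

    return {
        "slope": float(slope),
        "true_mean": float(y_test.mean()),
        "label_mean": float(y_label.mean()),
        "true_var": float(y_test.var()),
        "label_var": float(y_label.var()),
    }


if __name__ == "__main__":
    for name, value in simulate_label_bias().items():
        print(f"{name}: {value:.3f}")
```

In this toy setup the fitted slope approaches 1 / (1 + noise_sigma**2), so with the default settings the predicted labels recover only about half of the true shift in the population mean and about half of the true variance. Each individual label can still be a good prediction in the held-out mean-squared-error sense, which is why this kind of bias is easy to miss.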