Active Causal Learning for Decoding Chemical Complexities with Targeted Interventions

Zachary R. Fox, Ayana Ghosh
2024-04-06
Abstract: Predicting and enhancing inherent properties based on molecular structures is paramount to design tasks in medicine, materials science, and environmental management. Machine learning and deep learning approaches have become standard for such predictions, but they face challenges when applied across different datasets because they rely on correlations between molecular representations and target properties. These approaches typically depend on large datasets to capture the diversity within the chemical space, facilitating a more accurate approximation, interpolation, or extrapolation of the chemical behavior of molecules. In our research, we introduce an active learning approach that discerns underlying cause-effect relationships through strategic sampling with the use of a graph loss function. This method identifies the smallest subset of the dataset capable of encoding the most information representative of a much larger chemical space. The identified causal relations are then leveraged to conduct systematic interventions, optimizing the design task within a chemical space that the models have not encountered previously. While our implementation focused on the QM9 quantum-chemical dataset for a specific design task (finding molecules with a large dipole moment), our active causal learning approach, driven by intelligent sampling and interventions, holds potential for broader applications in molecular and materials design and discovery.
Machine Learning, Chemical Physics, Data Analysis, Statistics and Probability, Biomolecules
What problem does this paper attempt to address?
This paper addresses the reliance of machine learning models on correlations rather than causal relationships when analyzing chemical complexity. Current methods often need large amounts of data to capture the diversity of molecular space, and even then they may not predict chemical behavior accurately outside the data they were trained on. The researchers introduce an active causal learning approach that uses strategic sampling and a graph loss function to identify the smallest subset of the dataset that encodes the most information, revealing potential causal relationships. The identified relationships are then used for systematic interventions that optimize design tasks, even in unexplored chemical spaces.
The core of the paper is an active learning workflow built on causal discovery models, which learns structure-property relationships from data subsets and gradually extends them to the entire dataset. Causal analysis is performed with a linear causal model over features derived from SMILES strings and other molecular features, and active learning guided by graph metrics selects the data from which the causal graph is learned. Finally, causal interventions are used to design molecules with specific properties, such as larger dipole moments.
Experimental results show that the dataset selected by active learning converges to the global causal graph faster than randomly selected data, and that the causal model predicts dipole moments with accuracy comparable to that of a random forest model while preserving causal relationships. The paper also demonstrates how to design molecules with high dipole moments through interventions on the causal model, which has potential applications in organic chemistry, drug design, and other fields. In summary, the paper aims to improve the accuracy and interpretability of molecular design by understanding and using causal relationships, overcoming the limitations of relying solely on correlations.
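The three ingredients named above (a linear causal model, a graph metric comparing a subset graph with the full-data graph, and an intervention on the learned model) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes synthetic zero-mean data in place of QM9 features, a known causal ordering of the columns in place of a causal discovery algorithm, and structural Hamming distance as a stand-in for the paper's graph loss. All function and variable names are hypothetical.

```python
import numpy as np

def fit_linear_sem(X, threshold=0.2):
    """Fit x_j = sum_{i<j} W[i, j] * x_i + noise by least squares.

    Assumption: columns of X are already in causal order and zero-mean.
    Coefficients smaller than `threshold` are treated as absent edges.
    """
    d = X.shape[1]
    W = np.zeros((d, d))
    for j in range(1, d):
        coef, *_ = np.linalg.lstsq(X[:, :j], X[:, j], rcond=None)
        W[:j, j] = coef
    W[np.abs(W) < threshold] = 0.0
    return W

def structural_hamming_distance(W_a, W_b):
    """Count edges present in one weighted graph but not the other."""
    return int(np.sum((W_a != 0) != (W_b != 0)))

def intervene(W, do):
    """Expected value of each variable under do(x_k = v), zero-mean noise."""
    d = W.shape[0]
    x = np.zeros(d)
    for j in range(d):
        # An intervened variable ignores its parents; others follow the SEM.
        x[j] = do[j] if j in do else W[:j, j] @ x[:j]
    return x

# Synthetic ground truth: x0 -> x1 -> x2, with no direct x0 -> x2 edge.
W_true = np.array([[0.0, 0.8, 0.0],
                   [0.0, 0.0, 1.5],
                   [0.0, 0.0, 0.0]])

rng = np.random.default_rng(0)
noise = rng.normal(size=(2000, 3))
X_full = np.zeros_like(noise)
for j in range(3):
    X_full[:, j] = X_full[:, :j] @ W_true[:j, j] + noise[:, j]

W_full = fit_linear_sem(X_full)

# Graph distance between subset graphs and the full-data graph as the
# subset grows. Taking the first n rows is a placeholder for an
# acquisition rule, which the summary above does not specify.
shd_curve = {n: structural_hamming_distance(fit_linear_sem(X_full[:n]), W_full)
             for n in (20, 50, 200, 1000)}

# Predicted mean of the target x2 when x1 is set to 2.0 by intervention.
effect = intervene(W_full, {1: 2.0})
print(shd_curve, effect)
```

In an active learning setting, the subset would be grown by choosing the samples expected to reduce the graph distance fastest rather than by taking rows in order, and the intervention step would be applied to the molecular features that the learned graph identifies as causes of the dipole moment.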