MoleCLUEs: Molecular Conformers Maximally In-Distribution for Predictive Models

Michael Maser, Natasa Tagasovska, Jae Hyeon Lee, Andrew Watkins
2023-11-07
Abstract: Structure-based molecular ML (SBML) models can be highly sensitive to input geometries and give predictions with large variance. We present an approach to mitigate the challenge of selecting conformations for such models by generating conformers that explicitly minimize predictive uncertainty. To achieve this, we compute estimates of aleatoric and epistemic uncertainties that are differentiable w.r.t. latent posteriors. We then iteratively sample new latents in the direction of lower uncertainty by gradient descent. As we train our predictive models jointly with a conformer decoder, the new latent embeddings can be mapped to their corresponding inputs, which we call MoleCLUEs, or (molecular) counterfactual latent uncertainty explanations (Antorán et al., 2020). We assess our algorithm for the task of predicting drug properties from 3D structure with maximum confidence. We additionally analyze the structure trajectories obtained from conformer optimizations, which provide insight into the sources of uncertainty in SBML.
Machine Learning, Quantitative Methods
What problem does this paper attempt to address?
The paper addresses a challenge faced by structure-based molecular machine learning (SBML) models when predicting molecular properties: prediction uncertainty increases when input geometries fall outside the training data distribution. The authors propose a method called MoleCLUEs, which generates molecular conformations that minimize predictive uncertainty. Specifically, the problems addressed in the paper can be summarized as follows:

1. **Uncertainty in SBML models**: SBML models are highly sensitive to input geometries and may produce predictions with high variance. This is mainly due to the lack of guiding principles for selecting conformations of new molecules.
2. **Challenges in conformation selection**: Current practice often assumes that the conformations supplied at prediction time follow the same distribution as those in the training set. In real-world scenarios this assumption is hard to guarantee, and the model generalizes poorly when it fails.
3. **Requirements of high-stakes applications**: In applications such as drug discovery, predictions must be precise and reliable. It is therefore necessary to correct model biases introduced by 3D structure generation and to reduce uncertainty when predicting labels for out-of-distribution (OOD) input geometries.

To address these issues, the MoleCLUEs method proceeds through the following steps:

- **Differentiable uncertainty estimation**: Compute estimates of predictive uncertainty that are differentiable with respect to the latent posteriors, covering both aleatoric uncertainty (data noise) and epistemic uncertainty (lack of knowledge or data).
- **Counterfactual conformation generation**: Use these uncertainty estimates to guide sampling of new latent representations. Because the predictive model is trained jointly with a conformer decoder, each new latent can be mapped back to a new, in-distribution conformation, referred to as a MoleCLUE.
- **Optimization process**: Iteratively update the latent representation by gradient descent in the direction of lower uncertainty.
- **Evaluation**: Experiments validate that the MoleCLUEs method effectively reduces prediction uncertainty and improves prediction accuracy, especially when dealing with conformations with artificially added noise.

In summary, this study aims to improve the reliability and accuracy of SBML models in applications such as drug discovery by improving how molecular conformations are selected.
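The uncertainty-guided latent search summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the toy ensemble with linear mean and log-variance heads, the function names (`make_toy_ensemble`, `uncertainty`, `uncertainty_grad`, `latent_uncertainty_descent`), the latent proximity penalty `lam`, and the backtracking step size are all assumptions made for illustration. The conformer decoder that would map the optimized latent back to a 3D structure is omitted.

```python
import numpy as np


def make_toy_ensemble(n_members=5, latent_dim=4, seed=0):
    """Hypothetical stand-in for a trained predictor ensemble (not the paper's model)."""
    rng = np.random.default_rng(seed)
    w_mean = rng.normal(size=(n_members, latent_dim))          # linear mean heads
    w_logvar = 0.1 * rng.normal(size=(n_members, latent_dim))  # linear log-variance heads
    return w_mean, w_logvar


def uncertainty(z, w_mean, w_logvar):
    """Return (aleatoric, epistemic) uncertainty for latent z.

    Aleatoric: average predicted noise variance across ensemble members.
    Epistemic: variance of the members' mean predictions (disagreement).
    """
    mu = w_mean @ z
    var = np.exp(w_logvar @ z)
    return var.mean(), mu.var()


def uncertainty_grad(z, w_mean, w_logvar):
    """Analytic gradient of (aleatoric + epistemic) with respect to z."""
    mu = w_mean @ z
    var = np.exp(w_logvar @ z)
    grad_aleatoric = (var[:, None] * w_logvar).mean(axis=0)
    grad_epistemic = 2.0 * ((mu - mu.mean())[:, None] * w_mean).mean(axis=0)
    return grad_aleatoric + grad_epistemic


def latent_uncertainty_descent(z0, w_mean, w_logvar, lam=0.1, step=0.1, n_steps=50):
    """Gradient descent on total uncertainty plus an assumed proximity penalty.

    The penalty lam * ||z - z0||^2 keeps the new latent close to the starting
    one; it is an illustrative choice, not a detail taken from the paper.
    """
    def objective(z):
        aleatoric, epistemic = uncertainty(z, w_mean, w_logvar)
        return aleatoric + epistemic + lam * np.sum((z - z0) ** 2)

    z = z0.copy()
    history = [objective(z)]
    for _ in range(n_steps):
        grad = uncertainty_grad(z, w_mean, w_logvar) + 2.0 * lam * (z - z0)
        t = step
        # Backtracking: halve the step until the objective strictly decreases.
        while t > 1e-8 and objective(z - t * grad) >= history[-1]:
            t *= 0.5
        if t <= 1e-8:
            break  # no improving step found; treat as converged
        z = z - t * grad
        history.append(objective(z))
    return z, history


# Example usage on a random starting latent.
w_mean, w_logvar = make_toy_ensemble()
z_start = np.random.default_rng(1).normal(size=4)
z_opt, objective_history = latent_uncertainty_descent(z_start, w_mean, w_logvar)
print("uncertainty before:", sum(uncertainty(z_start, w_mean, w_logvar)))
print("uncertainty after: ", sum(uncertainty(z_opt, w_mean, w_logvar)))
```

In a real SBML setting the analytic gradient would presumably be replaced by automatic differentiation through the trained predictor, and the optimized latent would be passed through the jointly trained decoder to obtain the corresponding conformer.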