Conformal Drug Property Prediction with Density Estimation under Covariate Shift

Siddhartha Laghuvarapu, Zhen Lin, Jimeng Sun
2023-10-18
Abstract: In drug discovery, it is vital to confirm the predictions of pharmaceutical properties from computational models using costly wet-lab experiments. Hence, obtaining reliable uncertainty estimates is crucial for prioritizing drug molecules for subsequent experimental validation. Conformal Prediction (CP) is a promising tool for creating such prediction sets for molecular properties with a coverage guarantee. However, the exchangeability assumption of CP is often challenged with covariate shift in drug discovery tasks: Most datasets contain limited labeled data, which may not be representative of the vast chemical space from which molecules are drawn. To address this limitation, we propose a method called CoDrug that employs an energy-based model leveraging both training data and unlabelled data, and Kernel Density Estimation (KDE) to assess the densities of a molecule set. The estimated densities are then used to weigh the molecule samples while building prediction sets and rectifying for distribution shift. In extensive experiments involving realistic distribution drifts in various small-molecule drug discovery tasks, we demonstrate the ability of CoDrug to provide valid prediction sets and its utility in addressing the distribution shift arising from de novo drug design models. On average, using CoDrug can reduce the coverage gap by over 35% when compared to conformal prediction sets not adjusted for covariate shift.
Machine Learning
What problem does this paper attempt to address?
This paper addresses how to keep the coverage of prediction sets for drug molecule properties valid when drug discovery data exhibit covariate shift. It proposes a method named CoDrug, which estimates the density of molecule sets with an energy-based model and kernel density estimation (KDE), then uses these densities to reweight the prediction sets and correct for the distribution shift. This yields valid uncertainty estimates on molecules generated by de novo drug design models, improving the efficiency and reliability of the drug discovery process.

### Background of the paper

In drug discovery, predictions of drug properties from computational models usually have to be confirmed with expensive wet-lab experiments. Reliable uncertainty estimates are therefore crucial for prioritizing drug molecules for subsequent experimental validation. Conformal prediction (CP) is a promising tool that creates prediction sets for molecular properties with a coverage guarantee. However, the exchangeability assumption of CP is often violated by covariate shift in drug discovery tasks: most datasets contain limited labeled data, which may not be representative of the vast chemical space from which molecules are drawn.

### Contributions of the paper

1. **Proposing the CoDrug method**: CoDrug estimates the density of molecule sets by combining an energy-based model, trained on both labeled training data and unlabeled data, with kernel density estimation (KDE), and uses these densities to adjust the prediction sets and correct for the distribution shift.
2. **Theoretical guarantee**: The paper proves that the kernel density estimator is consistent, so the covariate-shift correction becomes exact asymptotically and the coverage guarantee is recovered.
3. **Experimental verification**: Extensive experiments on various small-molecule drug discovery tasks demonstrate the effectiveness of CoDrug under realistic distribution drift. Compared with conformal prediction that is not adjusted for covariate shift, CoDrug reduces the coverage gap by more than 35% on average. On molecules generated by de novo drug design models, the coverage gap is reduced by 60% on average.

### Method overview

1. **Conformal prediction (CP) framework**:
   - **Case without covariate shift**: The paper first describes how conformal prediction constructs valid prediction sets when there is no covariate shift.
   - **Case with covariate shift**: It then describes how to restore the coverage of conformal prediction by correcting for the covariate shift.
2. **CoDrug training method**:
   - **Energy model**: An energy-based modeling framework is proposed. The model separates in-distribution from out-of-distribution data by introducing an additional regularization term.
   - **Density estimation**: Kernel density estimation (KDE) is used to estimate the densities of the calibration set and the test set, and these densities determine the weights used when building the prediction sets.
3. **Experimental results**:
   - **Performance of the baseline method**: Under different distribution shift conditions, unweighted conformal prediction shows poor coverage, especially under fingerprint splitting and scaffold splitting.
   - **Improvement from weighted conformal prediction**: With CoDrug, the coverage of the prediction sets improves significantly. Under fingerprint splitting in particular, coverage for some categories increases by more than 25%.
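
The density-weighting step described above can be illustrated with a minimal sketch of weighted conformal prediction for classification. This is not the authors' implementation. It makes several assumptions that do not come from the summary: the KDE is applied directly to whatever feature vectors are passed in (CoDrug obtains its features from the energy-based model), the nonconformity score is `1 - p(y|x)`, the bandwidth is an arbitrary fixed value, and all function names are hypothetical.

```python
import numpy as np


def gaussian_kde_density(train, query, bandwidth=0.5):
    """Gaussian KDE: density of each `query` point under the sample `train`.

    Both arrays have shape (n_points, n_features).
    """
    d = train.shape[1]
    diff = query[:, None, :] - train[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    norm = (2.0 * np.pi * bandwidth ** 2) ** (d / 2.0)
    return np.exp(-0.5 * sq_dist / bandwidth ** 2).mean(axis=1) / norm


def weighted_quantile(scores, weights, test_weight, alpha):
    """(1 - alpha) quantile of the weighted calibration scores.

    The test point's own weight is treated as mass placed at +infinity,
    so the result is +inf when the calibration mass is too small.
    """
    order = np.argsort(scores)
    scores, weights = scores[order], weights[order]
    total = weights.sum() + test_weight
    cum = np.cumsum(weights) / total
    # First index whose cumulative normalized weight reaches 1 - alpha.
    idx = np.searchsorted(cum, 1.0 - alpha, side="left")
    return scores[idx] if idx < len(scores) else np.inf


def prediction_set(class_probs, threshold):
    """Labels whose score 1 - p(y|x) does not exceed the threshold (assumed score)."""
    return [y for y, p in enumerate(class_probs) if 1.0 - p <= threshold]


def weighted_conformal_sets(cal_feats, cal_scores, test_feats, test_probs,
                            alpha=0.1, bandwidth=0.5):
    """Build one prediction set per test point using density-ratio weights."""
    eps = 1e-12  # guards against division by zero in low-density regions
    # Density ratio w(x) = p_test(x) / p_cal(x), estimated with two KDEs.
    w_cal = (gaussian_kde_density(test_feats, cal_feats, bandwidth)
             / (gaussian_kde_density(cal_feats, cal_feats, bandwidth) + eps))
    w_test = (gaussian_kde_density(test_feats, test_feats, bandwidth)
              / (gaussian_kde_density(cal_feats, test_feats, bandwidth) + eps))
    sets = []
    for probs, w in zip(test_probs, w_test):
        threshold = weighted_quantile(cal_scores, w_cal, w, alpha)
        sets.append(prediction_set(probs, threshold))
    return sets
```

Here `cal_scores` would hold `1 - p(true label | x)` for each calibration molecule and `test_probs` the classifier's class probabilities for each test molecule. When the calibration and test features follow the same distribution, the weights are roughly uniform and the procedure reduces to standard split conformal prediction. Under shift, calibration points that resemble the test molecules receive more weight in the quantile.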
### Conclusion

The CoDrug method improves the reliability and accuracy of drug property prediction by effectively dealing with covariate shift, providing strong support for the drug discovery process.