Abstract: Machine learning (ML) is increasingly used to guide biological discovery in biomedicine, for example by prioritizing promising small molecules in drug discovery. In these applications, ML models predict the properties of biological systems, and researchers use the predictions to prioritize candidates as new biological hypotheses for downstream experimental validation. However, when applied to unseen situations, these models can be overconfident and produce a large number of false positives. One solution is to quantify the model's prediction uncertainty and provide a set of hypotheses with a controlled false discovery rate (FDR) pre-specified by researchers. We propose CPEC, an ML framework for FDR-controlled biological discovery. We demonstrate its effectiveness using enzyme function annotation as a case study, simulating the discovery process of identifying the functions of less-characterized enzymes. CPEC integrates a deep learning model with a statistical tool known as conformal prediction, providing accurate and FDR-controlled function predictions for a given protein enzyme. Conformal prediction provides rigorous statistical guarantees for the predictive model and ensures that the expected FDR will not exceed a user-specified level with high probability. Evaluation experiments show that CPEC achieves reliable FDR control, better or comparable prediction performance at a lower FDR than existing methods, and accurate predictions for enzymes under-represented in the training data. We expect CPEC to be a useful tool for biological discovery applications where a high yield rate in validation experiments is desired but the experimental budget is limited.

Machine learning (ML) models are increasingly being applied as predictors to generate biological hypotheses and guide biological discovery. However, when applied to unseen situations, ML models can be overconfident and produce a large number of false positive predictions, which makes it difficult for researchers to balance high yield rates against limited budgets. One solution is to quantify the model's prediction uncertainty and generate predictions at a controlled false discovery rate (FDR) pre-specified by researchers. Here, we introduce CPEC, an ML framework designed for FDR-controlled biological discovery. Using enzyme function prediction as a case study, we simulate the process of function discovery for less-characterized enzymes. By leveraging conformal prediction, a statistical framework, CPEC provides rigorous statistical guarantees that the FDR of the model predictions will not surpass a user-specified level with high probability. Our results suggest that CPEC achieves reliable FDR control for enzymes under-represented in the training data. In the broader context of biological discovery applications, CPEC can be applied to generate high-confidence hypotheses and guide researchers to allocate experimental resources to the validation of hypotheses that are more likely to succeed.
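To make the idea of FDR-controlled selection concrete, the following is a minimal Python sketch of one generic approach: conformal p-values computed against held-out negatives, followed by the Benjamini-Hochberg procedure. It is not CPEC's actual procedure, which the abstract does not spell out. The function names, the scores, and the choice of alpha are all hypothetical, and the sketch assumes that a higher model score means a more likely true hit and that calibration negatives are exchangeable with the negatives among the test candidates.

```python
from typing import List, Sequence


def conformal_p_values(calib_null_scores: Sequence[float],
                       test_scores: Sequence[float]) -> List[float]:
    """Conformal p-value of each test score against held-out negative scores.

    calib_null_scores: model scores of calibration examples known to be negatives.
    A small p-value means few known negatives score as high as the candidate.
    """
    n = len(calib_null_scores)
    return [
        (1 + sum(1 for s in calib_null_scores if s >= t)) / (n + 1)
        for t in test_scores
    ]


def benjamini_hochberg(p_values: Sequence[float], alpha: float) -> List[int]:
    """Indices selected by the Benjamini-Hochberg procedure at target level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda j: p_values[j])
    k = 0  # largest rank whose p-value falls under the BH line alpha * rank / m
    for rank, j in enumerate(order, start=1):
        if p_values[j] <= alpha * rank / m:
            k = rank
    return sorted(order[:k])


# Hypothetical scores, purely for illustration.
calib_null_scores = [0.05, 0.10, 0.12, 0.20, 0.22, 0.30, 0.31, 0.35, 0.40]
test_scores = [0.95, 0.90, 0.33, 0.15]

p_values = conformal_p_values(calib_null_scores, test_scores)
selected = benjamini_hochberg(p_values, alpha=0.25)
print(p_values)
print(selected)
```

In this toy run only the two highest-scoring candidates are selected. With so few calibration negatives the smallest attainable p-value is 1/(n+1), so a larger calibration set is needed before a strict FDR target can select anything at all.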

Model-free selective inference and its applications to drug discovery

Selection by Prediction with Conformal p-values

Conformal Selection for Efficient and Accurate Compound Screening in Drug Discovery

Model-free selective inference under covariate shift via weighted conformal p-values

A decision theoretic approach to model evaluation in computational drug discovery

Efficient Biological Data Acquisition through Inference Set Design

Novel Big Data-Driven Machine Learning Models for Drug Discovery Application

A decision-theoretic approach to the evaluation of machine learning algorithms in computational drug discovery

Inferring Interactions Between Novel Drugs and Novel Targets Via Instance-Neighborhood-Based Models

Optimized Conformal Selection: Powerful Selective Inference After Conformity Score Optimization

Concepts and Applications of Conformal Prediction in Computational Drug Discovery

Robust inference via knockoffs

Consensus models for CDK5 inhibitors in silico and their application to inhibitor discovery

Feature selection and transduction for prediction of molecular bioactivity for drug design

Predictive validity in drug discovery: what it is, why it matters and how to improve it

Leveraging conformal prediction to annotate enzyme function space with limited false positives

A Flexible Approach for Predictive Biomarker Discovery

Panning for gold: ‘model-X’ knockoffs for high dimensional controlled variable selection

Low Data Drug Discovery with One-Shot Learning

Machine learning assisted hit prioritization for high throughput screening in drug discovery