Data-centric challenges with the application and adoption of artificial intelligence for drug discovery

Ghita Ghislat, Saiveth Hernandez-Hernandez, Chayanit Piyawajanusorn, Pedro J. Ballester
2024-09-25
Abstract: Introduction: Artificial intelligence (AI) is exhibiting tremendous potential to reduce the massive costs and long timescales of drug discovery. There are, however, important challenges currently limiting the impact and scope of AI models. Areas covered: In this perspective, the authors discuss a range of data issues (bias, inconsistency, skewness, irrelevance, small size, high dimensionality), how they challenge AI models, and which issue-specific mitigations have been effective. Next, they point out the challenges faced by uncertainty quantification techniques aimed at enhancing and trusting the predictions from these AI models. They also discuss how conceptual errors, unrealistic benchmarks and performance misestimation can confound the evaluation of models and thus their development. Lastly, the authors explain how human bias, whether from AI experts or drug discovery experts, constitutes another challenge that can be alleviated by gaining more prospective experience. Expert opinion: AI models are often developed to excel on retrospective benchmarks unlikely to anticipate their prospective performance. As a result, only a few of these models are ever reported to have prospective value (e.g. by discovering potent and innovative drug leads for a therapeutic target). The authors have discussed what can go wrong in practice with AI for drug discovery. We hope that this will help inform the decisions of editors, funders, investors and researchers working in this area.
Other Quantitative Biology
What problem does this paper attempt to address?
The paper addresses key obstacles to applying artificial intelligence (AI) in drug discovery: data problems, uncertainty quantification, model evaluation, and researcher bias. Specifically:

1. **Data problems**:
   - **Biased data**: Even under ideal circumstances, the instances in a dataset may sample the label distribution unevenly, so the trained model generalizes poorly to unseen regions.
   - **Inconsistent data**: Data generated by different laboratories can differ in device calibration or sample preparation, which degrades model generalization.
   - **Skewed data**: Especially in early-stage drug discovery, active molecules (the minority class) are much rarer than inactive molecules (the majority class), resulting in an imbalanced dataset.
   - **Irrelevant data**: Feature selection may include features that are irrelevant to the prediction, which harms model performance.
   - **Small-sized data**: With too few samples, supervised learning algorithms struggle to predict other samples accurately.
   - **High-dimensional data**: In biomarker discovery, the number of features far exceeds the number of samples, which makes generalization harder.
2. **Uncertainty quantification**:
   - Quantifying the uncertainty of a prediction, that is, its reliability, is crucial for decision-making. For example, estimating prediction uncertainty with Gaussian processes (GP) or conformal prediction (CP) can help select more reliable molecules.
3. **Model evaluation**:
   - **Conceptual errors**: The concept of overfitting is often misinterpreted. For example, a model that performs well on the training set but poorly on the test set is commonly judged untrustworthy.
   - **Performance misestimation**: Evaluating a model with inappropriate metrics or benchmarks can misestimate its performance. For example, ROC-AUC is not a suitable metric for highly imbalanced datasets.
   - **Unrealistic benchmarks**: Many benchmarks are too idealized to reflect how a model performs in practical applications.
4. **Researchers' bias**:
   - **Bias of AI experts**: AI experts tend to assume that any problem can be solved with the right learning algorithm, yet they often lack domain knowledge, which leads to over-hyping of AI applications.
   - **Bias of drug discovery experts**: Drug discovery experts are often skeptical of AI applications, fearing that their work will become unimportant or redundant. This defensive attitude actually exacerbates future uncertainties.

By discussing these issues, the authors hope to provide guidance for editors, funders, investors, and researchers to better understand and address the challenges of AI in drug discovery.
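The point about ROC-AUC on highly imbalanced data can be illustrated with a small sketch. The data below are synthetic and purely an assumption for illustration (10 actives among 1000 molecules, each active outscored by 49 inactives); the helper names `roc_auc` and `average_precision` are hypothetical and are not taken from the paper.

```python
def roc_auc(labels, scores):
    """Probability that a random active outscores a random inactive (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each active."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Synthetic screen (assumed, not real data): 990 inactives with evenly
# spread scores and 10 actives that all score 0.95.
n_inactive, n_active = 990, 10
labels = [0] * n_inactive + [1] * n_active
scores = [i / n_inactive for i in range(n_inactive)] + [0.95] * n_active

auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)

# Number of actives among the 49 top-ranked molecules.
top = sorted(zip(scores, labels), key=lambda t: -t[0])[:49]
actives_in_top = sum(y for _, y in top)

print(f"ROC-AUC = {auc:.3f}, average precision = {ap:.3f}, "
      f"actives in top 49 = {actives_in_top}")
```

In this toy setting the ROC-AUC is about 0.95, which looks excellent, while the average precision is only about 0.10 and none of the 49 top-ranked molecules is active. A precision-oriented metric exposes the weak early enrichment that ROC-AUC hides when inactives dominate.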