Scaffold Splits Overestimate Virtual Screening Performance

Qianrong Guo, Saiveth Hernandez-Hernandez, Pedro J Ballester

2024-06-30

Abstract: Virtual Screening (VS) of vast compound libraries guided by Artificial Intelligence (AI) models is a highly productive approach to early drug discovery. Data splitting is crucial for better benchmarking of such AI models. Traditional random data splits produce similar molecules between training and test sets, conflicting with the reality of VS libraries, which mostly contain structurally distinct compounds. The scaffold split, which groups molecules by shared core structure, is widely considered to reflect this real-world scenario. However, here we show that the scaffold split also overestimates VS performance. The reason is that molecules with different chemical scaffolds are often similar, which introduces unrealistically high similarities between training and test molecules following a scaffold split. Our study examined three representative AI models on 60 NCI-60 datasets, each with approximately 30,000 to 50,000 molecules tested on a different cancer cell line. Each dataset was split with three methods: scaffold, Butina clustering, and the more accurate Uniform Manifold Approximation and Projection (UMAP) clustering. Regardless of the model, performance is much worse with UMAP splits, as shown by the results of the 2100 models trained and evaluated for each algorithm and split. These robust results demonstrate the need for more realistic data splits to tune, compare, and select models for VS. For the same reason, avoiding the scaffold split is also recommended for other molecular property prediction problems. The code to reproduce these results is available at https://github.com/ScaffoldSplitsOverestimateVS

Quantitative Methods; Artificial Intelligence; Computational Engineering, Finance, and Science; Machine Learning; Biomolecules

What problem does this paper attempt to address?

The paper examines how the choice of data splitting method affects the evaluation of AI models for Virtual Screening (VS), focusing on the widely used scaffold split. Using AI models to guide the virtual screening of large compound libraries is an efficient approach to early drug discovery, but evaluating such models reliably requires an appropriate data splitting strategy. A random split places similar molecules in both the training and test sets, which does not match real VS libraries, whose compounds are mostly structurally distinct. The scaffold split was proposed to better simulate this scenario: molecules are grouped by shared core structure, and each group is assigned entirely to either the training set or the test set.

Although the scaffold split is widely considered to reflect real conditions, the authors find that it still overestimates model performance in virtual screening, because molecules with different scaffolds are often similar to one another. As a more realistic alternative, they propose a split based on Uniform Manifold Approximation and Projection (UMAP) clustering. They compare three representative AI models (linear regression, random forest, and the pre-trained graph neural network GEM) under scaffold, Butina clustering, and UMAP clustering splits. Performance declined markedly for all three models under the UMAP split. Under this more realistic split, GEM's hit rate and other key indicators were nonetheless superior to those of the random forest model.

In summary, the paper asks how to evaluate more accurately the true performance of AI models used for virtual screening, especially given the diversity of chemical space. By comparing data splitting methods, it exposes the limitations of the scaffold split and proposes a split closer to real-world use, providing a reference for the future development and evaluation of models.
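
The idea behind group-based splits, and the reason a scaffold split can still leak similar molecules into the test set, can be illustrated with a minimal sketch. It is not the authors' code: the function names (`group_split`, `tanimoto`, `max_similarity_to_train`), the toy fingerprints, and the scaffold labels are all hypothetical. Each molecule is assumed to be already represented as a set of fingerprint bit indices and to carry a precomputed group label (a scaffold ID or a cluster ID), which in practice would come from a cheminformatics toolkit.

```python
from collections import defaultdict
import random


def group_split(groups, test_fraction=0.2, seed=0):
    """Split item indices so that no group appears in both train and test.

    `groups[i]` is the group label of item i (a scaffold ID or a cluster ID).
    Labels are assumed to be mutually comparable (e.g. all strings).
    """
    members = defaultdict(list)
    for index, label in enumerate(groups):
        members[label].append(index)

    labels = sorted(members)  # deterministic order before shuffling
    random.Random(seed).shuffle(labels)

    target = test_fraction * len(groups)
    train, test = [], []
    for label in labels:
        # Fill the test set with whole groups until it reaches the target size.
        if len(test) < target:
            test.extend(members[label])
        else:
            train.extend(members[label])
    return sorted(train), sorted(test)


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def max_similarity_to_train(fingerprints, train, test):
    """For each test item, the highest Tanimoto similarity to any training item."""
    return {
        t: max(tanimoto(fingerprints[t], fingerprints[r]) for r in train)
        for t in test
    }


# Toy data: hypothetical fingerprint bit sets and hypothetical scaffold labels.
fingerprints = [
    {1, 2, 3, 4},   # scaffold "A"
    {1, 2, 3, 5},   # scaffold "A"
    {1, 2, 3, 6},   # scaffold "B", yet close to the "A" molecules
    {10, 11, 12},   # scaffold "C"
    {10, 11, 13},   # scaffold "C"
    {20, 21, 22},   # scaffold "D"
]
scaffolds = ["A", "A", "B", "C", "C", "D"]

train, test = group_split(scaffolds, test_fraction=0.3, seed=0)
leak = max_similarity_to_train(fingerprints, train, test)
```

In the toy data, the molecule labelled "B" has a different scaffold from the "A" molecules but shares most of their fingerprint bits. Whenever "A" and "B" land on opposite sides of the split, `leak` reports a high train-test similarity even though no scaffold is shared. Passing cluster labels instead of scaffold labels to `group_split` turns the same routine into a clustering-based split.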
