Abstract: High-throughput screening (HTS), as one of the key techniques in drug discovery, is frequently used to identify promising drug candidates in a largely automated and cost-effective way. One of the necessary conditions for successful HTS campaigns is a large and diverse compound library, enabling hundreds of thousands of activity measurements per project. Such collections of data hold great promise for computational and experimental drug discovery efforts, especially when leveraged in combination with modern deep learning techniques, and can potentially lead to improved drug activity predictions and cheaper and more effective experimental design. However, existing collections of machine-learning-ready public datasets do not exploit the multiple data modalities present in real-world HTS projects. Thus, the largest fraction of experimental measurements, corresponding to hundreds of thousands of "noisy" activity values from primary screening, are effectively ignored in the majority of machine learning models of HTS data. To address these limitations, we introduce Multifidelity PubChem BioAssay (MF-PCBA), a curated collection of 60 datasets that includes two data modalities for each dataset, corresponding to primary and confirmatory screening, an aspect that we call multifidelity. Multifidelity data accurately reflect real-world HTS conventions and present a new, challenging task for machine learning: the integration of low- and high-fidelity measurements through molecular representation learning, taking into account the orders-of-magnitude difference in size between the primary and confirmatory screens. Here we detail the steps taken to assemble MF-PCBA in terms of data acquisition from PubChem and the filtering steps required to curate the raw data.
We also provide an evaluation of a recent deep-learning-based method for multifidelity integration across the introduced datasets, demonstrating the benefit of leveraging all HTS modalities, and a discussion in terms of the roughness of the molecular activity landscape. In total, MF-PCBA contains over 16.6 million unique molecule-protein interactions. The datasets can be easily assembled by using the source code available at https://github.com/davidbuterez/mf-pcba.
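
To make the two-modality structure concrete, the following is a minimal sketch of how a single multifidelity dataset could be represented, assuming compounds are keyed by SMILES strings. The class `MultifidelityRecord`, the function `merge_fidelities`, and all numeric values are hypothetical and chosen for illustration only; they are not part of the MF-PCBA source code and are not taken from any PubChem assay.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MultifidelityRecord:
    """One compound in a hypothetical multifidelity table."""
    smiles: str
    primary_activity: float                 # low-fidelity value from primary screening
    confirmatory_activity: Optional[float]  # high-fidelity value; None if never confirmed


def merge_fidelities(
    primary: Dict[str, float], confirmatory: Dict[str, float]
) -> List[MultifidelityRecord]:
    """Left-join confirmatory values onto the primary screen, keyed by SMILES."""
    return [
        MultifidelityRecord(smiles, value, confirmatory.get(smiles))
        for smiles, value in primary.items()
    ]


# Toy values invented for illustration (assumption, not real assay data).
primary_screen = {"CCO": 12.5, "c1ccccc1": 48.0, "CC(=O)O": 3.1, "CCN": 51.7}
confirmatory_screen = {"c1ccccc1": 5.9, "CCN": 6.4}

records = merge_fidelities(primary_screen, confirmatory_screen)
n_high = sum(r.confirmatory_activity is not None for r in records)
print(f"{len(records)} low-fidelity labels, {n_high} high-fidelity labels")
```

In this sketch every compound carries a low-fidelity label, while only a subset carries a high-fidelity one. In real HTS projects that subset is orders of magnitude smaller than the primary screen, which is the imbalance a multifidelity model has to handle.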

MF-PCBA: Multifidelity High-Throughput Screening Benchmarks for Drug Discovery and Machine Learning

Mitigating Molecular Aggregation in Drug Discovery with Predictive Insights from Explainable AI

Enhanced Sampling of Chemical Space for High Throughput Screening Applications using Machine Learning

A decision-theoretic approach to the evaluation of machine learning algorithms in computational drug discovery

AMPL: A Data-Driven Modeling Pipeline for Drug Discovery

Machine learning in preclinical drug discovery

Practical Applications of Deep Learning To Impute Heterogeneous Drug Discovery Data

Novel Big Data-Driven Machine Learning Models for Drug Discovery Application

Deep Learning-Based Imbalanced Data Classification for Drug Discovery