Data Valuation: A novel approach for analyzing high throughput screen data using machine learning

Joshua Hesse, Davide Boldini, Stephan Sieber
DOI: https://doi.org/10.26434/chemrxiv-2023-wlzlc
2023-12-12
Abstract: In the rapidly evolving field of drug discovery, high-throughput screening (HTS) is a pivotal technique for identifying promising compounds. Despite its wide use, the primary challenge remains efficiently sifting through vast chemical libraries to distinguish true bioactive compounds from false positives. This study introduces a novel application of machine learning data valuation methods to address this challenge, offering a multi-faceted approach to improving drug discovery pipelines. Our strategy comprises three parts: enhancing active learning for efficient compound library screening, robustly identifying false and true positives in primary HTS data, and optimizing HTS datasets for machine learning applications through targeted undersampling. We demonstrate that influence-based methods enable more effective batch screening of chemical libraries, thereby reducing the need for extensive HTS, and provide significant improvements over current false positive detection techniques. This is achieved by employing machine learning models that accurately distinguish true biological activity from assay artifacts, thereby streamlining the drug discovery process. Furthermore, our method applies smart undersampling to balance HTS datasets, enhancing the performance of machine learning algorithms without the risk of omitting crucial inactive samples. The implications of these developments are far-reaching, offering a potential paradigm shift in the efficiency and accuracy of drug development. We provide a benchmarking platform to facilitate the application of these methods, ensuring easy integration and modification for a broad range of datasets, thus propelling the scientific community towards more effective drug discovery methodologies (available on GitHub at https://github.com/JoshuaHesse/DataValuationPlatform).
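As a rough illustration of how data valuation can flag false positives among primary HTS actives, the sketch below scores each training compound by leave-one-out (LOO) valuation: the change in a validation utility when that compound is removed. This is not the authors' method or the API of their platform. LOO is used here as a simple stand-in for the influence-based methods the abstract mentions, and the nearest-centroid model, the margin-based utility, the two-dimensional synthetic "descriptors", and all function names are assumptions made for this example.

```python
import math
import random


def centroid(points):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]


def margin_utility(train, val):
    """Toy utility (assumption): mean signed margin of a nearest-centroid model.

    train and val are lists of (features, label) with label 1 = active.
    A positive margin means a validation compound lies closer to the
    centroid of its own class than to the other class centroid.
    """
    c_inactive = centroid([x for x, y in train if y == 0])
    c_active = centroid([x for x, y in train if y == 1])
    total = 0.0
    for x, y in val:
        d_inactive = math.dist(x, c_inactive)
        d_active = math.dist(x, c_active)
        total += (d_inactive - d_active) if y == 1 else (d_active - d_inactive)
    return total / len(val)


def loo_values(train, val):
    """Leave-one-out value of sample i: U(all data) - U(all data without i)."""
    full = margin_utility(train, val)
    return [full - margin_utility(train[:i] + train[i + 1:], val)
            for i in range(len(train))]


def make_toy_screen(seed=0):
    """Synthetic screen (assumption): two Gaussian clouds plus one artifact."""
    rng = random.Random(seed)

    def cloud(center, n, label):
        return [([rng.gauss(center, 0.5), rng.gauss(center, 0.5)], label)
                for _ in range(n)]

    train = cloud(0.0, 40, 0) + cloud(4.0, 10, 1)
    # Planted false positive: labeled active, but located among the inactives.
    train.append(([0.0, 0.0], 1))
    # Validation set stands in for confirmed (clean) labels.
    val = cloud(0.0, 40, 0) + cloud(4.0, 10, 1)
    return train, val


train, val = make_toy_screen()
values = loo_values(train, val)
actives = [i for i, (_, y) in enumerate(train) if y == 1]
# Primary actives with the lowest value are the most suspicious.
ranked = sorted(actives, key=lambda i: values[i])
print("primary actives, lowest value first:", ranked[:3])
```

In this toy setup the planted artifact (index 50) pulls the active centroid toward the inactive cloud, so removing it raises the utility and its LOO value is negative. The same ranking idea could in principle guide undersampling by discarding the least valuable inactives first, but LOO retrains the model once per sample and would not scale to real HTS libraries without a cheaper approximation.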
Chemistry