Bioptic -- A Target-Agnostic Potency-Based Small Molecules Search Engine

Vlad Vinogradov, Ivan Izmailov, Simon Steshin, Kong T. Nguyen
2024-07-01
Abstract: Recent successes in virtual screening have been made possible by large models and extensive chemical libraries. However, combining these elements is challenging: the larger the model, the more expensive it is to run, making it infeasible to screen ultra-large libraries. To address this, we developed a target-agnostic, efficacy-based molecule search model, which allows us to find structurally dissimilar molecules with similar biological activities. We used best practices to design a fast retrieval system based on processor-optimized SIMD instructions, enabling us to screen the ultra-large 40B Enamine REAL library with a 100% recall rate. We extensively benchmarked our model and several state-of-the-art models for both speed and retrieval quality of novel molecules.
Quantitative Methods,Artificial Intelligence,Information Retrieval
What problem does this paper attempt to address?
The paper addresses the inefficiency of searching ultra-large molecular libraries during virtual screening. Drug discovery proceeds through multiple stages, and virtual screening is the stage that uses statistical algorithms to pick potentially active molecules out of a large pool of candidates. Advances in chemical synthesis and automation have produced libraries containing billions of molecules, but large models are expensive to run, which makes screening libraries of that size impractical.
To address this, the authors developed a target-agnostic, efficacy-based molecule search model that finds molecules that are structurally different but have similar biological activity. They built a fast retrieval system on processor-optimized SIMD instructions and used it to screen the 40-billion-molecule Enamine REAL library with a 100% recall rate. GPUs are used for preprocessing and CPUs for search, and the system is reported to search libraries containing billions of molecules in seconds.
The paper compares the model with other state-of-the-art models, such as Deep Docking, DrugClip, and Chemprop, on both speed and retrieval quality. It emphasizes that the model is global and target-agnostic: it can search for activity-similar molecules for all possible targets at once, without retraining for each target. The paper also discusses how the query selection strategy affects model performance. Overall, it aims to make virtual screening more efficient, and so accelerate drug discovery, by applying best practices from recommendation systems and search engines to large-scale molecular libraries.
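To illustrate why an exhaustive search can reach a 100% recall rate, here is a minimal sketch of exact top-k retrieval over precomputed molecule embeddings. It is not the authors' implementation: the function names (`normalize`, `exact_top_k`), the use of cosine similarity, the embedding size, and the chunked NumPy scan are all assumptions made for illustration. NumPy hands the dot products to vectorized numerical kernels, which stands in here for the hand-tuned SIMD code the paper describes.

```python
import numpy as np


def normalize(x):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def exact_top_k(query, library, k, chunk_size=4096):
    """Return (indices, scores) of the k library rows most similar to `query`.

    Every library row is scored, so no true neighbor can be missed
    (unlike approximate nearest-neighbor indexes). Only the running
    top k is kept in memory between chunks.
    """
    best_scores = np.empty(0, dtype=library.dtype)
    best_idx = np.empty(0, dtype=np.int64)

    for start in range(0, library.shape[0], chunk_size):
        chunk = library[start:start + chunk_size]
        scores = chunk @ query  # one similarity score per molecule in the chunk
        idx = np.arange(start, start + chunk.shape[0], dtype=np.int64)

        # Merge this chunk with the best candidates found so far.
        best_scores = np.concatenate([best_scores, scores])
        best_idx = np.concatenate([best_idx, idx])

        # Trim back to k candidates; argpartition avoids a full sort.
        if best_scores.shape[0] > k:
            keep = np.argpartition(-best_scores, k - 1)[:k]
            best_scores, best_idx = best_scores[keep], best_idx[keep]

    # Final ordering, most similar first.
    order = np.argsort(-best_scores)
    return best_idx[order], best_scores[order]
```

A call such as `exact_top_k(query_vec, library_matrix, k=10)` returns the ten highest-scoring library rows for one query embedding. Because the scan is linear in library size, a real system at the 40B scale would additionally need compact embeddings, sharding across machines, and multithreading, none of which is shown here.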