Enhanced Thompson Sampling by Roulette Wheel Selection for Screening Ultra-Large Combinatorial Libraries

Hongtao Zhao, Eva Nittinger, Christian Tyrchan
DOI: https://doi.org/10.1101/2024.05.16.594622
2024-05-21
Abstract: Chemical space exploration has gained significant interest with the increase in available building blocks, which enables the creation of ultra-large virtual libraries containing billions or even trillions of compounds. However, the challenge of selecting the most suitable compounds for synthesis arises, and one such challenge is hit expansion. Recently, Thompson sampling, a probabilistic search approach, has been proposed by Walters et al. to achieve efficiency gains by operating in the reagent space rather than the product space. Here, we aim to address some of its shortcomings and propose optimizations. We introduce a warmup routine to ensure that initial probabilities are set for all reagents with a minimum number of molecules evaluated. Additionally, a roulette wheel selection is proposed with adapted stop criteria to improve sampling efficiency, and belief distributions of reagents are only updated when they appear in new molecules. We demonstrate that a 100% recovery rate can be achieved by sampling 0.1% of the fully enumerated library, showcasing the effectiveness of our proposed optimizations.
Bioinformatics
What problem does this paper attempt to address?
The paper aims to make Thompson sampling more efficient for screening ultra-large combinatorial libraries. The authors propose optimizations that address some shortcomings of the original method. The main improvements are:
1. **Introduction of a Warmup Mechanism**: A warmup routine sets an initial probability for every reagent while evaluating a minimum number of molecules, which avoids the situation where some reagents are never selected.
2. **Roulette Wheel Selection**: Roulette wheel selection replaces the original greedy selection strategy to improve sampling efficiency. It is combined with adapted stopping criteria, and the belief distribution of a reagent is updated only when that reagent appears in a new molecule.

With these improvements, the researchers demonstrated that a 100% recovery rate can be achieved by sampling only 0.1% of the fully enumerated library, significantly outperforming the original method. The approach is particularly suited to hit expansion tasks in drug discovery, where it identifies high-quality compounds while evaluating only a small fraction of the library.
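The three ideas above (warmup, roulette wheel selection, and updating beliefs only for new molecules) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Belief` class (a running mean with shrinking noise), the softmax weighting with a fixed `temperature`, the fixed iteration budget used in place of the paper's stop criteria, and all function names are hypothetical. Higher scores are assumed to be better.

```python
import math
import random


def roulette_pick(weights, rng):
    """Roulette wheel selection: index i is drawn with probability weights[i] / sum(weights)."""
    total = sum(weights)
    spin = rng.random() * total
    cumulative = 0.0
    for i, w in enumerate(weights):
        cumulative += w
        if spin < cumulative:
            return i
    return len(weights) - 1  # guard against floating-point round-off


class Belief:
    """Hypothetical per-reagent belief: running mean and count of observed scores."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, score):
        self.n += 1
        self.mean += (score - self.mean) / self.n

    def sample(self, rng, sigma=1.0):
        # The spread shrinks as more molecules containing this reagent are scored.
        return rng.gauss(self.mean, sigma / math.sqrt(max(self.n, 1)))


def search(sizes, score_fn, n_iter, temperature=0.1, seed=0):
    """Sample reagent combinations; sizes[k] is the number of reagents in component k."""
    rng = random.Random(seed)
    beliefs = [[Belief() for _ in range(s)] for s in sizes]
    seen = {}  # molecule (tuple of reagent indices) -> score

    def evaluate(combo):
        if combo in seen:
            return  # repeated molecule: beliefs are left untouched
        seen[combo] = score_fn(combo)
        for comp, reagent in enumerate(combo):
            beliefs[comp][reagent].update(seen[combo])

    # Warmup: every reagent ends up in at least one scored molecule.
    for comp, size in enumerate(sizes):
        for reagent in range(size):
            combo = [rng.randrange(s) for s in sizes]
            combo[comp] = reagent
            evaluate(tuple(combo))

    # Search: draw from each belief, then spin the wheel instead of taking the argmax.
    for _ in range(n_iter):
        combo = []
        for comp_beliefs in beliefs:
            draws = [b.sample(rng) for b in comp_beliefs]
            top = max(draws)
            # Assumed softmax weighting; subtracting the maximum keeps exp() stable.
            weights = [math.exp((d - top) / temperature) for d in draws]
            combo.append(roulette_pick(weights, rng))
        evaluate(tuple(combo))
    return seen, beliefs
```

A toy call such as `search((20, 20), lambda c: -(abs(c[0] - 7) + abs(c[1] - 13)), n_iter=200)` returns the scored molecules and the final beliefs. Because the wheel gives every reagent a non-zero chance of being drawn, the search keeps exploring where a pure argmax over the sampled values would tend to revisit the same few reagents.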