NEBULA: Neural Empirical Bayes Under Latent Representations for Efficient and Controllable Design of Molecular Libraries

Ewa M. Nowara, Pedro O. Pinheiro, Sai Pooja Mahajan, Omar Mahmood, Andrew Martin Watkins, Saeed Saremi, Michael Maser
2024-07-04
Abstract: We present NEBULA, the first latent 3D generative model for scalable generation of large molecular libraries around a seed compound of interest. Such libraries are crucial for scientific discovery, but it remains challenging to generate large numbers of high quality samples efficiently. 3D-voxel-based methods have recently shown great promise for generating high quality samples de novo from random noise (Pinheiro et al., 2023). However, sampling in 3D-voxel space is computationally expensive and use in library generation is prohibitively slow. Here, we instead perform neural empirical Bayes sampling (Saremi & Hyvarinen, 2019) in the learned latent space of a vector-quantized variational autoencoder. NEBULA generates large molecular libraries nearly an order of magnitude faster than existing methods without sacrificing sample quality. Moreover, NEBULA generalizes better to unseen drug-like molecules, as demonstrated on two public datasets and multiple recently released drugs. We expect the approach herein to be highly enabling for machine learning-based drug discovery. The code is available at https://github.com/prescient-design/nebula
Machine Learning
What problem does this paper attempt to address?
The paper proposes NEBULA, a method for efficient and controllable generation of large molecular libraries around a given seed compound. Existing methods struggle to produce high-quality molecular samples efficiently. 3D voxel-based methods such as VoxMol generate high-quality samples, but sampling directly in voxel space is computationally expensive and too slow for generating large libraries. NEBULA instead performs neural empirical Bayes (NEB) sampling in the low-dimensional latent space of voxelized molecules, which reduces generation cost and improves control over the generated samples while maintaining sample quality. Compared with sampling directly in 3D voxel space, it is nearly an order of magnitude faster. NEBULA also generalizes better to unseen drug-like molecules.
The main contributions of the paper are as follows:
1. NEBULA is a latent-space generative model that generates new molecules nearly an order of magnitude faster than existing 3D generative models while maintaining sample quality.
2. NEBULA scales to very large molecular libraries, producing stable, unique, and valid (SUV) molecules.
3. NEBULA generalizes better than existing techniques to new chemical spaces, including recently released drugs.
NEBULA works by compressing the voxelized molecule into a low-dimensional latent space with a vector-quantized variational autoencoder, adding noise to the latent vector of the seed molecule, and sampling around that noisy latent. New latents are generated with walk-jump sampling (WJS): Langevin Markov chain Monte Carlo (MCMC) explores the noisy latent space (the walk), and a denoising step maps each sample back to a clean latent (the jump), which is then decoded into a molecule.
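The walk-jump procedure described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: `toy_denoiser` is a hypothetical closed-form stand-in for a trained latent denoising network (it assumes clean latents are Gaussian), `walk_jump_sample` and all parameter values are made up for the example, and the walk uses simple overdamped Langevin dynamics. The VQ-VAE encoder and decoder that would surround this step are not shown.

```python
import numpy as np


def toy_denoiser(y, mu, s2, sigma):
    """Hypothetical stand-in for a learned denoiser.

    Assumes clean latents z ~ N(mu, s2 * I) observed as y = z + sigma * eps,
    for which the least-squares denoiser E[z | y] has this closed form.
    """
    return y + sigma**2 * (mu - y) / (s2 + sigma**2)


def walk_jump_sample(z_seed, denoise, sigma, n_steps=50, step=0.1, rng=None):
    """Sample one latent near z_seed with a simplified walk-jump scheme."""
    rng = np.random.default_rng() if rng is None else rng
    # Noise the seed latent to obtain the starting point of the chain.
    y = z_seed + sigma * rng.standard_normal(z_seed.shape)
    for _ in range(n_steps):
        # Score of the smoothed density, recovered from the denoiser
        # via the empirical Bayes identity: denoise(y) = y + sigma^2 * score(y).
        score = (denoise(y) - y) / sigma**2
        # Walk: one overdamped Langevin MCMC step in the noisy latent space.
        y = y + step * score + np.sqrt(2.0 * step) * rng.standard_normal(y.shape)
    # Jump: denoise the final noisy sample back to a clean latent.
    return denoise(y)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, sigma = 8, 1.0          # illustrative values only
    mu = np.zeros(dim)           # mean of the toy latent distribution
    z_seed = np.ones(dim)        # latent of a hypothetical seed molecule
    denoise = lambda y: toy_denoiser(y, mu, s2=0.25, sigma=sigma)
    # A small "library" of latents sampled around the same seed.
    library = np.stack(
        [walk_jump_sample(z_seed, denoise, sigma, rng=rng) for _ in range(5)]
    )
    print(library.shape)
```

In this sketch, fewer walk steps keep samples closer to the noised seed and more steps let the chain drift further from it, which is one plausible way to picture the control over similarity to the seed that the summary mentions.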
Experiments show that NEBULA rapidly generates stable and valid molecules that remain similar to the seed molecules, and that it generalizes across datasets better than VoxMol, the current state-of-the-art method. NEBULA therefore has the potential to accelerate machine learning-driven drug discovery.