Robust Model-Based Optimization for Challenging Fitness Landscapes

Saba Ghaffari,Ehsan Saleh,Alexander G. Schwing,Yu-Xiong Wang,Martin D. Burke,Saurabh Sinha
2024-06-28
Abstract:Protein design, a grand challenge of the day, involves optimization on a fitness landscape, and leading methods adopt a model-based approach where a model is trained on a training set (protein sequences and fitness) and proposes candidates to explore next. These methods are challenged by sparsity of high-fitness samples in the training set, a problem that has been in the literature. A less recognized but equally important problem stems from the distribution of training samples in the design space: leading methods are not designed for scenarios where the desired optimum is in a region that is not only poorly represented in training data, but also relatively far from the highly represented low-fitness regions. We show that this problem of "separation" in the design space is a significant bottleneck in existing model-based optimization tools and propose a new approach that uses a novel VAE as its search model to overcome the problem. We demonstrate its advantage over prior methods in robustly finding improved samples, regardless of the imbalance and separation between low- and high-fitness samples. Our comprehensive benchmark on real and semi-synthetic protein datasets as well as solution design for physics-informed neural networks, showcases the generality of our approach in discrete and continuous design spaces. Our implementation is available at <a class="link-external link-https" href="https://github.com/sabagh1994/PGVAE" rel="external noopener nofollow">this https URL</a>.
Machine Learning,Artificial Intelligence
What problem does this paper attempt to address?