LLMs are Highly-Constrained Biophysical Sequence Optimizers

Angelica Chen, Samuel D. Stanton, Robert G. Alberstein, Andrew M. Watkins, Richard Bonneau, Vladimir Gligorijević, Kyunghyun Cho, Nathan C. Frey
2024-11-01
Abstract: Large language models (LLMs) have recently shown significant potential in various biological tasks such as protein engineering and molecule design. These tasks typically involve black-box discrete sequence optimization, where the challenge lies in generating sequences that are not only biologically feasible but also adhere to hard fine-grained constraints. However, LLMs often struggle with such constraints, especially in biological contexts where verifying candidate solutions is costly and time-consuming. In this study, we explore the possibility of employing LLMs as highly-constrained bilevel optimizers through a methodology we refer to as Language Model Optimization with Margin Expectation (LLOME). This approach combines both offline and online optimization, utilizing limited oracle evaluations to iteratively enhance the sequences generated by the LLM. We additionally propose a novel training objective -- Margin-Aligned Expectation (MargE) -- that trains the LLM to smoothly interpolate between the reward and reference distributions. Lastly, we introduce a synthetic test suite that bears strong geometric similarity to real biophysical problems and enables rapid evaluation of LLM optimizers without time-consuming lab validation. Our findings reveal that, in comparison to genetic algorithm baselines, LLMs achieve significantly lower regret solutions while requiring fewer test function evaluations. However, we also observe that LLMs exhibit moderate miscalibration, are susceptible to generator collapse, and have difficulty finding the optimal solution when no explicit ground truth rewards are available.
Machine Learning, Quantitative Methods
### What problem does this paper attempt to address?

This paper addresses the challenges that large language models (LLMs) face in biophysical sequence optimization tasks, especially tasks with strict, fine-grained constraints. Specifically:

1. **Generate biophysical sequences that meet strict constraints**: Biophysical optimization tasks (such as protein engineering and molecule design) typically require generating discrete sequences \(x \in X\) that are both biologically feasible and satisfy specific constraints (for example, containing specific motifs). Existing LLMs often struggle with such strict constraints, especially in biological contexts, where validating candidate solutions is costly and time-consuming.
2. **Combine offline and online optimization**: In practice, laboratory verification can provide only a limited number of oracle labels (i.e., real evaluation results). The optimizer must therefore be able to generate, rank, and screen candidate sequences without immediate access to the oracle. Most existing methods rely only on oracle labels or only on simulation-based metrics, whereas the method proposed in this paper combines offline and online optimization to stay closer to the actual application scenario.
3. **Introduce a new training objective**: To adapt LLMs to this highly-constrained optimization problem, the authors propose a new training objective, Margin-Aligned Expectation (MargE). This objective trains the LLM to smoothly interpolate between the reward and reference distributions, so that it can search for the optimal solution more effectively.
4. **Create a synthetic test suite**: To evaluate LLM optimizers quickly without relying on time-consuming laboratory verification, the authors design a synthetic test suite that bears strong geometric similarity to real biophysical problems. The suite reflects the non-additive, epistatic structure of real biophysical optimization problems.

### Main contributions

1. **Synthetic test suite**: A set of closed-form test functions for evaluating highly-constrained biophysical sequence optimization problems.
2. **LLMs as constrained bilevel optimizers**: The LLOME (Language Model Optimization with Margin Expectation) method embeds an LLM in a bilevel optimization loop. Given a fixed evaluation budget, the LLM produces solutions with lower regret than the genetic algorithm baselines.
3. **New LLM training objective (MargE)**: A training objective that lets the LLM use reward margin information more effectively during training, and thus find the optimal solution more quickly.

Through these methods, the paper aims to improve the performance of LLMs on biophysical optimization tasks, especially under strict constraints and sparse feedback.
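
To make the evaluation setting concrete, here is a minimal Python sketch of a motif-constrained objective optimized under a fixed evaluation budget, with regret measured against the known optimum. Everything in it is an illustrative assumption: `ALPHABET`, `MOTIF`, `BANNED`, `toy_objective`, `mutate`, and `optimize` are hypothetical names, the objective is not one of the paper's closed-form test functions, and the random single-site mutation proposer merely stands in for a sequence generator such as an LLM. It does not implement LLOME or MargE, and it omits the offline ranking step.

```python
import random

# Hypothetical toy setup (not taken from the paper).
ALPHABET = "ACDE"   # small illustrative vocabulary
MOTIF = "ACE"       # motif a good sequence should contain
BANNED = "DD"       # substring that makes a sequence infeasible


def toy_objective(seq: str) -> float:
    """Score in [0, 1]: 0.0 if infeasible, else the best fraction of
    MOTIF positions matched by any window of the sequence."""
    if BANNED in seq:
        return 0.0  # hard constraint: infeasible sequences get no reward
    best = 0
    for start in range(len(seq) - len(MOTIF) + 1):
        matched = sum(seq[start + i] == MOTIF[i] for i in range(len(MOTIF)))
        best = max(best, matched)
    return best / len(MOTIF)


def mutate(seq: str, rng: random.Random) -> str:
    """Replace one randomly chosen position with a random symbol."""
    pos = rng.randrange(len(seq))
    return seq[:pos] + rng.choice(ALPHABET) + seq[pos + 1:]


def optimize(start: str, rounds: int, batch_size: int, seed: int = 0):
    """Greedy search with a fixed budget of objective evaluations.

    Returns (best sequence, regret, number of objective calls), where
    regret = 1.0 - best score, since the optimum of toy_objective is 1.0.
    """
    rng = random.Random(seed)
    best_seq, best_val = start, toy_objective(start)
    calls = 1
    for _ in range(rounds):
        # Propose a batch of candidates from the current best sequence.
        candidates = [mutate(best_seq, rng) for _ in range(batch_size)]
        for cand in candidates:
            val = toy_objective(cand)  # each call spends one unit of budget
            calls += 1
            if val > best_val:
                best_seq, best_val = cand, val
    return best_seq, 1.0 - best_val, calls


if __name__ == "__main__":
    seq, regret, calls = optimize("AAAAAA", rounds=5, batch_size=4)
    print(seq, regret, calls)
```

Counting objective calls alongside regret mirrors the comparison described above: two optimizers are only comparable when they are given the same evaluation budget.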