Large language models design sequence-defined macromolecules via evolutionary optimization

Wesley Reinhart, Antonia Statt
DOI: https://doi.org/10.26434/chemrxiv-2024-9146h
2024-09-04
Abstract: We demonstrate the ability of a large language model to perform evolutionary optimization for materials discovery. Anthropic's Claude 3.5 model outperforms an active learning scheme with handcrafted surrogate models and an evolutionary algorithm in selecting monomer sequences to produce targeted morphologies in macromolecular self-assembly. Utilizing pre-trained language models can potentially reduce the need for hyperparameter tuning while offering new capabilities such as self-reflection. The model performs this task effectively with or without context about the task itself, but domain-specific context sometimes results in faster convergence to good solutions. Furthermore, when this context is withheld, the model infers an approximate notion of the task (e.g., calling it a protein folding problem). This work provides evidence of Claude 3.5's ability to act as an evolutionary optimizer, a recently discovered emergent behavior of large language models, and demonstrates a practical use case in the study and design of soft materials.
Chemistry
What problem does this paper attempt to address?
This paper addresses a problem in materials science: how to design sequence-defined macromolecules with specific morphologies through evolutionary optimization. Specifically, the authors use a large language model (Anthropic's Claude 3.5) to select monomer sequences that produce self-assembled macromolecular structures with a target morphology. The task resembles the inverse protein folding problem: a sequence is designed to yield a required final structure, rather than a structure being predicted from a known sequence. The main contribution is showing that a pre-trained language model can perform evolutionary optimization effectively without fine-tuning on domain-specific data, and that it outperforms both an active learning scheme with handcrafted surrogate models and an evolutionary algorithm. This indicates that large language models can be applied not only to natural-language tasks but also to complex technical problems such as optimization in materials design. The study also examines how providing domain-specific context affects model performance and finds that, although context sometimes accelerates convergence to good solutions, it does not always improve final performance.
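To make the idea of sequence selection by evolutionary optimization concrete, here is a minimal Python sketch of such a loop. It is not the authors' implementation. The two-letter monomer alphabet, the sequence length, the population sizes, and the scoring function `toy_fitness` are all assumptions chosen for illustration, and `mutate_proposer` is a hypothetical random-mutation stand-in for the step where a language model would be shown scored sequences and asked to propose new ones.

```python
import random
from typing import Callable, List, Tuple

Scored = Tuple[str, float]  # (monomer sequence, score)


def toy_fitness(seq: str) -> float:
    """Hypothetical score: fraction of adjacent monomer pairs that differ.

    A real study would score the self-assembled morphology instead.
    """
    if len(seq) < 2:
        return 0.0
    return sum(a != b for a, b in zip(seq, seq[1:])) / (len(seq) - 1)


def mutate_proposer(parents: List[Scored], n_children: int,
                    rng: random.Random) -> List[str]:
    """Placeholder for the proposal step (assumed, not from the paper).

    Flips one random monomer in a randomly chosen parent. An LLM-based
    proposer would receive the scored parents and return new sequences.
    """
    children = []
    for _ in range(n_children):
        seq, _score = rng.choice(parents)
        i = rng.randrange(len(seq))
        flipped = "B" if seq[i] == "A" else "A"
        children.append(seq[:i] + flipped + seq[i + 1:])
    return children


def evolve(fitness: Callable[[str], float],
           propose: Callable[[List[Scored], int, random.Random], List[str]],
           length: int = 20, pop_size: int = 8, n_children: int = 8,
           generations: int = 30, seed: int = 0) -> Tuple[Scored, List[float]]:
    """Elitist loop: propose children, score them, keep the best pop_size."""
    rng = random.Random(seed)
    population: List[Scored] = []
    for _ in range(pop_size):
        seq = "".join(rng.choice("AB") for _ in range(length))
        population.append((seq, fitness(seq)))

    history = []  # best score after each generation
    for _ in range(generations):
        children = propose(population, n_children, rng)
        population += [(c, fitness(c)) for c in children]
        population.sort(key=lambda item: item[1], reverse=True)
        population = population[:pop_size]
        history.append(population[0][1])
    return population[0], history


if __name__ == "__main__":
    best, history = evolve(toy_fitness, mutate_proposer)
    print("best sequence:", best[0], "score:", best[1])
```

Because the proposal step is passed in as a function, the same loop could be run with a random-mutation baseline or with a proposer that queries a language model, which is the kind of comparison the summary describes. Parents are kept alongside children during selection, so the best score never decreases from one generation to the next.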