Language models can generate molecules, materials, and protein binding sites directly in three dimensions as XYZ, CIF, and PDB files

Daniel Flam-Shepherd, Alán Aspuru-Guzik
2023-05-10
Abstract: Language models are powerful tools for molecular design. Currently, the dominant paradigm is to parse molecular graphs into linear string representations that can easily be trained on. This approach has been very successful, however, it is limited to chemical structures that can be completely represented by a graph -- like organic molecules -- while materials and biomolecular structures like protein binding sites require a more complete representation that includes the relative positioning of their atoms in space. In this work, we show how language models, without any architecture modifications, trained using next-token prediction -- can generate novel and valid structures in three dimensions from various substantially different distributions of chemical structures. In particular, we demonstrate that language models trained directly on sequences derived directly from chemical file formats like XYZ files, Crystallographic Information files (CIFs), or Protein Data Bank files (PDBs) can directly generate molecules, crystals, and protein binding sites in three dimensions. Furthermore, despite being trained on chemical file sequences -- language models still achieve performance comparable to state-of-the-art models that use graph and graph-derived string representations, as well as other domain-specific 3D generative models. In doing so, we demonstrate that it is not necessary to use simplified molecular representations to train chemical language models -- that they are powerful generative models capable of directly exploring chemical space in three dimensions for very different structures.
Machine Learning, Quantitative Methods
What problem does this paper attempt to address?
The paper explores whether language models can generate 3D chemical structures, including molecules, materials, and protein binding sites. Traditional approaches usually convert molecular graphs into linear string representations, such as SMILES or SELFIES, so that machine learning models can process them. However, these representations do not capture the relative positions of atoms in 3D space, which are crucial for many molecular design tasks (e.g., catalysis). The authors instead train language models directly on sequences derived from 3D chemical file formats: XYZ files, Crystallographic Information Files (CIFs), and Protein Data Bank files (PDBs). These files contain atomic coordinates and other relevant information, providing a complete description of the 3D structure of a molecule or material. The researchers trained Transformer-based language models to predict the next token (a character, or an atom symbol or coordinate value) in these file sequences, and then sampled from the models to generate new 3D structures.

Specifically, the work includes the following key points:

1. **Model and Training**: The authors train language models using two different tokenization strategies (character-level and atom+coordinate-level) and explain how augmenting the training data with rotated copies of each structure can improve model performance.
2. **Molecule Generation**: Using molecules from the ZINC dataset as a benchmark, the authors show that the language model can generate new molecules with reasonable 3D geometries. Performance was comparable to, or even better than, existing generative models based on graph and string representations.
3. **Crystal Generation**: For materials such as crystals, which cannot be represented by graphs, the authors tested the approach on the Perov5 and MP20 datasets. The results showed that the language model could generate structurally valid new crystalline materials and performed well on multiple evaluation metrics.
4. **Protein Binding Site Generation**: In the most complex task, generating biomolecular structures, the authors demonstrated that the language model could generate 3D structures of protein binding sites containing hundreds of atoms. Although the evaluation criteria for this part differ from those used for small drug-like molecules, the results indicate that the language model can generate new protein pockets whose geometry is similar to that of the training pockets.

In summary, the paper demonstrates that language models can generate various types of 3D chemical structures without architectural modifications and achieve performance comparable to generative models designed specifically for 3D structures. This makes them a powerful tool for exploring chemical space and suggests broad future applications in molecular and materials design.
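
To make the two tokenization strategies and the rotation augmentation described above more concrete, here is a minimal Python sketch. It is not the authors' code: the helper names (`parse_xyz`, `random_rotation`, `rotate`, `to_lines`, `char_tokens`, `atom_coord_tokens`), the two-decimal rounding, the use of a newline token as an atom separator, and the example water coordinates are all assumptions made for illustration.

```python
import math
import random

# Illustrative XYZ file content (approximate water geometry; an assumption).
XYZ_TEXT = """3
water
O 0.000 0.000 0.117
H 0.000 0.757 -0.469
H 0.000 -0.757 -0.469
"""

def parse_xyz(text):
    """Return a list of (element, (x, y, z)) tuples from XYZ-format text."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])          # line 1: atom count; line 2: comment
    atoms = []
    for line in lines[2:2 + n_atoms]:
        element, x, y, z = line.split()
        atoms.append((element, (float(x), float(y), float(z))))
    return atoms

def random_rotation(rng):
    """Rotation matrix R = Rz(a) Ry(b) Rx(c) from three random angles.

    Note: sampling Euler angles this way is simple but not uniform over
    all 3D rotations; it is only meant to illustrate the augmentation.
    """
    a, b, c = (rng.uniform(0.0, 2.0 * math.pi) for _ in range(3))
    ca, sa = math.cos(a), math.sin(a)
    cb, sb = math.cos(b), math.sin(b)
    cc, sc = math.cos(c), math.sin(c)
    return [
        [ca * cb, ca * sb * sc - sa * cc, ca * sb * cc + sa * sc],
        [sa * cb, sa * sb * sc + ca * cc, sa * sb * cc - ca * sc],
        [-sb, cb * sc, cb * cc],
    ]

def rotate(atoms, R):
    """Apply rotation matrix R to every atom position."""
    return [
        (el, tuple(sum(R[i][j] * xyz[j] for j in range(3)) for i in range(3)))
        for el, xyz in atoms
    ]

def to_lines(atoms, decimals=2):
    """Serialize atoms as 'El x y z' lines with fixed-precision coordinates."""
    return [
        f"{el} {x:.{decimals}f} {y:.{decimals}f} {z:.{decimals}f}"
        for el, (x, y, z) in atoms
    ]

def char_tokens(lines):
    """Character-level: every character (digits, signs, spaces) is a token."""
    return list("\n".join(lines))

def atom_coord_tokens(lines):
    """Atom+coordinate-level: one token per element symbol or coordinate."""
    tokens = []
    for line in lines:
        tokens.extend(line.split())
        tokens.append("\n")          # assumed separator between atoms
    return tokens

rng = random.Random(0)
atoms = parse_xyz(XYZ_TEXT)
augmented = rotate(atoms, random_rotation(rng))  # one rotated training copy

print(char_tokens(to_lines(atoms))[:8])
print(atom_coord_tokens(to_lines(augmented))[:5])
```

The character-level scheme keeps the vocabulary tiny at the cost of longer sequences, while the atom+coordinate scheme shortens sequences but needs one vocabulary entry per distinct rounded coordinate value. A rotation leaves interatomic distances unchanged, so each rotated copy describes the same structure with a different token sequence.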