Large language model for molecular chemistry

Jie Pan
DOI: https://doi.org/10.1038/s43588-023-00399-1
2023-01-24
Nature Computational Science
Abstract: Machine learning (ML) has disruptively changed the way scientists predict molecular structure and properties that are relevant to chemical and materials design. Graph neural networks (GNNs) are an example of ML models that have shown great promise in such tasks. However, the success of GNNs in molecular prediction relies on a supervised training strategy, which requires a large amount of labeled data: label annotation of molecules is time-consuming and, more importantly, can become impractical given the vast chemical space. Task-agnostic transformer-based language models are a promising alternative for learning from unlabeled corpora, but the string-based representations that they often use, such as SMILES (simplified molecular-input line-entry system), do not contain precise topological information (unlike GNNs, for instance), which limits the prediction accuracy of language models. In a recent study, Jerret Ross, Payel Das and colleagues introduce a large-scale transformer-based language model with relative position embedding that enables the encoding of spatial information in molecules.