Language models in molecular discovery

Nikita Janakarajan, Tim Erdmann, Sarath Swaminathan, Teodoro Laino, Jannis Born
2023-09-28
Abstract: The success of language models, especially transformer-based architectures, has trickled into other domains giving rise to "scientific language models" that operate on small molecules, proteins or polymers. In chemistry, language models contribute to accelerating the molecule discovery cycle as evidenced by promising recent findings in early-stage drug discovery. Here, we review the role of language models in molecular discovery, underlining their strength in de novo drug design, property prediction and reaction chemistry. We highlight valuable open-source software assets thus lowering the entry barrier to the field of scientific language modeling. Last, we sketch a vision for future molecular design that combines a chatbot interface with access to computational chemistry tools. Our contribution serves as a valuable resource for researchers, chemists, and AI enthusiasts interested in understanding how language models can and will be used to accelerate chemical discovery.
Chemical Physics, Artificial Intelligence, Computation and Language, Machine Learning, Biomolecules
What problem does this paper attempt to address?
The paper reviews the application and potential of language models (LMs) in molecular discovery, with the aim of reducing the high resource costs and long R&D cycles that characterize the chemical industry. Specifically, it addresses the following key problems:

1. **Accelerating the Molecular Discovery Cycle**: Language models can speed up the design of small molecules such as drugs. Traditional methods rely on experimental trial and error, which is costly and time-consuming; language models can instead generate many candidate hypotheses quickly and screen them computationally before any experiment is run.
2. **Improving Design Efficiency**: By operating on sequence representations of molecules (such as SMILES strings), language models can learn complex patterns in molecular structure and tailor designs to desired functional properties, enabling a smooth, targeted exploration of an inherently discrete molecular space.
3. **Building Interactive Design Tools**: A chatbot-style interface, combined with access to computational chemistry tools, would let chemists state design goals in natural language and refine candidate designs iteratively in dialogue with the system, so that complex chemical tasks can be completed more efficiently.
4. **Conditional Generation and Property Prediction**: Conditional generative models create molecules with specified properties. Pairing them with molecular property prediction models provides a computational check that the generated molecules are likely to show the desired properties, not just satisfy the generation constraints.
5. **Software Tools and Platform Development**: To lower the entry barrier to the field, the paper surveys open-source software tools and platforms, including libraries for training and applying language models (such as GT4SD and RXN for Chemistry) as well as tools for molecular property prediction and data processing.

In summary, the paper shows how language models can contribute to molecular discovery, especially de novo drug design, and points chemists and AI researchers to the methods and open-source tools available for accelerating the discovery process.
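To make the idea of a sequence representation more concrete, the sketch below splits a SMILES string into tokens, which is usually the first step before a language model can consume a molecule. The regular expression and the `tokenize_smiles` helper are illustrative assumptions rather than anything taken from the paper, and the pattern covers only a common subset of SMILES syntax.

```python
import re
from typing import List

# Illustrative pattern (an assumption, not from the paper). Order matters:
# bracket atoms and two-letter halogens must be tried before single letters.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]"        # bracket atoms such as [Na+] or [C@@H]
    r"|Br|Cl"             # two-letter organic-subset atoms
    r"|%[0-9]{2}"         # two-digit ring closures such as %12
    r"|[BCNOSPFI]"        # single-letter aliphatic atoms
    r"|[bcnosp]"          # aromatic atoms
    r"|[()=#\-+\\/:.~@*$]"  # branches, bonds, and other symbols
    r"|[0-9])"            # single-digit ring closures
)


def tokenize_smiles(smiles: str) -> List[str]:
    """Split a SMILES string into tokens; reject unrecognized characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    # If any character was skipped by the pattern, the pieces will not
    # reassemble into the input string.
    if "".join(tokens) != smiles:
        raise ValueError(f"Unrecognized characters in SMILES: {smiles!r}")
    return tokens


if __name__ == "__main__":
    # Sodium chloride written as two bracket atoms joined by a dot.
    print(tokenize_smiles("[Na+].[Cl-]"))  # ['[Na+]', '.', '[Cl-]']
```

Each token can then be mapped to an integer ID and fed to a sequence model in the same way as words in a sentence. A production pipeline would more likely rely on the tokenizer that ships with the chosen library than on a hand-written pattern like this one.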