Abstract: Contemporary research on machine learning (ML), especially work that uses natural language processing (NLP), has opened a new paradigm of opportunities. Language-like representations of molecules are used to decipher complex patterns that can help predict molecular properties and reaction outcomes by leveraging state-of-the-art ML models. Herein, we review various ML models that use chemical language representations, with emphasis on their efficacy in dealing with problems in chemical science.

Molecular properties and reactions form the foundation of chemical space. Over the years, innumerable molecules have been synthesized; a small fraction of them found immediate applications, while a larger proportion served as a testimony to the creative and empirical nature of chemical science. With increasing emphasis on sustainable practices, it is desirable to realize an active candidate by synthesizing a target set of molecules through fewer empirical attempts rather than by preparing a larger library. On this front, predictive endeavors using ML models built on available data are especially timely. Prediction of molecular properties and reaction outcomes remains one of the burgeoning applications of ML in chemical science. Among the several methods of encoding molecules for ML models, those that employ language-like representations are steadily gaining popularity. Such representations additionally make it possible to adopt well-developed NLP models for chemical applications. Against this background, we describe several successful chemical applications of NLP, focusing on molecular property and reaction outcome predictions. From relatively simple recurrent neural networks (RNNs) to more complex models such as transformers, different network architectures have been leveraged for tasks such as de novo drug design, catalyst generation, and forward and retrosynthesis predictions. The chemical language model (CLM) provides promising avenues toward a broad range of applications in a time- and cost-effective manner. While we present an optimistic outlook for CLMs, attention is also given to the persisting challenges in the reaction domain, which may be addressed by advanced algorithms tailored to chemical language and by the increased availability of high-quality datasets.
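To make the idea of a language-like molecular representation concrete, the following is a minimal sketch of how a SMILES string might be split into tokens and mapped to integer indices before being passed to an NLP-style model. SMILES is assumed here as the representation, since the abstract does not name one. The token pattern, the function names (`tokenize_smiles`, `build_vocab`, `encode`), and the two-molecule corpus are all hypothetical choices made for illustration. They are not taken from any model covered in the review.

```python
import re

# Assumed, simplified token pattern (not from the review): bracket atoms such
# as [N+], the two-letter halogens Br and Cl, two-digit ring closures written
# as %NN, and otherwise any single character.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize_smiles(smiles):
    """Split a SMILES string into a list of chemically meaningful tokens."""
    return SMILES_TOKEN.findall(smiles)


def build_vocab(corpus):
    """Map every distinct token in the corpus to an integer index."""
    tokens = sorted({tok for s in corpus for tok in tokenize_smiles(s)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def encode(smiles, vocab):
    """Convert a SMILES string into the index sequence a model would consume."""
    return [vocab[tok] for tok in tokenize_smiles(smiles)]


# Hypothetical toy corpus: ethanol and acetyl chloride.
corpus = ["CCO", "CC(=O)Cl"]
vocab = build_vocab(corpus)
ethanol_ids = encode("CCO", vocab)
```

The resulting index sequences play the same role as word indices in a text model, which is what allows RNN and transformer architectures developed for NLP to be reused on molecules. Practical tokenizers typically handle more cases, such as stereochemistry markers and reaction arrows, than this sketch does.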

Advances in machine learning with chemical language models in molecular property and reaction outcome predictions