Abstract: Contemporary research on machine learning (ML), especially work that uses natural language processing, has unveiled a new paradigm of opportunities. Language-like representations of molecules are used to decipher complex patterns that can help predict molecular properties and reaction outcomes by leveraging state-of-the-art ML models. Herein, we review various ML models built on chemical language models, with emphasis on their efficacy in dealing with problems in chemical science. Molecular properties and reactions form the foundation of chemical space. Over the years, innumerable molecules have been synthesized; a smaller fraction of them found immediate applications, while a larger proportion served as a testimony to the creative and empirical nature of chemical science. With increasing emphasis on sustainable practices, it is desirable that a target set of molecules be synthesized through fewer empirical attempts, rather than through a larger library, to realize an active candidate. On this front, predictive endeavors using ML models built on available data are timely and significant. Prediction of molecular properties and reaction outcomes remains one of the burgeoning applications of ML in chemical science. Among the several methods of encoding molecules for ML models, those that employ language-like representations are steadily gaining popularity. Such representations additionally make it possible to adopt well-developed natural language processing (NLP) models for chemical applications. Against this advantageous background, we describe several successful chemical applications of NLP, focusing on molecular property and reaction outcome predictions. From relatively simple recurrent neural networks (RNNs) to complex models such as transformers, different network architectures have been leveraged for tasks such as de novo drug design, catalyst generation, and forward and retrosynthesis prediction.
The chemical language model (CLM) provides promising avenues toward a broad range of applications in a time- and cost-effective manner. While we present an optimistic outlook for CLMs, attention is also given to the persisting challenges in the reaction domain, which could be addressed by advanced algorithms tailored to chemical language and by the increased availability of high-quality datasets.

CLAMP: A Contrastive Language And Molecule Pre-training Network

Molecular contrastive learning of representations via graph neural networks

Generative Chemical Transformer: Neural Machine Learning of Molecular Geometric Structures from Chemical Language via Attention

Chemical Language Models for Molecular Design

MoleMCL: a multi-level contrastive learning framework for molecular pre-training

MoCL: Data-driven Molecular Fingerprint via Knowledge-aware Contrastive Learning from Molecular Graph

Crystal Transformer: Self-learning neural language model for Generative and Tinkering Design of Materials

Molecular Graph Contrastive Learning with Parameterized Explainable Augmentations

Chemical Language Model Linker: blending text and molecules with modular adapters

Crystal Composition Transformer: Self-Learning Neural Language Model for Generative and Tinkering Design of Materials

Substrate Scope Contrastive Learning: Repurposing Human Bias to Learn Atomic Representations

Extracting Molecular Properties from Natural Language with Multimodal Contrastive Learning

Cross‐Modal Graph Contrastive Learning with Cellular Images

Bridging the Gap between Chemical Reaction Pretraining and Conditional Molecule Generation with a Unified Model

Molecule-Morphology Contrastive Pretraining for Transferable Molecular Representation

CasANGCL: pre-training and fine-tuning model based on cascaded attention network and graph contrastive learning for molecular property prediction

Improving machine learning performance on small chemical reaction data with unsupervised contrastive pretraining

Advances in machine learning with chemical language models in molecular property and reaction outcome predictions

COATI: Multimodal Contrastive Pretraining for Representing and Traversing Chemical Space

ACR-GNN: Adaptive Cluster Reinforcement Graph Neural Network Based on Contrastive Learning

Improving Molecular Contrastive Learning via Faulty Negative Mitigation and Decomposed Fragment Contrast