Abstract: Molecular science is the core of chemistry and also the basis of biology, materials science, pharmacy and other disciplines. Traditional molecular science research relies on experimental or theoretical means, which are costly, time-consuming and poorly suited to highly complex systems. With the advent of the big data era, data-driven artificial intelligence (AI) has become the fourth research paradigm after experiment, theory and simulation. With its fast and efficient data processing capabilities, data-driven machine learning has shown great potential in molecular science research. A standard machine learning workflow usually includes three steps: data set construction, molecular descriptor selection, and model building. First, the quality, quantity, and diversity of the available data impose an upper limit on the accuracy and generality of the model. In Chapter 1, we classify publicly available data sources according to the research categories of molecular science, including basic molecular information, combustion, atmospheric and interstellar chemistry, biology, proteins, pharmaceuticals and organic chemistry. The contents, characteristics and access methods of each database are introduced in detail. Second, descriptors are the key link between machine learning and chemical science. A good molecular descriptor should satisfy at least three criteria: it should describe a molecule uniquely, be sensitive to the target property, and be easy to obtain. Descriptors should also have physical significance, which helps reveal what the model has learned and makes the model interpretable. In Chapter 2, we introduce some widely used structural descriptors and physical or chemical property descriptors. Third, once the data have been collected and represented with appropriate descriptors, a model needs to be selected according to the data type and research question.
In Chapter 3, we introduce supervised and unsupervised learning algorithms that are widely used in molecular science. After introducing these three steps of machine learning, Chapter 4 presents applications in several subareas of molecular science research. For example, in molecular property prediction, machine learning has been used to predict atomization energy, polarizability, electron affinity, ionization energy, electronegativity, etc. In molecular design, variational autoencoders (VAEs), generative adversarial networks (GANs) and reinforcement learning models are widely used, especially in the inverse design of drug molecules. In chemical reactions, machine learning has been used to predict reaction barriers, reaction rate constants, quantum reaction rates and chemical reaction yields, and it has also made great strides in retrosynthesis. In theoretical chemistry, machine-learned potential energy surfaces of molecules and materials greatly accelerate the simulation of structural evolution in materials and the prediction of chemical reactions. Furthermore, based on big data and AI, robotic chemists that can explore chemical reactivity and carry out automated synthesis have been developed. It is worth mentioning that Chinese scientists have made breakthroughs in this field. In summary, data-driven machine learning provides new tools for molecular science, where the rules are clear but the evolution is complex. However, it also faces challenges and opportunities. First, the biggest challenge is the lack of high-quality data sets. Here we propose three possible solutions: open sharing of data and models, wider adoption of electronic laboratory notebooks, and construction of models that can fully mine information from small data sets. Second, it remains an open question how to build interpretable machine learning models that have good fault tolerance and transferability and that can in turn reshape our understanding of chemistry.
As more and more scientists use data-driven machine learning models in molecular science, the underlying principles are becoming clearer. This new research paradigm is changing the way we discover molecules and study molecular science, and it could lead to disruptive new discoveries.
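
The three-step workflow named in the abstract (data set construction, descriptor selection, model building) can be made concrete with a very small sketch. Everything in it is an assumption chosen for illustration and is not taken from the paper: the toy data set of linear alkanes with approximate, rounded boiling points, the hypothetical helper names `describe` and `predict`, the atom-count descriptor, and the choice of k-nearest-neighbour regression as the model.

```python
import math

# Step 1: data set construction.
# Toy data (assumption): SMILES strings of linear alkanes mapped to
# approximate normal boiling points in degrees Celsius, rounded.
# Heptane ("CCCCCCC") is deliberately left out so it can be predicted.
DATASET = {
    "C": -161.5,
    "CC": -88.6,
    "CCC": -42.1,
    "CCCC": -0.5,
    "CCCCC": 36.1,
    "CCCCCC": 68.7,
    "CCCCCCCC": 125.6,
}


def describe(smiles):
    """Step 2: map a linear-alkane SMILES string to a numeric descriptor.

    The descriptor is (carbon count, hydrogen count). It is unique only
    within this toy family of molecules and is cheap to compute.
    """
    n_carbon = smiles.count("C")
    n_hydrogen = 2 * n_carbon + 2  # valid only for acyclic alkanes
    return (n_carbon, n_hydrogen)


def predict(smiles, dataset, k=2):
    """Step 3: k-nearest-neighbour regression in descriptor space."""
    query = describe(smiles)
    # Rank training molecules by Euclidean distance to the query descriptor.
    ranked = sorted(dataset, key=lambda s: math.dist(query, describe(s)))
    neighbours = ranked[:k]
    # The prediction is the mean property value of the k closest molecules.
    return sum(dataset[s] for s in neighbours) / len(neighbours)


heptane_estimate = predict("CCCCCCC", DATASET)
print(f"estimated boiling point of heptane: {heptane_estimate:.1f} C")
```

With these toy values the two nearest neighbours of heptane are hexane and octane, so the estimate is their average, about 97 °C, against a measured value of roughly 98 °C. The sketch only shows how the three steps connect. A real study would use curated databases, richer descriptors and a validated model.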
