Abstract: Molecular science is the core of chemistry and also the basis of biology, materials science, pharmacy, and other disciplines. Traditional molecular science research relies on experimental or theoretical means, which are costly and time-consuming and struggle with highly complex systems. With the advent of the big data era, data-driven artificial intelligence (AI) has become the fourth research paradigm, after experiment, theory, and simulation. With its fast and efficient data processing capabilities, data-driven machine learning has shown great potential in molecular science research. A standard machine learning workflow usually includes three steps: data set construction, molecular descriptor selection, and model building. First, the quality, quantity, and diversity of the available data impose an upper limit on the accuracy and generality of the model. In Chapter 1, we classify publicly available data sources according to the research categories of molecular science, including basic molecular information, combustion, atmospheric and interstellar chemistry, biology, proteins, pharmaceuticals, and organic chemistry. The contents, characteristics, and access methods of each database are introduced in detail. Second, descriptors are the key link between machine learning and chemical science. A good molecular descriptor should satisfy at least three criteria: it should describe a molecule uniquely, be sensitive to the target property, and be easy to obtain. At the same time, descriptors should have physical significance, which helps reveal what the model has learned and makes the model interpretable. In Chapter 2, we introduce some widely used structural descriptors and physical or chemical property descriptors. Then, once the data have been collected and represented with an appropriate descriptor, a model needs to be selected according to the data type and research question.
In Chapter 3, we introduce supervised and unsupervised learning algorithms that are widely used in molecular science. Having covered these three steps of machine learning, we present in Chapter 4 its applications in several subareas of molecular science research. For example, in molecular property prediction, machine learning has been used to predict atomization energy, polarizability, electron affinity, ionization energy, electronegativity, and other properties. In molecular design, variational autoencoders (VAEs), generative adversarial networks (GANs), and reinforcement learning models are widely used, especially in the inverse design of drug molecules. For chemical reactions, machine learning has been used to predict reaction barriers, reaction rate constants, quantum reaction rates, and reaction yields, and it has also made great strides in retrosynthesis. In theoretical chemistry, machine-learned potential energy surfaces of molecules and materials greatly accelerate the simulation of structural evolution in materials and the prediction of chemical reactions. Furthermore, robotic chemists based on big data and AI have been developed that can explore chemical reactivity and carry out automatic synthesis. It is worth mentioning that Chinese scientists have made breakthroughs in this field. In summary, data-driven machine learning provides new tools for molecular science, a field with clear rules and complex evolution. However, it also faces challenges and opportunities. First, the biggest challenge is the lack of high-quality data sets. Here we propose three possible solutions: open sharing of data and models, wider adoption of electronic laboratory notebooks, and construction of models that can fully mine information from small data sets. Second, it remains an open question how to build interpretable machine learning models that have good fault tolerance and transferability and that can, in turn, reshape our understanding of chemistry.
As more and more scientists use data-driven machine learning models in molecular science, the underlying principles are becoming clearer. This new research paradigm is changing the way we discover molecules and study molecular science, and it could lead to disruptive new discoveries.
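
The three-step workflow named in the abstract (data set construction, descriptor selection, model building) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy data set uses molar masses computed from standard atomic weights as a stand-in target property, the names `composition_descriptor`, `fit_linear`, and `predict` are hypothetical, and the model is plain least squares. The atom-count descriptor is deliberately simple and does not meet the uniqueness criterion, since isomers such as ethanol and dimethyl ether share the formula C2H6O.

```python
import re

# Step 1: data set construction.
# Toy data: formula -> molar mass in g/mol, computed from standard atomic
# weights (C 12.011, H 1.008, O 15.999). The molar mass is only a stand-in
# for a real target property; this is an illustrative assumption.
DATASET = {
    "CH4": 16.043,
    "C2H6": 30.070,
    "CH4O": 32.042,
    "C2H6O": 46.069,
    "H2O": 18.015,
    "CO2": 44.009,
}

ELEMENTS = ("C", "H", "O")


# Step 2: descriptor selection.
def composition_descriptor(formula):
    """Map a formula such as 'C2H6O' to atom counts in ELEMENTS order.

    Raises KeyError for elements outside ELEMENTS.
    """
    counts = dict.fromkeys(ELEMENTS, 0)
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(number) if number else 1
    return [float(counts[e]) for e in ELEMENTS]


# Step 3: model building.
def fit_linear(X, y):
    """Least squares without intercept, solved via the normal equations."""
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(n)]
        + [sum(row[i] * t for row, t in zip(X, y))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


def predict(weights, formula):
    """Predict the target property as a weighted sum of atom counts."""
    return sum(w * x for w, x in zip(weights, composition_descriptor(formula)))


X = [composition_descriptor(f) for f in DATASET]
y = list(DATASET.values())
weights = fit_linear(X, y)
prediction = predict(weights, "C3H8O")  # a formula not in the training set
```

Because the toy target is exactly linear in the atom counts, the fitted weights should come out close to the atomic weights of C, H, and O, which is a small example of a descriptor with physical significance making a model interpretable. Real molecular properties are rarely this simple and generally call for the richer descriptors and models surveyed in the paper.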

Development and application of intelligent data processing software ChemDataSolution for modeling of multivariate data obtained from scientific instruments

Application of a genomic model for high-dimensional chemometric analysis

Development and application of data mining system for chemical process

Applied Research on Data Mining Optimization and Monitoring System for Ammonia Synthesis Process

Software package “Materials Designer” and its application in materials research

Automated Experimentation Powers Data Science in Chemistry

Data flow modeling, data mining and QSAR in high-throughput discovery of functional nanomaterials

Design of High Level Semantic Data Model for Laser System and Tool Developing

Data intelligence for molecular science

ChemDFM-X: Towards Large Multimodal Model for Chemistry

Research on Chemometrics Software Based on Parallel PLS

MODEL—molecular Descriptor Lab: A Web‐based Server for Computing Structural and Physicochemical Features of Compounds

Automics: an integrated platform for NMR-based metabonomics spectral processing and data analysis

Scientific data organization and management in numerical simulations

ChemSuite: A package for chemoinformatics calculations and machine learning

Overexpression of the Disease Resistance Gene Pto in Tomato Induces Gene Expression Changes Similar to Immune Responses in Human and Fruitfly

Analysis of extended X-ray absorption fine structure (EXAFS) data using artificial intelligence techniques

A review of Innovative Chemical Drawing and Spectra Prediction Computer Software

Met4DX: A Unified and Versatile Data Processing Tool for Multidimensional Untargeted Metabolomics Data

The Application of Multilinear Regression Model for Quantitative Analysis on the Basis of Excitation-Emission Matrix Spectra and the Release of a Free Graphical User Interface

A robotic AI-Chemist system for multi-modal AI-ready database