Data intelligence for molecular science
Yanbo Li, Jun Jiang, Yi Luo
DOI: https://doi.org/10.1360/TB-2022-1152
2023-01-01
Abstract: Molecular science is the core of chemistry and also the basis of biology, materials science, pharmacy, and other disciplines. Traditional molecular science research has relied on experimental or theoretical methods, which are costly, time-consuming, and poorly suited to highly complex systems. With the advent of the big-data era, data-driven artificial intelligence (AI) has become the fourth research paradigm after experiment, theory, and simulation. Thanks to its fast and efficient data processing, data-driven machine learning has shown great potential in molecular science research. A standard machine learning workflow usually includes three steps: data set construction, molecular descriptor selection, and model building. First, the quality, quantity, and diversity of the available data impose an upper limit on the accuracy and generality of the model. In Chapter 1, we classify publicly available data sources according to the research categories of molecular science, including basic molecular information and combustion, atmospheric and interstellar, biological, protein, pharmaceutical, and organic data. The contents, characteristics, and access methods of each database are introduced in detail. Second, descriptors are the key link between machine learning and chemical science. A good molecular descriptor should satisfy at least three criteria: it should describe a molecule uniquely, be sensitive to the target property, and be easy to obtain. Descriptors should also have physical meaning, which helps reveal what the model has learned and makes the model interpretable. In Chapter 2, we introduce some widely used structural descriptors and physical or chemical property descriptors. Third, once the data have been collected and represented with an appropriate descriptor, a model needs to be selected according to the data type and research question.
In Chapter 3, we introduce the supervised and unsupervised learning algorithms that are widely used in molecular science. Having covered these three steps of machine learning, we present in Chapter 4 its applications in several subareas of molecular science research. For example, in molecular property prediction, machine learning has been used to predict atomization energy, polarizability, electron affinity, ionization energy, electronegativity, and other properties. In molecular design, variational auto-encoders (VAEs), generative adversarial networks (GANs), and reinforcement learning models are widely used, especially in the inverse design of drug molecules. In chemical reactions, machine learning has been used to predict reaction barriers, reaction rate constants, quantum reaction rates, and chemical reaction yields, and it has also made great strides in retrosynthesis. In theoretical chemistry, machine learning models trained on the potential energy surfaces of molecules and materials greatly accelerate the simulation of structural evolution in materials and the prediction of chemical reactions. Furthermore, robotic chemists based on big data and AI have been developed that can explore chemical reactivity and carry out automatic synthesis. It is worth mentioning that Chinese scientists have made breakthroughs in this field. In summary, data-driven machine learning provides new tools for molecular science, a field with clear rules but complex evolution. However, it also faces challenges as well as opportunities. First, the biggest challenge is the lack of high-quality data sets. Here we propose three possible solutions: open sharing of data and models, wider adoption of electronic laboratory notebooks, and construction of models that can fully mine information from small data sets. Second, it remains an open question how to build interpretable machine learning models that have good fault tolerance and transferability and that can, in turn, reshape our understanding of chemistry.
As more and more scientists use data-driven machine learning models in molecular science, the underlying principles are becoming clearer. The new research paradigm is changing the way we discover molecules and study molecular science, and it could lead to disruptive new discoveries.