Machine learning prediction of electronic molecular excited state properties

Itamar Borges Jr, Rubens Souza, Julio Duarte, Ronaldo Goldschmidt
DOI: https://doi.org/10.26434/chemrxiv-2024-xkt1j
2024-10-18
Abstract: Accurate knowledge of the electronic properties of molecular excited states is fundamental for understanding the behavior of functional materials for organic electronics and sensors. In this work, we focus on determining the properties of the most intense peak in the electronic absorption spectra of organic molecules. For this purpose, we employed the quantum chemistry QM-symex dataset, which contains approximately 173,000 organic molecules with time-dependent DFT (TD-DFT) data for the first ten electronic absorption transitions. Each molecule is identified by its Cartesian coordinates. From the original QM-symex, we built a new dataset, QM-symex-modif, by converting the molecular Cartesian coordinates into the Simplified Molecular Input Line Entry System (SMILES) format and selecting the main transition orbitals of the most intense singlet absorption peak, together with the corresponding oscillator strengths and transition energies. We employed twenty machine learning (ML) algorithms to investigate these target properties plus the highest occupied molecular orbitals (HOMOs). As inputs for the ML algorithms, we employed several chemical descriptors generated with the RDKit tool for each molecule from its SMILES representation. The QM-symex-modif dataset significantly improved the accuracy of ML predictions of these key photophysical properties. Low mean absolute errors were obtained for the test set of 45,056 molecules. Additionally, a Shapley additive explanations (SHAP) analysis was carried out to evaluate the importance of the input parameters for the investigated ML models. We found several interesting relationships involving the input parameters. In particular, molecular weight stands out among the descriptors as highly important in determining HOMO values and the transition orbitals.
Chemistry
What problem does this paper attempt to address?
This paper aims to predict key properties of the electronically excited states of organic molecules with machine-learning methods, focusing on the most intense electronic absorption peak. Specifically, the targets are the main transition orbitals of the most intense absorption peak, the corresponding oscillator strength and transition energy, and the highest occupied molecular orbital (HOMO). These properties are crucial for understanding the behavior of functional materials in organic electronics and sensors. To this end, the authors started from the QM-symex dataset, which contains time-dependent density functional theory (TD-DFT) data for approximately 173,000 organic molecules, and converted the molecular Cartesian coordinates into the Simplified Molecular Input Line Entry System (SMILES) format to construct a new dataset, QM-symex-modif. Chemical descriptors generated with the RDKit tool then served as inputs to 20 different machine-learning algorithms that predict the target properties. A Shapley Additive Explanations (SHAP) analysis was also carried out to evaluate the importance of the input parameters for the studied models. In this way, the paper aims to improve the prediction accuracy of excited-state properties of organic molecules and thereby accelerate the discovery of new materials, especially for organic electronics and sensors.
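
Two steps in this workflow are simple enough to sketch in a few lines: picking the most intense transition among those computed for a molecule, and scoring predictions with the mean absolute error. The Python sketch below is illustrative only and is not the authors' code. It assumes that "most intense" means the largest oscillator strength, that energies are in eV, and that all listed transitions are singlets. The `Transition` record, the function names, and the numeric values are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Transition:
    """One computed absorption transition (hypothetical record layout)."""
    index: int                  # 1-based position among the computed transitions
    energy_ev: float            # transition energy; eV is an assumed unit
    oscillator_strength: float  # dimensionless measure of absorption intensity


def most_intense(transitions: Sequence[Transition]) -> Transition:
    """Return the transition with the largest oscillator strength.

    Ties are resolved in favor of the earliest transition in the sequence.
    """
    if not transitions:
        raise ValueError("no transitions supplied")
    return max(transitions, key=lambda t: t.oscillator_strength)


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average of |reference - prediction| over paired values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up values for illustration only; not taken from QM-symex.
example = [
    Transition(index=1, energy_ev=3.10, oscillator_strength=0.002),
    Transition(index=2, energy_ev=3.85, oscillator_strength=0.410),
    Transition(index=3, energy_ev=4.20, oscillator_strength=0.075),
]
peak = most_intense(example)
print(peak.index, peak.energy_ev, peak.oscillator_strength)
```

In a full pipeline, the energy and oscillator strength of `peak` would become regression targets, and `mean_absolute_error` would be evaluated on a held-out test set for each trained model.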