Machine Learning-Assisted High-Throughput Semi-empirical Search of OFET Molecular Materials

Zhenyu Chen, Jiahao Li, Yuzhi Xu
DOI: https://doi.org/10.48550/arXiv.2107.02613
2021-07-06
Abstract: Machine learning has been widely verified and applied in chemoinformatics, and has achieved outstanding results in the prediction, modification, and optimization of luminescence, magnetism, and electrode materials. Here, we propose a simple but effective depth-first search traversal (DFST) approach, combined with a LightGBM machine learning model, to search the chemical space of classic organic field-effect transistor (OFET) functional molecules. In total, 2,820,588 molecules of different structures within two types of skeletons are generated successfully, which shows the search efficiency of the DFST strategy. Because molecules are described with the simplified molecular-input line-entry system (SMILES), generating these alphanumeric strings directly tackles the inverse design problem, and the generated set has 100% chemical validity. The intrinsically distributed and efficient design of the Light Gradient Boosting Machine (LightGBM) model enables a much faster and more efficient training process, which means better model performance with less data. Finally, 184 out of the 2.8 million molecules are screened out, and density functional theory (DFT) calculations are carried out to verify the accuracy of the prediction.
Materials Science
What problem does this paper attempt to address?
The problem this paper attempts to address is: how to efficiently screen high-performance molecular materials suitable for organic field-effect transistors (OFETs), particularly n-type semiconductor materials. Specifically, the authors propose a method based on Depth-First Search Traversal (DFST) and the LightGBM machine learning model to generate and screen molecules with specific skeleton structures, and verify the accuracy of the predictions through Density Functional Theory (DFT) calculations.

### Main Issues

1. **Efficient Generation and Screening of Molecules**:
   - Traditional methods are time-consuming and costly when generating and screening a large number of molecules.
   - There is a need for an efficient method to generate and preliminarily screen a large number of potential OFET molecular materials.
2. **Improving Prediction Accuracy**:
   - Traditional DFT calculations, while accurate, are computationally intensive and time-consuming.
   - It is necessary to combine machine learning methods to quickly predict the HOMO and LUMO energy levels of molecules and ensure the accuracy of the predictions.
3. **Balancing Screening Efficiency and Accuracy**:
   - When screening large datasets, it is necessary to find a reasonable error range to balance screening efficiency and prediction accuracy.
   - Avoid screening too few molecules due to overly high precision requirements, or screening too many molecules due to overly low precision requirements, which would increase the cost of subsequent DFT verification.

### Solutions

1. **Depth-First Search Traversal (DFST) Generator**:
   - Use the DFST algorithm to generate molecules with specific skeleton structures, such as tetracene and pentacene.
   - Generate a large number of molecular structures by replacing carbon atoms and adding functional groups.
2. **LightGBM Machine Learning Model**:
   - Use the LightGBM model combined with molecular fingerprints (ECFP4) to predict the HOMO and LUMO energy levels of molecules.
   - Quickly screen out molecules with high electron transport performance.
3. **DFT Secondary Screening**:
   - Perform DFT calculations on the preliminarily screened molecules to verify the accuracy of their HOMO and LUMO energy levels.
   - Further optimize the screening criteria to ensure that the finally screened molecules have high performance.
4. **Optimizing Screening Criteria**:
   - Design a desired function to discuss the standards for a reasonable error range.
   - Find a balance point to optimize screening efficiency and data accuracy.

Through the above methods, the authors successfully generated 2,820,588 molecules with different structures and finally screened out 184 high-performance OFET molecular materials. This method is thousands of times faster than traditional high-throughput DFT screening while maintaining high prediction accuracy.
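
The generation step listed under Solutions (walking a fixed skeleton site by site, either replacing a carbon atom or attaching a functional group) can be illustrated with a minimal depth-first enumeration. This is a sketch under stated assumptions, not the authors' generator: the six-membered ring template, the three site options, and the `max_mods` pruning limit are all hypothetical stand-ins for the tetracene and pentacene skeletons and substituent sets used in the paper.

```python
from typing import List, Sequence

# Hypothetical toy skeleton: an aromatic six-membered ring in SMILES with four
# modifiable sites ("{}"). The paper uses tetracene- and pentacene-type
# skeletons; this short template only keeps the example small.
TEMPLATE = "c1{}{}{}{}c1"

# Hypothetical choices per site: keep the aromatic C-H ("c"), replace the
# carbon with nitrogen ("n"), or attach a fluorine substituent ("c(F)").
SITE_OPTIONS: Sequence[str] = ("c", "n", "c(F)")


def dfs_enumerate(template: str, options: Sequence[str], max_mods: int) -> List[str]:
    """Enumerate SMILES strings by depth-first traversal over the sites.

    Each recursion level fixes one site. A branch is pruned as soon as it
    would exceed `max_mods` modified sites (an assumed limit, not from the paper).
    """
    n_sites = template.count("{}")
    default = options[0]
    results: List[str] = []

    def visit(site: int, chosen: List[str], mods: int) -> None:
        if site == n_sites:
            # Leaf of the search tree: every site has been assigned.
            results.append(template.format(*chosen))
            return
        for opt in options:
            new_mods = mods + int(opt != default)
            if new_mods > max_mods:
                continue  # prune: too many modified sites on this branch
            chosen.append(opt)
            visit(site + 1, chosen, new_mods)
            chosen.pop()  # backtrack before trying the next option

    visit(0, [], 0)
    return results


if __name__ == "__main__":
    molecules = dfs_enumerate(TEMPLATE, SITE_OPTIONS, max_mods=2)
    print(len(molecules), molecules[:3])
```

The sketch performs no chemical validation and no canonicalization, so symmetry-equivalent strings are counted separately. A real pipeline would deduplicate with a cheminformatics toolkit before fingerprinting the molecules and passing them to the LightGBM model.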