An inorganic ABX3 perovskite materials dataset for target property prediction and classification using machine learning

Ericsson Tetteh Chenebuah, David Tetteh Chenebuah
2023-12-19
Abstract: The reliability of Machine Learning (ML) techniques in novel materials discovery often depends on the quality of the dataset, in addition to the relevant features used in describing the material. In this regard, the current study presents and validates a newly processed materials dataset that can be utilized for benchmark ML analysis, as it relates to the prediction and classification of deterministic target properties. The dataset was originally extracted from the Open Quantum Materials Database (OQMD) and contains 16,323 samples of ABX3 inorganic perovskite structures. The dataset is tabular in form and is preprocessed to include sixty-one generalized input features that broadly describe the physicochemical, stability/geometrical, and Density Functional Theory (DFT) target properties associated with the elemental ionic sites in a three-dimensional ABX3 polyhedron. For validation, four different ML models are employed to predict three distinct target properties, namely: formation energy, energy band gap, and crystal system. In the experiments, the best results are reported at 0.013 eV/atom MAE, 0.216 eV MAE, and 85% F1, corresponding to the formation energy prediction, band gap prediction, and crystal system multi-classification, respectively. Moreover, the results are compared with previous literature and affirm the usefulness of the current dataset for future benchmark materials analysis via ML techniques. The preprocessed dataset and source codes are openly available to download from http://github.com/chenebuah/ML_abx3_dataset.
Materials Science
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

The paper addresses the problem of predicting and classifying the properties of inorganic perovskite materials (ABX₃ type). Specifically, the authors constructed and validated a new materials dataset for benchmark analysis using machine learning (ML) techniques. The dataset was extracted from the Open Quantum Materials Database (OQMD) and contains 16,323 samples, each described by 61 generalized input features covering physicochemical, stability, and geometric properties as well as density functional theory (DFT) target properties.

### Main Research Objectives

1. **Dataset Construction and Validation**: Construct a high-quality dataset for training and validating machine learning models.
2. **Property Prediction**: Use machine learning models to predict three key properties: formation energy, band gap, and crystal system.
3. **Model Performance Evaluation**: Run experiments on the dataset with four machine learning models, Support Vector Machine (SVM), Random Forest (RFR), Extreme Gradient Boosting (XGB), and Light Gradient Boosting (LGBM), to evaluate their performance in prediction and classification tasks.
4. **Benchmark Comparison**: Compare the results of the current study with those in previous literature to validate the effectiveness of the dataset and the reliability of the models.

### Research Background

Inorganic perovskite structures are at the forefront of new energy materials discovery due to their wide range of compositions and configurations. These materials are used across many engineering applications, including superconductivity, piezoelectricity, ferroelectricity, optoelectronics, and catalysis. However, traditional first-principles methods and experimental synthesis are computationally expensive and resource-intensive when applied to large-scale materials design. Machine learning techniques are therefore widely applied for the rapid and efficient prediction of material properties.

### Research Methods

1. **Dataset Generation and Preprocessing**: Extract raw data samples from OQMD, then screen and clean them so that the dataset contains only samples that conform to the perovskite description and excludes unstable compounds.
2. **Feature Selection**: Each material sample is described by 61 input features, categorized into physicochemical properties, stability/geometric properties, and properties extracted from OQMD.
3. **Machine Learning Models**: Use four machine learning models suited to tabular data (SVM, RFR, XGB, and LGBM) for regression and classification analysis.
4. **Results Evaluation**: Evaluate model performance using metrics such as Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Coefficient of Determination (R²), and F1 score.

### Research Results

1. **Formation Energy Prediction**: The SVM model performed best in the formation energy prediction task, with an MAE of 0.013 eV/atom, RMSE of 0.070 eV/atom, and R² of 99.45%.
2. **Band Gap Prediction**: The LGBM model performed best in the band gap prediction task, with an MAE of 0.216 eV, RMSE of 0.440 eV, and R² of 87.90%.
3. **Crystal System Multi-classification**: With undersampling and oversampling used to handle the imbalanced dataset, the SVC (support vector classifier), XGB, and LGBM models showed similar performance in the multi-classification task, with an average F1 score of 0.85.

### Conclusion

This study constructed and validated a high-quality perovskite materials dataset for benchmark analysis using machine learning models. The experimental results indicate that models trained on the dataset predict formation energy, band gap, and crystal system with high accuracy, and that the class imbalance in the multi-classification task can be handled effectively with resampling. These results support the use of the dataset in further materials science research.
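
For readers unfamiliar with the evaluation metrics named under Research Methods (MAE, RMSE, R², and F1), the following is a minimal sketch of how each can be computed in plain Python. It is not the authors' code: the function names are hypothetical, the toy values in the usage example are made up for illustration, and the macro average over classes is an assumption, since the summary does not state which F1 averaging scheme the paper uses.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average of |true - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error: square root of the mean squared residual."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging is an assumption)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Made-up toy values, not taken from the dataset.
    energies_true = [1.0, 2.0, 3.0, 4.0]
    energies_pred = [1.5, 2.0, 2.5, 4.0]
    print("MAE :", mae(energies_true, energies_pred))
    print("RMSE:", rmse(energies_true, energies_pred))
    print("R2  :", r2(energies_true, energies_pred))

    systems_true = ["cubic", "cubic", "tetragonal", "orthorhombic"]
    systems_pred = ["cubic", "tetragonal", "tetragonal", "orthorhombic"]
    print("F1  :", macro_f1(systems_true, systems_pred))
```

MAE and RMSE carry the units of the target (eV/atom for formation energy, eV for band gap), while R² and F1 are dimensionless. This is why the reported errors for the two regression tasks are not directly comparable to each other, although their R² values are.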