Dealing with the big data challenges in AI for thermoelectric materials
Xue Jia, Alex Aziz, Yusuke Hashimoto, Hao Li
DOI: https://doi.org/10.1007/s40843-023-2777-2
2024-03-14
Science China Materials
Abstract: The development of artificial intelligence (AI), particularly data science and machine learning (ML), is revolutionizing the field of materials science. Yet some key challenges remain, including errors contained in large-scale materials datasets and overfitting when predicting temperature-dependent properties. In this work, using thermoelectric (TE) materials as an archetypal example, we first performed a series of rational actions to identify and discard questionable data, and obtained 92,291 data points covering 7,295 compositions at different temperatures from the Starrydata2 database. Next, we proposed a composition-based cross-validation method, which emphasizes that data points with the same composition but different temperatures should not be split across different sets, in order to avoid overfitting. Then, we built ML models using the gradient boosting decision tree (GBDT) method, and achieved remarkable R² values of ~0.89, ~0.90, and ~0.89 on the training dataset, the test dataset, and new out-of-sample experimental data published in 2023, respectively, verifying the model's high accuracy in predicting newly available materials. Using this ML model, we carried out a large-scale evaluation of the stable materials from the Materials Project database, and Ge₂Te₅As₂ and Ge₃(Te₃As)₂ were predicted to exhibit high zT values. Density functional theory calculations were then performed, and the calculated maximum zT values were 1.98 and 2.12 for n- and p-type Ge₂Te₅As₂, and 0.58 and 0.74 for n- and p-type Ge₃(Te₃As)₂, respectively, indicating their potential as TE materials and supporting our ML model. This work presents an example of dealing with and overcoming big data challenges in AI for materials science.
materials science, multidisciplinary
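
The composition-based cross-validation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `composition_kfold`, the round-robin fold assignment, and the toy rows are all assumptions made for illustration. The only idea taken from the abstract is that every data point sharing a composition must fall on the same side of a train/test split, whatever its temperature.

```python
import random
from collections import defaultdict


def composition_kfold(compositions, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs in which no composition
    appears in both the training and the test indices.

    `compositions` holds one composition label per data point.
    This helper is hypothetical and only sketches the grouping idea.
    """
    # Collect the row indices that belong to each composition.
    groups = defaultdict(list)
    for idx, comp in enumerate(compositions):
        groups[comp].append(idx)

    unique = sorted(groups)
    if n_splits < 2 or n_splits > len(unique):
        raise ValueError("n_splits must be between 2 and the number of compositions")

    # Shuffle whole compositions (not individual rows), then deal them
    # into folds round-robin.
    random.Random(seed).shuffle(unique)
    folds = [unique[i::n_splits] for i in range(n_splits)]

    for held_out in folds:
        held_out_set = set(held_out)
        test_idx = sorted(i for c in held_out for i in groups[c])
        train_idx = sorted(
            i for c in unique if c not in held_out_set for i in groups[c]
        )
        yield train_idx, test_idx


# Toy rows (composition, temperature in K); the values are made up.
rows = [
    ("Bi2Te3", 300), ("Bi2Te3", 400),
    ("PbTe", 300), ("PbTe", 500),
    ("SnSe", 300), ("SnSe", 700),
]
compositions = [comp for comp, _ in rows]
splits = list(composition_kfold(compositions, n_splits=3))

for train_idx, test_idx in splits:
    train_comps = {compositions[i] for i in train_idx}
    test_comps = {compositions[i] for i in test_idx}
    print(sorted(train_comps), "|", sorted(test_comps))
```

A plain row-wise random split could place the 300 K point of a composition in the training set and its 400 K point in the test set, so the test score would partly reflect interpolation along temperature rather than prediction for an unseen material. Grouping by composition removes that leakage.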