Dealing with the big data challenges in AI for thermoelectric materials

Xue Jia, Alex Aziz, Yusuke Hashimoto, Hao Li
DOI: https://doi.org/10.1007/s40843-023-2777-2
2024-03-14
Science China Materials
Abstract: The development of artificial intelligence (AI), particularly, data science and machine learning (ML), is revolutionizing the field of material science. Yet, some inevitable key challenges remain, including errors contained in large-scale material datasets and the overfitting of predicted temperature-dependent properties. In this work, using thermoelectric (TE) materials as an archetypal example, we firstly performed a series of rational actions to identify and discard questionable data, and obtained 92,291 data points consisting of 7295 compositions and different temperatures from the Starrydata2 database. Next, we proposed a composition-based cross-validation method to emphasize that the data points with the same compositions but different temperatures should not be split into different sets to avoid overfitting. Then, we built ML models using the gradient boosting decision tree (GBDT) method, and achieved remarkable R² values of ∼0.89, ∼0.90, and ∼0.89 on the training dataset, test dataset, and new out-of-sample experimental data published in 2023, verifying the model's high accuracy in predicting newly available materials. Using this ML model, we carried out a large-scale evaluation of the stable materials from the Materials Project database, and Ge2Te5As2 and Ge3(Te3As)2 were predicted to exhibit high zT values. Density functional theory calculations were then executed and the calculated maximum zT values were 1.98 and 2.12 for n- and p-type Ge2Te5As2, and 0.58 and 0.74 for n- and p-type Ge3(Te3As)2, respectively, indicating their potential as TE materials and supporting our ML model. This work presents an example of dealing with and overcoming big data challenges in AI for materials science.
materials science, multidisciplinary
What problem does this paper attempt to address?
The paper primarily focuses on addressing the big data challenges in the field of thermoelectric materials, particularly on how to utilize artificial intelligence (AI) technology to improve the efficiency of thermoelectric material screening and the accuracy of performance prediction. Specifically, the research addresses the following key issues:

1. **Data Quality Issues**: Handling erroneous data present in large-scale thermoelectric material databases, such as typographical errors in publications and experimental errors. Low-quality data is identified and removed through reasonable strategies.
2. **Overfitting Issues**: Avoiding overfitting phenomena related to temperature-dependent properties during machine learning modeling, ensuring that the model can effectively predict the performance of new materials. To this end, a composition-based cross-validation method is proposed.
3. **Establishing Efficient Prediction Models**: Utilizing optimized datasets to construct machine learning models to predict the dimensionless figure of merit (zT value) of new materials. The model demonstrates high prediction accuracy, performing well on the training set, test set, and newly published experimental data.
4. **Prediction and Validation of New Materials**: Based on the established model, a series of potential high-performance thermoelectric materials were predicted, and the potential of two materials (Ge2Te5As2 and Ge3(Te3As)2) as thermoelectric materials was further validated through density functional theory calculations.

In summary, this paper proposes a systematic approach to addressing the big data challenges in the field of thermoelectric materials, accelerating the discovery process of high-performance thermoelectric materials through machine learning technology.
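
The composition-based cross-validation idea can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `composition_kfold`, the fold-assignment scheme, and the toy records are all assumptions made for illustration. It shows only the core constraint stated in the abstract, namely that data points sharing a composition but measured at different temperatures are never split across the training and test sets.

```python
import random
from collections import defaultdict


def composition_kfold(compositions, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs in which no composition spans both sets.

    Hypothetical helper for illustration; not taken from the paper.
    """
    # Collect the row indices that belong to each composition.
    rows_by_composition = defaultdict(list)
    for idx, comp in enumerate(compositions):
        rows_by_composition[comp].append(idx)

    groups = sorted(rows_by_composition)
    if n_splits < 2 or n_splits > len(groups):
        raise ValueError("n_splits must be between 2 and the number of compositions")

    # Shuffle whole compositions, not individual (composition, temperature) rows.
    random.Random(seed).shuffle(groups)

    for fold in range(n_splits):
        test_groups = set(groups[fold::n_splits])
        test_idx = [i for g in groups if g in test_groups
                    for i in rows_by_composition[g]]
        train_idx = [i for g in groups if g not in test_groups
                     for i in rows_by_composition[g]]
        yield sorted(train_idx), sorted(test_idx)


# Toy records: (composition, temperature in K). Illustrative values only.
records = [
    ("Bi2Te3", 300), ("Bi2Te3", 400), ("Bi2Te3", 500),
    ("PbTe", 300), ("PbTe", 600),
    ("SnSe", 300), ("SnSe", 700), ("SnSe", 900),
    ("GeTe", 500), ("GeTe", 700),
]
compositions = [comp for comp, _ in records]

folds = list(composition_kfold(compositions, n_splits=2, seed=0))
for train_idx, test_idx in folds:
    train_comps = sorted({compositions[i] for i in train_idx})
    test_comps = sorted({compositions[i] for i in test_idx})
    print("train:", train_comps, "| test:", test_comps)
```

A plain row-wise shuffle would let the same composition appear at 300 K in training and at 400 K in testing, which inflates test scores because the model has effectively already seen that material. Grouping by composition removes this leakage. In practice, scikit-learn's `GroupKFold` offers the same kind of group-aware splitting when the composition is passed as the `groups` argument.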