Boosting SISSO Performance on Small Sample Datasets by Using Random Forests Prescreening for Complex Feature Selection

Xiaolin Jiang, Guanqi Liu, Jiaying Xie, Zhenpeng Hu
2024-09-28
Abstract: In materials science, data-driven methods accelerate material discovery and optimization while reducing costs and improving success rates. Symbolic regression, in particular the Sure Independence Screening and Sparsifying Operator (SISSO) method, is key to extracting material descriptors from large datasets. However, SISSO must store the entire expression space, which imposes heavy memory demands and limits its performance on complex problems. To address this issue, we propose the RF-SISSO algorithm, which combines Random Forests (RF) with SISSO. In this algorithm, Random Forests are used for prescreening: they capture non-linear relationships and improve feature selection, which may enhance the quality of the input data and boost accuracy and efficiency on regression and classification tasks. In a test on SISSO's verification problem with 299 materials, RF-SISSO demonstrates robust performance and high accuracy. It maintains a testing accuracy above 0.9 across all four training sample sizes and significantly enhances regression efficiency, especially for training subsets with smaller sample sizes. For the training subset with 45 samples, the efficiency of RF-SISSO was 265 times higher than that of the original SISSO. Because collecting large datasets is both costly and time-consuming in practical experiments, we believe RF-SISSO may benefit scientific research by efficiently offering high prediction accuracy with limited data.
Machine Learning; Materials Science; Data Analysis, Statistics and Probability
What problem does this paper attempt to address?
This paper addresses how to improve the performance and efficiency of material descriptor regression in materials science when the dataset is small. Specifically, it focuses on the high memory requirements and computational complexity that arise because the Sure Independence Screening and Sparsifying Operator (SISSO) method must store the entire expression space when extracting material descriptors. These costs are particularly prominent for complex problems and small-sample datasets, where they limit SISSO's performance. To address them, the paper proposes RF-SISSO, a method that combines Random Forests (RF) with SISSO. Random forests are used for prescreening before SISSO: they capture non-linear relationships and improve feature selection, which improves the quality of the input data and enhances the accuracy and efficiency of regression and classification tasks. On small-sample datasets in particular, RF-SISSO can significantly improve regression efficiency while maintaining high prediction accuracy. This allows RF-SISSO to provide efficient and accurate predictions when collecting a large amount of data in actual experiments is difficult.
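The prescreening idea can be illustrated with a minimal sketch. The text above does not say how the random forest ranks features or how many are kept, so the following assumes impurity-based importances from scikit-learn's `RandomForestRegressor` and a simple top-k cut. The helper name `rf_prescreen`, its parameters, and the synthetic data are hypothetical, and the SISSO step itself is not shown.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def rf_prescreen(X, y, n_keep, n_trees=500, seed=0):
    """Return the column indices of the n_keep features that a random
    forest ranks as most important (hypothetical helper, not the paper's code).
    """
    forest = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
    forest.fit(X, y)
    # Impurity-based importances, sorted from most to least important.
    ranking = np.argsort(forest.feature_importances_)[::-1]
    return ranking[:n_keep]


# Synthetic stand-in for a small dataset (assumed values, NOT the paper's data):
# 45 samples, 20 candidate features, of which only columns 0 and 3 matter.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(45, 20))
y = 4.0 * X[:, 0] + 3.0 * X[:, 3] ** 2 + rng.normal(0.0, 0.05, size=45)

selected = rf_prescreen(X, y, n_keep=5)

# Only the retained columns would be passed on to SISSO, so the expression
# space it must build and store starts from 5 features instead of 20.
X_reduced = X[:, selected]
print("retained feature indices:", selected.tolist())
```

Under these assumptions, the saving comes from shrinking the set of primary features before SISSO combines them. Other ranking criteria, such as permutation importance, could be substituted in the same place.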