Recent advances in the SISSO method and their implementation in the SISSO++ code

Thomas A. R. Purcell, Matthias Scheffler, Luca M. Ghiringhelli
2023-05-02
Abstract: Accurate and explainable artificial-intelligence (AI) models are promising tools for accelerating the discovery of new materials, or of new applications for existing materials. Recently, symbolic regression has become an increasingly popular tool for explainable AI because it yields models that are relatively simple analytical descriptions of target properties. Due to its deterministic nature, the sure-independence screening and sparsifying operator (SISSO) method is a particularly promising approach for this application. Here we describe the new advancements of the SISSO algorithm, as implemented in SISSO++, a C++ code with Python bindings. We introduce a new representation of the mathematical expressions found by SISSO. This is a first step towards introducing "grammar" rules into the feature creation step. Importantly, by introducing a controlled non-linear optimization to the feature creation step, we expand the range of possible descriptors found by the methodology. Finally, we introduce refinements to the solver algorithms for both regression and classification that drastically increase the reliability and efficiency of SISSO. For all of these improvements to the basic SISSO algorithm, we not only illustrate their potential impact, but also fully detail how they operate both mathematically and computationally.
Data Analysis, Statistics and Probability, Materials Science, Computational Physics
What problem does this paper attempt to address?
The paper focuses on improving SISSO (sure-independence screening and sparsifying operator), a symbolic regression method, in order to accelerate the discovery of new materials and of new applications for existing materials. Specifically, it addresses the following core issues:

1. **Enhancing the interpretability and physical relevance of AI models**: Although AI has succeeded in describing physicochemical properties, creating AI models that are both interpretable and physically meaningful remains an unresolved challenge. To this end, the paper proposes a series of improvements to the SISSO algorithm.
2. **Extending the functionality of the SISSO algorithm**: The paper introduces binary expression trees for feature representation, a parameterized SISSO method, linear programming for classification problems, and a multi-residual method.
3. **Ensuring consistency of physical units and numerical stability**: Units and range constraints are tracked explicitly, so that the generated mathematical expressions are physically meaningful and numerical instabilities are avoided.
4. **Optimizing the feature creation step**: The new feature representation (binary expression trees) and the parameterized SISSO technique expand the feature space, which helps in finding more complex and more accurate analytical expressions.
5. **Improving the descriptor identification step**: Linear programming is used to solve high-dimensional classification problems, and the multi-residual method selects the best feature combinations for constructing higher-dimensional models.
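To make the idea of a binary expression tree with unit checking concrete, here is a minimal Python sketch. It is not the SISSO++ data structure or API: the names `Node`, `feat`, `combine`, and the dictionary-based unit representation are all hypothetical and chosen only for illustration. Each leaf holds a primary feature with its unit, each internal node holds one binary operator, and the unit of any subtree is derived recursively, so an expression such as "length + energy" is rejected before it is ever evaluated.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical unit representation: base-unit name -> integer exponent.
Unit = Dict[str, int]


def combine(a: Unit, b: Unit, sign: int) -> Unit:
    """Multiply (sign=+1) or divide (sign=-1) two units by adding exponents."""
    out = dict(a)
    for base, power in b.items():
        out[base] = out.get(base, 0) + sign * power
    return {base: power for base, power in out.items() if power != 0}


@dataclass
class Node:
    op: str                            # "feat", "add", "sub", "mul", or "div"
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    name: str = ""                     # used by leaves only
    leaf_unit: Optional[Unit] = None   # used by leaves only

    def unit(self) -> Unit:
        """Derive the unit of this subtree; raise if the expression is inconsistent."""
        if self.op == "feat":
            return dict(self.leaf_unit or {})
        lu, ru = self.left.unit(), self.right.unit()
        if self.op in ("add", "sub"):
            if lu != ru:
                raise ValueError(f"unit mismatch in {self.op}: {lu} vs {ru}")
            return lu
        if self.op == "mul":
            return combine(lu, ru, +1)
        if self.op == "div":
            return combine(lu, ru, -1)
        raise ValueError(f"unknown operator {self.op}")

    def evaluate(self, row: Dict[str, float]) -> float:
        """Evaluate the expression for one sample given as {feature name: value}."""
        if self.op == "feat":
            return row[self.name]
        a, b = self.left.evaluate(row), self.right.evaluate(row)
        if self.op == "add":
            return a + b
        if self.op == "sub":
            return a - b
        if self.op == "mul":
            return a * b
        if self.op == "div":
            return a / b
        raise ValueError(f"unknown operator {self.op}")

    def expr(self) -> str:
        """Return a human-readable infix string for the tree."""
        if self.op == "feat":
            return self.name
        symbol = {"add": "+", "sub": "-", "mul": "*", "div": "/"}[self.op]
        return f"({self.left.expr()} {symbol} {self.right.expr()})"


def feat(name: str, unit: Unit) -> Node:
    """Convenience constructor for a leaf (primary feature)."""
    return Node("feat", name=name, leaf_unit=unit)


# Illustrative features; names and units are made up for this example.
r_a = feat("r_a", {"angstrom": 1})
r_b = feat("r_b", {"angstrom": 1})
energy = feat("E", {"eV": 1})
tree = Node("div", Node("add", r_a, r_b), energy)
```

Calling `tree.expr()` gives the infix form of the expression and `tree.unit()` gives its derived unit, while `Node("add", r_a, energy).unit()` raises a `ValueError`. A tree representation of this kind is what makes rule-based ("grammar") checks on candidate features straightforward, because every rule can be written as a recursive function over nodes.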
In summary, the paper aims to enhance the effectiveness and practicality of the SISSO algorithm through a series of technical improvements, thereby better supporting data-driven research in materials science.
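For orientation, the basic loop that these refinements build on can be sketched in a few lines of plain Python. This is an illustrative toy under stated assumptions, not the SISSO++ implementation: the function names (`pearson`, `sis`, `least_squares`, `sisso`) are hypothetical, the screening score is the absolute Pearson correlation, and only a single residual (that of the best lower-dimensional model) is used, whereas the multi-residual method described in the paper screens against the residuals of several top models. At each dimension, the screening step adds the features most correlated with the current residual to a candidate pool, and the sparsifying step exhaustively fits every combination of that size from the pool by least squares.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def sis(features, residual, n_keep, exclude):
    """Screening: the n_keep features (not in exclude) most correlated with residual."""
    names = [name for name in features if name not in exclude]
    names.sort(key=lambda name: -abs(pearson(features[name], residual)))
    return names[:n_keep]


def least_squares(columns, y):
    """Fit y ~ c0 + sum_i c_i * columns[i] via the normal equations (Gauss-Jordan)."""
    cols = [[1.0] * len(y)] + columns
    m = len(cols)
    # Augmented normal-equation matrix [X^T X | X^T y].
    aug = [[sum(a * b for a, b in zip(cols[i], cols[j])) for j in range(m)]
           + [sum(a * b for a, b in zip(cols[i], y))] for i in range(m)]
    for i in range(m):
        pivot = max(range(i, m), key=lambda r: abs(aug[r][i]))
        if abs(aug[pivot][i]) < 1e-12:
            return None  # (near-)singular: features are linearly dependent
        aug[i], aug[pivot] = aug[pivot], aug[i]
        for r in range(m):
            if r != i:
                factor = aug[r][i] / aug[i][i]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[i])]
    coef = [aug[i][m] / aug[i][i] for i in range(m)]
    pred = [sum(c * col[k] for c, col in zip(coef, cols)) for k in range(len(y))]
    return coef, pred


def sisso(features, y, max_dim, n_sis):
    """Return (feature names, coefficients incl. intercept, squared error) of the best model."""
    pool, residual, best = [], list(y), None
    for dim in range(1, max_dim + 1):
        pool += sis(features, residual, n_sis, set(pool))
        best = None
        for combo in combinations(pool, dim):  # exhaustive l0 search
            fit = least_squares([features[name] for name in combo], y)
            if fit is None:
                continue
            coef, pred = fit
            sse = sum((a - b) ** 2 for a, b in zip(y, pred))
            if best is None or sse < best[0]:
                best = (sse, combo, coef, pred)
        if best is None:
            raise ValueError("no non-singular model found at dimension %d" % dim)
        residual = [a - b for a, b in zip(y, best[3])]
    return best[1], best[2], best[0]


# Toy data: y = 1 + 2*a + 3*b exactly; "c" is an irrelevant feature.
toy_features = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
    "c": [0.0, 1.0, 0.0, 0.0, 1.0, 0.0],
}
toy_target = [6.0, 2.0, 10.0, 6.0, 14.0, 10.0]
names, coefficients, error = sisso(toy_features, toy_target, max_dim=2, n_sis=1)
```

On this toy data, the first screening step picks `a`, the second picks `b` from the residual of the one-dimensional model, and the two-dimensional fit recovers the generating coefficients. In practice, the feature dictionary would hold the very large set of expressions produced by the feature creation step rather than three hand-written columns.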