Transformer-based Planning for Symbolic Regression

Parshin Shojaee, Kazem Meidani, Amir Barati Farimani, Chandan K. Reddy
DOI: https://doi.org/10.48550/arXiv.2303.06833
2023-10-28
Abstract: Symbolic regression (SR) is a challenging task in machine learning that involves finding a mathematical expression for a function based on its values. Recent advancements in SR have demonstrated the effectiveness of pre-trained transformer-based models in generating equations as sequences, leveraging large-scale pre-training on synthetic datasets and offering notable advantages in terms of inference time over classical Genetic Programming (GP) methods. However, these models primarily rely on supervised pre-training goals borrowed from text generation and overlook equation discovery objectives like accuracy and complexity. To address this, we propose TPSR, a Transformer-based Planning strategy for Symbolic Regression that incorporates Monte Carlo Tree Search into the transformer decoding process. Unlike conventional decoding strategies, TPSR enables the integration of non-differentiable feedback, such as fitting accuracy and complexity, as external sources of knowledge into the transformer-based equation generation process. Extensive experiments on various datasets show that our approach outperforms state-of-the-art methods, enhancing the model's fitting-complexity trade-off, extrapolation abilities, and robustness to noise.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses several key challenges in symbolic regression (SR). Specifically, the authors focus on generating mathematical expressions that are more accurate and less complex, while improving the model's generalization ability and its robustness to noise. Traditional methods such as Genetic Programming (GP) can handle nonlinear and complex problems, but they suffer from slow convergence, high computational cost, and a tendency to overfit. Pre-trained Transformer-based models have a significant advantage in inference time, but they generate equations using supervised learning objectives and ignore equation discovery objectives such as accuracy and complexity. To solve these problems, the paper proposes a new method, Transformer-based Planning for Symbolic Regression (TPSR), which uses Monte Carlo Tree Search (MCTS) to optimize the equation sequence generation process. The main contributions of TPSR include:

1. **Introducing look-ahead planning**: MCTS guides the Transformer decoder so that non-differentiable feedback (such as fitting accuracy and complexity) is taken into account during generation, leading to better equations.
2. **Designing a new reward function**: The reward balances the fitting accuracy and the complexity of the generated equation to achieve a better trade-off between them.
3. **Performance improvement**: The experimental results show that TPSR consistently outperforms existing state-of-the-art methods on multiple benchmark datasets, and that the generated equations have higher fitting accuracy and lower complexity.
4. **Extrapolation ability and noise robustness**: TPSR shows stronger extrapolation ability and robustness to noise than the baseline methods.

Through these improvements, TPSR not only improves the effectiveness of symbolic regression but also remains efficient and avoids many of the problems of traditional methods.

### Formula presentation

The formulas involved in the paper are as follows:

- **Reward function**:

\[ r(\tilde{f}(\cdot) \mid x, y) = \frac{1}{1 + \text{NMSE}(y, \tilde{f}(x))} + \lambda \exp\left(-\frac{l(\tilde{f}(\cdot))}{L}\right) \]

where \( l \) is the equation complexity, that is, the sequence length in prefix notation; \( L \) is the maximum sequence length of the model; and \( \lambda \) is a hyperparameter that balances the trade-off between fitting accuracy and complexity. NMSE is calculated as:

\[ \text{NMSE}(y, \tilde{f}(x)) = \frac{\frac{1}{n} \| y - \tilde{f}(x) \|_2^2}{\frac{1}{n} \| y \|_2^2 + \epsilon} \]

where \( n \) is the number of data points and \( \epsilon \) is a small constant used to prevent numerical instability.

These improvements enable TPSR to achieve better performance in symbolic regression tasks and make it more useful in practical applications.
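
As a rough illustration of the reward and NMSE formulas above, here is a minimal pure-Python sketch. It is not the authors' implementation: the function names `nmse` and `reward`, the default values `lam=0.1` and `eps=1e-9`, and the maximum length of 200 tokens in the example are all assumptions chosen only to make the arithmetic concrete.

```python
import math

def nmse(y, y_pred, eps=1e-9):
    """Normalized mean squared error, following the NMSE formula above.

    eps is a small constant for numerical stability (its value here is an assumption).
    """
    n = len(y)
    mse = sum((a - b) ** 2 for a, b in zip(y, y_pred)) / n
    scale = sum(a * a for a in y) / n
    return mse / (scale + eps)

def reward(y, y_pred, seq_len, max_len, lam=0.1):
    """Reward = fitting term + lam * complexity term.

    seq_len : length l of the candidate equation in prefix notation
    max_len : maximum sequence length L of the model
    lam     : trade-off hyperparameter lambda (0.1 is an assumed value)
    """
    fitting = 1.0 / (1.0 + nmse(y, y_pred))
    simplicity = lam * math.exp(-seq_len / max_len)
    return fitting + simplicity

# Toy data generated by y = x^2.
x = [1.0, 2.0, 3.0, 4.0]
y = [v * v for v in x]

# Candidate A: "mul x x" (3 prefix tokens), a perfect fit.
pred_a = [v * v for v in x]
# Candidate B: a hypothetical equally accurate but longer expression (15 tokens).
pred_b = list(pred_a)
# Candidate C: "mul 2 x" (3 prefix tokens), a poor fit.
pred_c = [2.0 * v for v in x]

r_a = reward(y, pred_a, seq_len=3, max_len=200)
r_b = reward(y, pred_b, seq_len=15, max_len=200)
r_c = reward(y, pred_c, seq_len=3, max_len=200)
```

Under these assumptions, the two perfectly fitting candidates both have a fitting term of 1, so the shorter one (`r_a`) scores higher than the longer one (`r_b`), while the poorly fitting candidate (`r_c`) scores lowest despite being short. This is the accuracy-complexity trade-off that \( \lambda \) controls.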