Abstract: Symbolic regression (SR) is a challenging task in machine learning that involves finding a mathematical expression for a function based on its values. Recent advancements in SR have demonstrated the effectiveness of pre-trained transformer-based models in generating equations as sequences, leveraging large-scale pre-training on synthetic datasets and offering notable advantages in terms of inference time over classical Genetic Programming (GP) methods. However, these models primarily rely on supervised pre-training goals borrowed from text generation and overlook equation discovery objectives like accuracy and complexity. To address this, we propose TPSR, a Transformer-based Planning strategy for Symbolic Regression that incorporates Monte Carlo Tree Search into the transformer decoding process. Unlike conventional decoding strategies, TPSR enables the integration of non-differentiable feedback, such as fitting accuracy and complexity, as external sources of knowledge into the transformer-based equation generation process. Extensive experiments on various datasets show that our approach outperforms state-of-the-art methods, enhancing the model's fitting-complexity trade-off, extrapolation abilities, and robustness to noise.

What problem does this paper attempt to address?

This paper addresses several key challenges in symbolic regression (SR): generating mathematical expressions that are more accurate and less complex, while improving the model's generalization ability and its robustness to noise. Traditional methods such as Genetic Programming (GP) can handle nonlinear and complex problems, but they converge slowly, are computationally expensive, and are prone to overfitting. Pre-trained Transformer-based models have significant advantages in inference time, but they rely mainly on supervised learning objectives when generating equations and ignore equation discovery objectives such as accuracy and complexity.

To solve these problems, the paper proposes Transformer-based Planning for Symbolic Regression (TPSR), which uses Monte Carlo Tree Search (MCTS) to optimize the equation sequence generation process. The main contributions of TPSR are:

1. **Introducing look-ahead planning**: The MCTS algorithm guides the Transformer decoder so that non-differentiable feedback (such as fitting accuracy and complexity) is taken into account during generation, which yields better equations.
2. **Designing a new reward function**: The reward balances the fitting accuracy and the complexity of the generated equation to achieve a better trade-off between them.
3. **Performance improvement**: The experimental results show that TPSR consistently outperforms existing state-of-the-art methods on multiple benchmark datasets, and that the generated equations have higher fitting accuracy and lower complexity.
4. **Extrapolation ability and noise robustness**: TPSR shows stronger extrapolation ability and robustness to noise than the baseline methods.
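As a rough illustration of how a search tree can use reward feedback to decide which next token to explore, the sketch below implements the selection and backpropagation steps of a generic MCTS with the standard UCB1 rule. The `Node` class, the token names, and the exploration constant `c` are illustrative assumptions, not taken from the paper; the selection rule that TPSR actually uses may differ.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One node of the search tree: a partial equation ending in `token` (hypothetical structure)."""
    token: str
    visits: int = 0
    total_reward: float = 0.0
    children: List["Node"] = field(default_factory=list)

    @property
    def mean_reward(self) -> float:
        # Average reward of all completed equations that passed through this node.
        return self.total_reward / self.visits if self.visits else 0.0


def ucb1(parent: Node, child: Node, c: float = 1.4) -> float:
    """Standard UCB1 score: exploitation term plus exploration bonus."""
    if child.visits == 0:
        return math.inf  # always try an unvisited token first
    bonus = c * math.sqrt(math.log(max(parent.visits, 1)) / child.visits)
    return child.mean_reward + bonus


def select_child(parent: Node, c: float = 1.4) -> Node:
    """Pick the next token to expand: the child with the highest UCB1 score."""
    return max(parent.children, key=lambda child: ucb1(parent, child, c))


def backpropagate(path: List[Node], reward: float) -> None:
    """Credit every node on the root-to-leaf path with the reward of a completed equation."""
    for node in path:
        node.visits += 1
        node.total_reward += reward


# Toy usage: two candidate first tokens with different observed rewards.
root = Node("<root>")
add, mul = Node("add"), Node("mul")
root.children = [add, mul]
backpropagate([root, add], 0.9)  # an equation starting with "add" fit well
backpropagate([root, mul], 0.2)  # an equation starting with "mul" fit poorly
best = select_child(root)
```

Because the reward enters only through `backpropagate`, it does not need to be differentiable, which is the property the first contribution relies on.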
Through these improvements, TPSR improves the effectiveness of symbolic regression while maintaining high efficiency and avoiding many of the problems of traditional methods.

### Formula presentation

The formulas involved in the paper are as follows:

- **Reward function**:

\[ r(\tilde{f}(\cdot) | x, y) = \frac{1}{1 + \text{NMSE}(y, \tilde{f}(x))} + \lambda \exp\left(-\frac{l(\tilde{f}(\cdot))}{L}\right) \]

where \( l \) is the equation complexity, that is, the sequence length in prefix notation; \( L \) is the maximum sequence length of the model; and \( \lambda \) is a hyperparameter that balances the trade-off between fitting accuracy and complexity. NMSE is calculated as:

\[ \text{NMSE} = \frac{\frac{1}{n} \| y - \tilde{f}(x) \|_2^2}{\frac{1}{n} \| y \|_2^2 + \epsilon} \]

where \( \epsilon \) is a small constant used to prevent numerical instability.

These improvements enable TPSR to achieve better performance on symbolic regression tasks and make it more useful in practical applications.
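The reward and NMSE formulas above can be evaluated directly. The following is a minimal sketch in plain Python; the function names and the default values for `lam`, `eps`, and `max_length` are assumptions chosen for illustration and are not taken from the paper.

```python
import math
from typing import Sequence


def nmse(y: Sequence[float], y_pred: Sequence[float], eps: float = 1e-9) -> float:
    """Normalized mean squared error: mean((y - y_pred)^2) / (mean(y^2) + eps)."""
    n = len(y)
    mse = sum((a - b) ** 2 for a, b in zip(y, y_pred)) / n
    power = sum(a * a for a in y) / n
    return mse / (power + eps)


def reward(
    y: Sequence[float],
    y_pred: Sequence[float],
    length: int,             # l: token count of the equation in prefix notation
    max_length: int = 200,   # L: maximum sequence length (assumed value)
    lam: float = 0.1,        # lambda: accuracy/complexity trade-off (assumed value)
    eps: float = 1e-9,       # epsilon: numerical-stability constant (assumed value)
) -> float:
    """r = 1 / (1 + NMSE) + lam * exp(-l / L)."""
    fit_term = 1.0 / (1.0 + nmse(y, y_pred, eps))
    complexity_term = lam * math.exp(-length / max_length)
    return fit_term + complexity_term


# Toy usage: two candidate equations that fit the data equally well.
y_true = [1.0, 2.0, 3.0]
short_eq = reward(y_true, y_true, length=5)
long_eq = reward(y_true, y_true, length=50)
```

With identical fit, the shorter equation receives the higher reward, which is how the complexity term steers the search toward simpler expressions.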

Transformer-based Planning for Symbolic Regression

A Transformer Model for Symbolic Regression towards Scientific Discovery

End-to-end symbolic regression with transformers

Generative Pre-Trained Transformer for Symbolic Regression Base In-Context Reinforcement Learning

SymbolicGPT: A Generative Transformer Model for Symbolic Regression

Scalable Neural Symbolic Regression using Control Variables

Deep Generative Symbolic Regression

SymFormer: End-to-End Symbolic Regression Using Transformer-Based Architecture

Complexity-Aware Deep Symbolic Regression with Robust Risk-Seeking Policy Gradients

ParFam -- (Neural Guided) Symbolic Regression Based on Continuous Global Optimization

Discovering Mathematical Formulas from Data via GPT-guided Monte Carlo Tree Search

Transformers to Predict the Applicability of Symbolic Integration Routines

Controllable Neural Symbolic Regression

A Comparison of Recent Algorithms for Symbolic Regression to Genetic Programming

Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping

Logically Constrained Robotics Transformers for Enhanced Perception-Action Planning

Regression Planning Networks

Interpretable Machine Learning for Science with PySR and SymbolicRegression.jl

Discovering symbolic expressions with parallelized tree search

Symbolic Regression Algorithms with Built-in Linear Regression

A Functional Analysis Approach to Symbolic Regression