Abstract: Symbolic regression (SR) has emerged as a pivotal technique for uncovering the intrinsic information within data and enhancing the interpretability of AI models. However, current state-of-the-art (SOTA) SR methods struggle to correctly recover symbolic expressions from high-noise data. To address this issue, we introduce a novel noise-resilient SR (NRSR) method capable of recovering expressions from high-noise data. Our method leverages a novel reinforcement learning (RL) approach in conjunction with a designed noise-resilient gating module (NGM) to learn symbolic selection policies. The gating module dynamically filters meaningless information out of high-noise data, which makes the SR process highly noise-resilient. We also design a mixed path entropy (MPE) bonus term in the RL process to increase the exploration capabilities of the policy. Experimental results demonstrate that our method significantly outperforms several popular baselines on benchmarks with high-noise data. Furthermore, our method can also achieve SOTA performance on benchmarks with clean data, showcasing its robustness and efficacy in SR tasks.
What problem does this paper attempt to address?
The main problem this paper addresses is the poor performance of existing Symbolic Regression (SR) methods on high-noise data. Specifically, current state-of-the-art SR methods have difficulty correctly recovering symbolic expressions from data containing a large amount of noise, which limits their effectiveness in practical applications.
### Problem Background
Symbolic regression is a key technique for revealing the underlying information in data and enhancing the interpretability of AI models. However, the performance of existing SR methods declines significantly on high-noise data, which makes them hard to apply to the complex, noisy data found in real-world settings.
### Solution
To solve this problem, the authors propose a new Noise-Resilient Symbolic Regression (NRSR) method. It combines Reinforcement Learning (RL) with a purpose-built Noise-Resilient Gating Module (NGM) to improve the recovery of symbolic expressions from high-noise data. The main contributions are:
1. **Noise-Resilient Gating Module (NGM)**: Dynamically filters out meaningless information to reduce the impact of high-noise data on the symbolic regression process.
2. **Mixed Path Entropy (MPE)**: Adds an MPE bonus term to the RL process to enhance the exploration ability of the policy and prevent premature convergence to sub-optimal solutions.
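
As a rough illustration of the MPE idea, the sketch below blends a per-step entropy term with a path-level term. It is not the paper's exact formula: the mixing weight `alpha`, the function names, and the use of the sampled path's negative log-probability as a single-sample estimate of path entropy are all assumptions made for this example.

```python
import math


def step_entropy(probs):
    """Shannon entropy (natural log) of one token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mixed_entropy_bonus(step_dists, chosen, alpha=0.5):
    """Blend short-term (per-step) and long-term (whole-path) exploration terms.

    step_dists: one token distribution per generation step
    chosen:     index of the token sampled at each step
    alpha:      hypothetical mixing weight, not taken from the paper
    """
    # Short-term term: average entropy of the policy over the steps.
    mean_step = sum(step_entropy(d) for d in step_dists) / len(step_dists)
    # Long-term term (assumption): negative log-probability of the sampled
    # path, a single-sample estimate of the entropy over whole paths.
    path_nll = -sum(math.log(d[t]) for d, t in zip(step_dists, chosen))
    return alpha * mean_step + (1.0 - alpha) * path_nll


# Two steps, each uniform over two tokens.
bonus = mixed_entropy_bonus([[0.5, 0.5], [0.5, 0.5]], chosen=[0, 1])
```

In a policy-gradient setup, a bonus like this would typically be scaled by a small coefficient and added to the reward or the loss, so that the policy is rewarded for keeping its options open.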
### Method Overview
- **NGM Design and Training**: NGM uses L0 regularization to select input variables. It minimizes the Mean Squared Error (MSE) together with an L0-norm penalty on the gates, so that inputs contributing mostly noise are switched off. The objective is:
\[
J(W, G)=\min _{W, G} \frac{1}{m} \sum_{i = 1}^{m}(y_{i}-W X'_{i})^{2}+\lambda\|G\|_{0}
\]
where $\|G\|_{0}$ is the L0-norm of $G$, that is, the number of non-zero parameters, and $\lambda$ is the regularization parameter.
- **Integrated Gating Layer and Action Mask**: Combine the trained gating layer G with the original action mask to further screen input variables and reduce the complexity of the search space.
- **Expression Generation and Reinforcement Learning**: Generate expression sequences with an RNN and optimize the policy with the PPO algorithm, combining single-step entropy and path entropy to balance short-term and long-term exploration.
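
The gating objective and the mask combination described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that $X'_{i}$ is the input $X_{i}$ with each feature multiplied by its gate, that gates are binary after training, and that `var_token_ids` (a hypothetical name) maps each input variable to its position in the token library.

```python
def gated_objective(X, y, W, G, lam):
    """MSE of a linear fit on gated inputs plus lam * ||G||_0.

    Assumption: the gated input X'_i is X_i with feature j scaled by G[j].
    """
    m = len(X)
    sq_err = 0.0
    for x_i, y_i in zip(X, y):
        pred = sum(w * g * x for w, g, x in zip(W, G, x_i))
        sq_err += (y_i - pred) ** 2
    l0_norm = sum(1 for g in G if g != 0)  # number of open gates
    return sq_err / m + lam * l0_norm


def combine_masks(action_mask, G, var_token_ids):
    """Disable variable tokens whose gate is closed.

    action_mask:   list of bools over the token library (True = allowed)
    var_token_ids: var_token_ids[j] is the token index of input variable j
    """
    merged = list(action_mask)  # copy so the original mask is untouched
    for j, tok in enumerate(var_token_ids):
        if G[j] == 0:
            merged[tok] = False
    return merged


# Toy data: y depends only on the first input; the second is pure noise.
X = [[1.0, 5.0], [2.0, -3.0]]
y = [2.0, 4.0]
W = [2.0, 7.0]
loss_gated = gated_objective(X, y, W, G=[1, 0], lam=0.1)
loss_open = gated_objective(X, y, W, G=[1, 1], lam=0.1)
mask = combine_masks([True, True, True, True], G=[1, 0], var_token_ids=[2, 3])
```

Closing the gate on the noisy input both lowers the objective and removes that variable's token from the search, which is how the gating step shrinks the space the RL policy has to explore.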
### Experimental Results
The experimental results show that NRSR significantly outperforms five baseline methods on high-noise benchmarks, as measured by Recovery Rate (RR), Exploration Expression Number (EEN), and Normalized Mean Squared Error (NMSE). For example, with 5 noisy inputs, NRSR reaches an RR of 89.1%, while most other methods stay below 70%.
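
For readers unfamiliar with these metrics, the sketch below shows one common way to compute RR and NMSE. The exact definitions used in the paper are not given in this summary, so the normalization by the variance of the targets and the function names are assumptions, and the inputs are toy values rather than results from the paper.

```python
def nmse(y_true, y_pred):
    """MSE divided by the variance of the targets (a common normalization)."""
    m = len(y_true)
    mean_y = sum(y_true) / m
    var_y = sum((y - mean_y) ** 2 for y in y_true) / m
    mse = sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / m
    return mse / var_y


def recovery_rate(recovered_flags):
    """Fraction of independent runs that recovered the target expression."""
    return sum(1 for f in recovered_flags if f) / len(recovered_flags)


# Toy values for illustration only.
perfect = nmse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
baseline = nmse([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])  # always predict the mean
rr = recovery_rate([True, False, True, True])
```

Under this normalization, a model that always predicts the mean of the targets scores an NMSE of 1, so values well below 1 indicate a fit that explains most of the variance.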
### Conclusion
The NRSR method proposed in this study performs strongly on high-noise data and also reaches state-of-the-art performance on clean data, demonstrating its robustness and effectiveness in symbolic regression tasks.