Lexicase-based Selection Methods with Down-sampling for Symbolic Regression Problems: Overview and Benchmark

Alina Geiger, Dominik Sobania, Franz Rothlauf
2024-07-31
Abstract: In recent years, several new lexicase-based selection variants have emerged due to the success of standard lexicase selection in various application domains. For symbolic regression problems, variants that use an epsilon-threshold or batches of training cases, among others, have led to performance improvements. Lately, especially variants that combine lexicase selection and down-sampling strategies have received a lot of attention. This paper evaluates random as well as informed down-sampling in combination with the relevant lexicase-based selection methods on a wide range of symbolic regression problems. In contrast to most work, we not only compare the methods over a given evaluation budget, but also over a given time as time is usually limited in practice. We find that for a given evaluation budget, epsilon-lexicase selection in combination with random or informed down-sampling outperforms all other methods. Only for a rather long running time of 24h, the best performing method is tournament selection in combination with informed down-sampling. If the given running time is very short, lexicase variants using batches of training cases perform best.
Neural and Evolutionary Computing
What problem does this paper attempt to address?
This paper evaluates and compares selection methods for symbolic regression problems, with a focus on lexicase-based selection combined with down-sampling strategies. Specifically, the authors address two questions:

1. **Which selection method performs best for a given evaluation budget?**
2. **Which selection method performs best for a given time budget?**

### Main Research Content

To answer these questions, the authors study the following:

- **Selection Methods**: tournament selection, lexicase selection, ε-lexicase selection (which uses an ε-threshold), batch-tournament selection, and batch-ε-lexicase selection.
- **Down-Sampling Strategies**: random down-sampling and informed down-sampling. Down-sampling reduces the number of training cases used in each generation, so the saved evaluations can be spent on a longer search or a larger population.

### Research Findings

1. **For a Given Evaluation Budget**:
   - ε-lexicase selection combined with random or informed down-sampling (ε-lex with rds or ids) performs best.
2. **For a Given Time Budget**:
   - For a 24-hour budget, the best method is tournament selection combined with informed down-sampling (tourn with ids).
   - For almost all of the studied selection methods, informed down-sampling outperforms random down-sampling within the 24-hour budget.
   - If the running time is very short (for example, 15 minutes), the batch-based variants (such as batch-tournament selection and batch-ε-lexicase selection) perform best.
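
The trade-off behind down-sampling described under Main Research Content can be illustrated with a small sketch. It assumes a budget counted in single evaluations of one individual on one training case. The function names, the budget, the population size, and the sampling rate are hypothetical values chosen for illustration and are not taken from the paper.

```python
import random


def random_down_sample(cases, rate, rng):
    """Draw a random subset of the training cases without replacement.

    `rate` is the fraction of cases to keep (an illustrative parameter).
    """
    k = max(1, round(rate * len(cases)))
    return rng.sample(cases, k)


def generations_for_budget(budget, pop_size, n_cases, rate):
    """Generations affordable when every individual is evaluated only on
    the down-sampled cases of the current generation."""
    k = max(1, round(rate * n_cases))
    return budget // (pop_size * k)


rng = random.Random(0)
cases = list(range(100))                      # indices of 100 training cases
subset = random_down_sample(cases, 0.1, rng)  # fresh subset for one generation

# Hypothetical budget: 1,000,000 case evaluations, population of 500.
full = generations_for_budget(1_000_000, 500, 100, 1.0)     # no down-sampling
sampled = generations_for_budget(1_000_000, 500, 100, 0.1)  # keep 10% of cases
print(len(subset), full, sampled)
```

With these assumed numbers, keeping 10% of the cases stretches the same budget from 20 to 200 generations. Informed down-sampling differs only in how the subset is chosen, which this sketch does not cover.
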
### Summary

This paper gives researchers and practitioners a comprehensive comparison of selection methods and down-sampling strategies to help them choose a selection method for symbolic regression problems. Lexicase-based selection combined with down-sampling performs best when the methods are compared over a given evaluation budget. For a long running time of 24 hours, however, tournament selection combined with informed down-sampling performs best.

### Formula Examples

- **Mean Squared Error (MSE)**:
  \[ \text{MSE}(T)=\frac{1}{|T|}\sum_{t\in T}(y_t - \hat{y}_t)^2 \]
  where \(\hat{y}_t\) is the output predicted by an individual for training case \(t\), and \(y_t\) is the desired output.
- **ε Value Calculation**:
  \[ \epsilon_t=\text{median}(|e_t-\text{median}(e_t)|) \]
  where \(e_t\) is the vector of errors of all individuals in the candidate pool \(C\) on the current training case \(t\).

These formulas underlie the comparison: the MSE quantifies how well an individual fits the training cases \(T\), and \(\epsilon_t\) is the threshold used by ε-lexicase selection.
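
The two formulas above can be put together in a minimal sketch of one ε-lexicase selection event. It assumes the commonly used pass criterion in which an individual survives a case if its error is within \(\epsilon_t\) of the best error in the current candidate pool; the summary does not state this criterion, so treat it as an assumption. All function and variable names are hypothetical.

```python
import random
import statistics


def mse(y_true, y_pred):
    """Mean squared error over a set of training cases."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def mad(values):
    """Median absolute deviation: median(|e - median(e)|)."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])


def epsilon_lexicase_select(errors, rng):
    """Select one parent index.

    errors[i][t] is the absolute error of individual i on training case t.
    Assumed pass criterion: error <= best error in the pool + epsilon_t,
    with epsilon_t recomputed on the current candidate pool.
    """
    candidates = list(range(len(errors)))
    cases = list(range(len(errors[0])))
    rng.shuffle(cases)  # a fresh random case ordering per selection event
    for t in cases:
        pool_errors = [errors[i][t] for i in candidates]
        threshold = min(pool_errors) + mad(pool_errors)
        candidates = [i for i in candidates if errors[i][t] <= threshold]
        if len(candidates) == 1:
            break
    return rng.choice(candidates)  # break remaining ties at random


# Three individuals, two training cases; individual 0 is clearly best.
errors = [[0.0, 0.0], [5.0, 5.0], [9.0, 9.0]]
parent = epsilon_lexicase_select(errors, random.Random(1))
print(parent, mad([0.0, 5.0, 9.0]), mse([1.0, 2.0], [1.0, 4.0]))
```

In this toy example the first case considered already filters the pool down to individual 0, because the other errors exceed the best error plus the median absolute deviation. Combining this with down-sampling would mean passing only the error columns of the sampled cases.
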