Lexidate: Model Evaluation and Selection with Lexicase

Jose Guadalupe Hernandez, Anil Kumar Saini, Jason H. Moore
2024-06-18
Abstract: Automated machine learning streamlines the task of finding effective machine learning pipelines by automating model training, evaluation, and selection. Traditional evaluation strategies, like cross-validation (CV), generate one value that averages the accuracy of a pipeline's predictions. This single value, however, may not fully describe the generalizability of the pipeline. Here, we present Lexicase-based Validation (lexidate), a method that uses multiple, independent prediction values for selection. Lexidate splits training data into a learning set and a selection set. Pipelines are trained on the learning set and make predictions on the selection set. The predictions are graded for correctness and used by lexicase selection to identify parent pipelines. Compared to 10-fold CV, lexicase reduces the training time. We test the effectiveness of three lexidate configurations within the Tree-based Pipeline Optimization Tool 2 (TPOT2) package on six OpenML classification tasks. In one configuration, we detected no difference in the accuracy of the final model returned from TPOT2 on most tasks compared to 10-fold CV. All configurations studied here returned similar or less complex final pipelines compared to 10-fold CV.
Neural and Evolutionary Computing
What problem does this paper attempt to address?
This paper addresses the limitations of traditional model evaluation and selection methods, such as cross-validation (CV), in automated machine learning (AutoML). Although 10-fold CV is effective, it has several drawbacks:

1. **Limitations of a single performance metric**: Traditional methods usually produce one averaged value to represent a model's prediction accuracy, which may not fully describe the model's ability to generalize.
2. **Low computational efficiency**: 10-fold CV trains and validates each model multiple times, which increases computation time and resource consumption.
3. **Overfitting risk**: Fixed data partitioning may lead to overfitting, especially on small data sets.

To address these problems, the authors introduce a method based on lexicase selection, **Lexicase-based Validation (lexidate)**. Its main features are:

- **Multidimensional evaluation**: Multiple independent prediction values are used for selection instead of a single average.
- **Improved computational efficiency**: Each pipeline is trained fewer times, which lowers the computational cost.
- **More flexible selection mechanism**: Lexicase selection can apply selection pressure on more difficult individual cases without sacrificing overall performance.

To verify the effectiveness of lexidate, the authors compared it with 10-fold CV and tested three lexidate configurations (90/10, 70/30, and 50/50 data partitioning) on six OpenML classification tasks. In one configuration, no difference in final-model accuracy was detected on most tasks compared with 10-fold CV. All three configurations returned final pipelines of similar or lower complexity than those found with 10-fold CV, and lexidate reduced training time.
In summary, this paper proposes lexidate, a model evaluation and selection method intended to overcome the limitations of traditional methods in AutoML, particularly with respect to computational efficiency and model generalization.
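
The procedure described above (split the training data, grade each pipeline's predictions on the selection set, then pick parents with lexicase selection) can be illustrated with a minimal Python sketch. This is not the TPOT2 implementation. The function names `split_learning_selection`, `grade`, and `lexicase_select` are hypothetical, the split is a plain random shuffle rather than whatever the authors use, the 90/10 configuration is assumed to mean 90% learning and 10% selection, and the predictions are toy values rather than results from the paper.

```python
import random


def split_learning_selection(n_samples, selection_fraction, rng):
    """Shuffle sample indices and split them into (learning, selection) lists.

    Assumption: a simple random split; the paper may split differently.
    """
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_selection = round(n_samples * selection_fraction)
    return indices[n_selection:], indices[:n_selection]


def grade(predictions, labels):
    """Per-case correctness: 1 where the prediction matches the label, else 0."""
    return [int(p == y) for p, y in zip(predictions, labels)]


def lexicase_select(scores, rng):
    """Return the index of one parent chosen by lexicase selection.

    `scores[i][c]` is candidate i's score on selection case c. Cases are
    visited in random order; at each case only candidates with the best
    score on that case survive. Remaining ties are broken at random.
    """
    case_order = list(range(len(scores[0])))
    rng.shuffle(case_order)
    candidates = list(range(len(scores)))
    for case in case_order:
        best = max(scores[i][case] for i in candidates)
        candidates = [i for i in candidates if scores[i][case] == best]
        if len(candidates) == 1:
            break
    return rng.choice(candidates)


rng = random.Random(0)

# Assumed 90/10 configuration: 90% learning set, 10% selection set.
learning_idx, selection_idx = split_learning_selection(100, 0.10, rng)

# Hypothetical predictions from three already-trained pipelines on a
# four-sample selection set (toy data, not results from the paper).
selection_labels = [1, 0, 1, 1]
pipeline_predictions = [
    [1, 0, 1, 1],  # correct on every case
    [1, 0, 0, 1],  # misses the third case
    [0, 0, 1, 1],  # misses the first case
]
scores = [grade(p, selection_labels) for p in pipeline_predictions]

parent = lexicase_select(scores, rng)
print(len(learning_idx), len(selection_idx), parent)  # prints: 90 10 0
```

In this toy example the first pipeline is correct on every case, so it survives every filtering step regardless of the case order. When no pipeline dominates, the random case order decides which one is chosen, and that is how pipelines that solve different subsets of cases can still become parents. Each pipeline is also trained only once on the learning set here, whereas 10-fold CV trains it ten times.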