Sevil Zanjani Miyandoab, Shahryar Rahnamayan, Azam Asilian Bidgoli, Sevda Ebrahimi, Masoud Makrehchi
Abstract: Feature selection plays a pivotal role in the data preprocessing and model-building pipeline, significantly enhancing model performance, interpretability, and resource efficiency across diverse domains. In population-based optimization methods, the generation of diverse individuals holds utmost importance for adequately exploring the problem landscape, particularly in highly multi-modal multi-objective optimization problems. Our study reveals that, in line with findings from several prior research papers, commonly employed crossover and mutation operations lack the capability to generate high-quality diverse individuals and tend to become confined to limited areas around various local optima. This paper introduces an augmentation to the diversity of the population in the well-established multi-objective scheme of the genetic algorithm, NSGA-II. This enhancement is achieved through two key components: the genuine initialization method and the substitution of the worst individuals with new randomly generated individuals as a re-initialization approach in each generation. The proposed multi-objective feature selection method undergoes testing on twelve real-world classification problems, with the number of features ranging from 2,400 to nearly 50,000. The results demonstrate that replacing the last front of the population with an equivalent number of new random individuals generated using the genuine initialization method and featuring a limited number of features substantially improves the population's quality and, consequently, enhances the performance of the multi-objective algorithm.
What problem does this paper attempt to address?
### The problems the paper attempts to solve
The paper addresses insufficient population diversity in multi-objective feature selection. In population-based optimization, generating diverse individuals is essential for exploring the problem space thoroughly, especially in highly multi-modal multi-objective problems. However, the commonly used crossover and mutation operators struggle to produce diverse, high-quality individuals and tend to become trapped in limited regions around local optima. The paper therefore enhances population diversity in NSGA-II through two key components: a genuine initialization method, and a re-initialization step that replaces the worst individuals with newly generated random individuals in each generation. The method was tested on 12 real-world classification problems with between 2,400 and nearly 50,000 features. The results show that replacing the last front of the population with the same number of new random individuals, generated by the genuine initialization method and limited in the number of selected features, substantially improves the quality of the population and thereby the performance of the multi-objective algorithm.
### Specific problem description
1. **Importance of feature selection**:
   - Feature selection plays a crucial role in the data preprocessing and model-building pipeline and can significantly improve model performance, interpretability, and resource efficiency.
   - Removing irrelevant and redundant information reduces computational requirements, improves classifier performance by alleviating the curse of dimensionality, and simplifies model interpretation.
2. **Limitations of existing methods**:
   - The commonly used crossover and mutation operators struggle to generate diverse, high-quality individuals and tend to become trapped around local optima.
   - This limitation is most pronounced in highly multi-modal multi-objective problems, where the search space must be explored thoroughly to reach the globally optimal solutions.
3. **The paper's solution**:
   - The paper enhances population diversity in NSGA-II through two key components:
     - **Genuine initialization method**: gives the initial population a high degree of diversity.
     - **Replacing the worst individuals**: in each generation, the worst individuals (the last front of the population) are replaced with newly generated random individuals to keep the population diverse.
   - Together, these components let the algorithm explore a wider region of the search space and avoid premature convergence, which improves multi-objective optimization performance.
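
The re-initialization step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: individuals are assumed to be binary feature masks, both objectives are assumed to be minimized, and `random_sparse_individual` is a hypothetical stand-in for the genuine initialization method, whose details are not given in this summary. The parameter `max_selected` is likewise an assumed way of limiting the number of selected features.

```python
import random

def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all objectives minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def last_front(objs):
    """Indices of the worst front, found by peeling non-dominated fronts one at a time."""
    remaining = set(range(len(objs)))
    front = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(objs[j], objs[i]) for j in remaining if j != i)]
        remaining -= set(front)
    return sorted(front)

def random_sparse_individual(n_features, max_selected, rng):
    """Hypothetical stand-in for the genuine initialization method:
    a binary mask with between 1 and max_selected features switched on."""
    k = rng.randint(1, max_selected)
    mask = [0] * n_features
    for idx in rng.sample(range(n_features), k):
        mask[idx] = 1
    return mask

def reinitialize_last_front(population, objs, max_selected, rng):
    """Return a copy of the population whose last front is replaced
    by the same number of sparse random individuals."""
    n_features = len(population[0])
    new_pop = [ind[:] for ind in population]
    for i in last_front(objs):
        new_pop[i] = random_sparse_individual(n_features, max_selected, rng)
    return new_pop
```

In a full NSGA-II loop, a step like this would run once per generation, and the replaced individuals would need to be re-evaluated before the next selection. Note that if the whole population forms a single front, this sketch replaces every individual; a practical implementation would likely guard against that case.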
### Experimental results
- **Experimental setup**:
  - The method is tested on 12 real-world classification problems, with the number of features ranging from 2,400 to nearly 50,000.
  - Each algorithm is run 31 times. In each run, a random 20% of the data is held out as the test set (repeated random sub-sampling validation, also known as Monte Carlo cross-validation).
  - The number of function calls is fixed at 15,000 to ensure a fair comparison.
- **Performance evaluation**:
  - Hypervolume (HV) is used as the multi-objective evaluation metric, with the reference point set to (1, 1).
  - The proposed method achieves a significantly better HV than conventional NSGA-II on all data sets, with an average HV of 0.97.
  - Replacing the worst individuals significantly increases population diversity and improves the exploration ability of the algorithm.
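
To make the HV metric concrete, here is a minimal sketch of a two-objective hypervolume computation with reference point (1, 1). It assumes both objectives are minimized and scaled to [0, 1]; the summary does not state the objectives explicitly, so that is an assumption, and the function name `hypervolume_2d` is illustrative rather than taken from the paper.

```python
def hypervolume_2d(points, ref=(1.0, 1.0)):
    """Area dominated by `points` and bounded by `ref` (both objectives minimized)."""
    # Only points strictly better than the reference point contribute.
    pts = sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1])
    area = 0.0
    best_f2 = ref[1]
    for f1, f2 in pts:  # ascending in the first objective
        if f2 < best_f2:  # dominated points add no new area and are skipped
            area += (ref[0] - f1) * (best_f2 - f2)
            best_f2 = f2
    return area
```

Under these assumptions, a single ideal solution at (0, 0) would give the maximum HV of 1, so an HV close to 1 indicates a front that lies near the ideal point in both objectives.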
### Conclusion
The paper addresses insufficient diversity in multi-objective feature selection by introducing the genuine initialization method and a strategy of replacing the worst individuals in each generation, which significantly improves the performance of NSGA-II. The experimental results on 12 real-world data sets show that the method explores the search space more thoroughly, avoids premature convergence, and therefore obtains better multi-objective optimization solutions.