Random Forest Variable Importance-based Selection Algorithm in Class Imbalance Problem

Yunbi Nam, Sunwoo Han
2023-12-17
Abstract: Random Forest (RF) is a machine learning method that offers many advantages, including the ability to easily measure variable importance. Class balancing is a well-known remedy for the class imbalance problem, but its effect on RF variable importance has not been actively studied. In this paper, we study the effect of class balancing on RF variable importance. Our simulation results show that over-sampling is effective in correctly measuring variable importance in class-imbalanced situations with small sample sizes, while under-sampling fails to differentiate important from non-informative variables. We then propose a variable selection algorithm that uses RF variable importance and its confidence interval. Through an experimental study on many real and artificial datasets, we demonstrate that the proposed algorithm efficiently selects an optimal feature set, leading to improved prediction performance in class imbalance problems.
Machine Learning, Methodology
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper addresses how variable importance is measured in Random Forest (RF) under class imbalance. Specifically:

1. **Bias in Variable Importance**:
   - Under class imbalance, traditional variable importance measures (such as Permutation Accuracy Importance, which is based on classification accuracy) may fail to distinguish important variables from non-informative ones.
   - The authors point out that Permutation AUC Importance (based on the Area Under the ROC Curve) performs better in small-sample scenarios but still has limitations.
2. **Effectiveness of Class Balancing Techniques**:
   - The authors investigate the impact of over-sampling and under-sampling on RF variable importance.
   - In simulation experiments, over-sampling improves the accuracy of variable importance measurement in small-sample scenarios, while under-sampling fails to distinguish important variables from non-informative ones.
3. **Improvement of Variable Selection Algorithm**:
   - Building on these findings, the authors propose a new variable selection algorithm that uses RF variable importance and its confidence intervals to efficiently select the optimal feature set.
   - Experimental validation shows that this algorithm performs well in class imbalance problems and significantly improves prediction performance.

### Main Contributions

- **Theoretical Analysis**: A detailed discussion of the impact of class imbalance on RF variable importance, examining Permutation AUC Importance as a remedy.
- **Experimental Validation**: Extensive simulation experiments and real datasets validate the effectiveness of over-sampling and demonstrate the limitations of under-sampling.
- **Algorithm Innovation**: A new variable selection algorithm combining RF variable importance and confidence intervals is proposed, which efficiently selects the optimal feature set in class imbalance problems, thereby improving prediction performance.

### Conclusion

Through theoretical analysis and experiments, this paper demonstrates that over-sampling improves the accuracy of RF variable importance measurement in small-sample, class-imbalanced scenarios. The proposed variable selection algorithm performs comparably to existing methods on class-balanced problems and significantly outperforms existing classification accuracy-based methods on class-imbalanced problems.
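
The three ingredients summarized above (over-sampling the minority class, permutation AUC importance, and selection based on a confidence interval) can be illustrated with a minimal, standard-library-only sketch. This is not the authors' algorithm, whose details are not given here. It makes several assumptions: `score_fn` is a hypothetical stand-in for a fitted forest's class-1 probability, the interval is a normal approximation over repeated permutations, and the importance is scored on the same rows that are permuted rather than on out-of-bag samples. All function names are illustrative.

```python
import random
import statistics


def random_oversample(X, y, rng):
    """Duplicate randomly chosen minority-class rows until both classes have equal size."""
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = list(range(len(y))) + extra
    return [X[i] for i in idx], [y[i] for i in idx]


def auc(y, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, label in zip(scores, y) if label == 1]
    neg = [s for s, label in zip(scores, y) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_auc_importance(score_fn, X, y, n_repeats, rng):
    """For each feature, return the AUC drops observed over n_repeats shuffles of its column."""
    baseline = auc(y, [score_fn(row) for row in X])
    drops = []
    for j in range(len(X[0])):
        feature_drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            feature_drops.append(baseline - auc(y, [score_fn(r) for r in X_perm]))
        drops.append(feature_drops)
    return drops


def select_by_confidence_interval(drops, z=1.96):
    """Keep features whose normal-approximation CI lower bound for the mean drop exceeds 0."""
    selected = []
    for j, d in enumerate(drops):
        mean = statistics.mean(d)
        half_width = z * statistics.stdev(d) / len(d) ** 0.5
        if mean - half_width > 0:
            selected.append(j)
    return selected


rng = random.Random(0)

# Imbalanced toy data: 200 negatives, 20 positives.
# Feature 0 is informative (its mean shifts with the class); feature 1 is pure noise.
X, y = [], []
for label, n in ((0, 200), (1, 20)):
    for _ in range(n):
        X.append([rng.gauss(2.0 * label, 1.0), rng.gauss(0.0, 1.0)])
        y.append(label)

X_bal, y_bal = random_oversample(X, y, rng)


def score_fn(row):
    # Assumption: stand-in for a fitted model that relies only on feature 0.
    return row[0]


drops = permutation_auc_importance(score_fn, X_bal, y_bal, n_repeats=20, rng=rng)
selected = select_by_confidence_interval(drops)
print("selected feature indices:", selected)
```

In practice `score_fn` would wrap a trained random forest, and the threshold on the interval's lower bound is a design choice: requiring it to exceed zero keeps only variables whose importance is distinguishable from permutation noise.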