An Improvement to Feature Selection of Random Forests on Spark

Ke Sun, Wansheng Miao, Xin Zhang, Ruonan Rao
DOI: https://doi.org/10.1109/CSE.2014.159
2014-01-01
Abstract: The Random Forests algorithm belongs to the class of ensemble learning methods, which are commonly used in classification problems. In this paper, we studied the problem of adopting the Random Forests algorithm to learn from raw data in real usage scenarios. We propose an improvement that is stable, strict, highly efficient, data-driven, problem-independent, and has no impact on algorithm performance, to address two practical issues in feature selection for the Random Forests algorithm. The first is to eliminate noisy features, which are irrelevant to the classification. The second is to eliminate redundant features, which are highly related to other features but add no useful information. We implemented our approach on Spark. Experiments were performed to evaluate the improvement, and the results show that our approach achieves ideal performance.