Evaluation of Feature Selection Methods Using Bagging and Boosting Ensemble Techniques on High Throughput Biological Data

Jiamin Wu, Shengjia Chen, Wenbin Zhou, Ningya Wang, Ziling Fan
DOI: https://doi.org/10.1145/3397391.3397403
2020-01-01
Abstract: Feature selection has become a basic but essential step in analyzing high-throughput biological data, because such data typically have many features and few samples (large p, small n). In recent years, ensemble-learning-based feature selection methods have been widely proposed and studied. Ensemble methods combine multiple learning algorithms to obtain better predictive performance than any constituent algorithm achieves alone. In addition, features selected by ensemble classifiers can yield more accurate classification and more robust results. In our work, the bagging algorithm Random Forest (RF) and the boosting algorithms Gradient Boosting Decision Tree (GBDT) and Extreme Gradient Boosting (XGBoost) are the main objects of study. We compared the accuracy and robustness of the three algorithms on six different datasets from the TCGA database. The three feature selection algorithms were also further ensembled using a bagging procedure, for comparison with the original single classifiers. Our results indicate that, among the single base feature selectors, both boosting algorithms outperform the bagging algorithm in accuracy and robustness. Applying the bagging-based feature selection procedure significantly improves the robustness of all three base feature selectors but slightly reduces their accuracy. GBDT with the bagging-based feature selection procedure achieved the best performance under our proposed comprehensive metric, which weights accuracy and robustness equally.
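The bagging-based feature selection procedure and the notion of robustness mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the aggregation rule, the robustness measure, or any parameter values, so everything below is an assumption made for illustration: the base selector `mean_gap_score` is a simple stand-in for the importance scores an RF, GBDT, or XGBoost model would supply, aggregation is done by vote counting over bootstrap samples, and robustness is approximated by the Jaccard overlap between two selected feature sets. All function names, the toy data, and the values of `k` and `n_bags` are hypothetical.

```python
import random
from statistics import mean


def mean_gap_score(X, y):
    """Stand-in base selector (assumption): absolute gap between class means.

    In the paper's setting this role would be played by the feature
    importances of a fitted RF, GBDT, or XGBoost model.
    """
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        # A bootstrap sample may miss a class entirely; score 0 in that case.
        scores.append(abs(mean(pos) - mean(neg)) if pos and neg else 0.0)
    return scores


def top_k(scores, k):
    """Indices of the k highest scores (ties broken by lower index)."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:k]


def bagged_selection(X, y, scorer, k, n_bags=30, seed=0):
    """Run the base selector on bootstrap samples and aggregate by votes."""
    rng = random.Random(seed)
    n = len(X)
    votes = [0] * len(X[0])
    for _ in range(n_bags):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        Xb = [X[i] for i in idx]
        yb = [y[i] for i in idx]
        for j in top_k(scorer(Xb, yb), k):
            votes[j] += 1
    return top_k(votes, k)


def jaccard(a, b):
    """Overlap of two feature sets; used here as a simple robustness proxy."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


# Hypothetical toy data: 40 samples, 5 features, only feature 0 is informative.
data_rng = random.Random(1)
y = [i % 2 for i in range(40)]
X = [[5.0 * label + data_rng.gauss(0, 1)] + [data_rng.gauss(0, 1) for _ in range(4)]
     for label in y]

selected_a = bagged_selection(X, y, mean_gap_score, k=2, seed=0)
selected_b = bagged_selection(X, y, mean_gap_score, k=2, seed=1)
stability = jaccard(selected_a, selected_b)
```

Comparing the feature sets from two independently seeded runs, as `stability` does here, is one simple way to quantify how much the selection changes under resampling; the paper's own robustness metric may differ.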