Hybrid Dimensionality Reduction Forest with Pruning for High-Dimensional Data Classification
Weihong Chen, Yuhong Xu, Zhiwen Yu, Wenming Cao, C. L. Philip Chen, Guoqiang Han
DOI: https://doi.org/10.1109/access.2020.2975905
IF: 3.9
2020-01-01
IEEE Access
Abstract: The classification of high-dimensional data is a challenge in machine learning. Traditional classifier ensemble methods for high-dimensional data improve the diversity of classifiers through either dimensionality reduction or sample selection. However, these methods have two limitations: 1) dimensionality reduction methods easily cause information loss, which decreases accuracy; 2) sample selection methods are susceptible to noise and redundant features. To address these limitations, we propose a novel hybrid dimensionality reduction forest (HDRF) that increases the diversity of an integrated system in both feature space and sample space. First, a tree-based feature selection algorithm is employed to partition effective features. Then the Bagging method is applied to obtain diverse training subsets. To fully retain and mine the important information in the unselected samples, a sample-feature based transformation process (SFTP) is proposed to generate extended features. Since PCA can effectively reduce dimensionality and remove noisy features, it is applied to compress the unselected features and the extended features into new features that are compact and compensatory. Further, a novel classifier ensemble pruning framework (HDRFPF) based on HDRF is designed to remove redundant and invalid classifiers. Experimental results on 23 high-dimensional data sets verify that our method outperforms mainstream classifier ensemble methods, obtaining better results on 19 of the 23 data sets.
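The pipeline in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: it covers only the tree-based feature partition, Bagging, and PCA compression of the unselected features. SFTP and the HDRFPF pruning step are omitted because the abstract does not specify them. The function names, the use of scikit-learn's ExtraTreesClassifier for feature ranking, the decision-tree base learners, majority voting, and every parameter value are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def fit_hybrid_forest(X, y, n_members=15, n_selected=10, n_components=5, seed=0):
    """Hypothetical helper: simplified hybrid ensemble, not the paper's HDRF."""
    rng = np.random.RandomState(seed)
    # Tree-based feature selection: rank features by impurity importance and
    # split them into a selected block and an unselected block (assumed scheme).
    ranker = ExtraTreesClassifier(n_estimators=50, random_state=seed).fit(X, y)
    order = np.argsort(ranker.feature_importances_)[::-1]
    selected, unselected = order[:n_selected], order[n_selected:]
    members = []
    for _ in range(n_members):
        # Bagging: bootstrap sample of the training rows.
        rows = rng.randint(0, X.shape[0], size=X.shape[0])
        Xb, yb = X[rows], y[rows]
        # PCA compresses the unselected features instead of discarding them.
        pca = PCA(n_components=n_components).fit(Xb[:, unselected])
        Zb = np.hstack([Xb[:, selected], pca.transform(Xb[:, unselected])])
        tree = DecisionTreeClassifier(random_state=rng.randint(1 << 30)).fit(Zb, yb)
        members.append((pca, tree))
    return selected, unselected, members


def predict_hybrid_forest(selected, unselected, members, X):
    """Majority vote over members; labels are assumed to be non-negative ints."""
    votes = np.stack([
        tree.predict(np.hstack([X[:, selected], pca.transform(X[:, unselected])]))
        for pca, tree in members
    ]).astype(int)
    return np.array([np.bincount(col).argmax() for col in votes.T])


# Synthetic data stands in for a high-dimensional data set (assumption).
X, y = make_classification(n_samples=300, n_features=50, n_informative=8,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)
selected, unselected, members = fit_hybrid_forest(X_train, y_train)
preds = predict_hybrid_forest(selected, unselected, members, X_test)
print("held-out accuracy:", float(np.mean(preds == y_test)))
```

In this sketch the feature partition is computed once and each ensemble member differs through its bootstrap sample and its own PCA fit. Whether the paper partitions features once or per member is not stated in the abstract.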