A hybrid data-level ensemble to enable learning from highly imbalanced dataset

Zhi Chen, Jiang Duan, Li Kang, Guoping Qiu
DOI: https://doi.org/10.1016/j.ins.2020.12.023
IF: 8.1
2021-04-01
Information Sciences
Abstract: Highly imbalanced class distribution is well recognized as a major cause of performance degradation for most supervised learning algorithms. Unfortunately, such a detrimental distribution occurs inherently in many real-world applications. In this work, we developed a hybrid data-level ensemble (HD-Ensemble), which integrates ensemble learning with the union of a margin-based undersampling method and a diversity-enhancing oversampling method. The proposed undersampling method filters out a certain number of unrepresentative majority instances based on an unsupervised margin definition, while the proposed oversampling method generates diverse minority instances according to the behavior of ensemble learning. Combining the two data-level approaches serves a twofold purpose: balancing the data distribution and optimizing fundamental properties of the ensemble (e.g., margin distribution and diversity). As a result, the inferior performance caused by adopting a single data-level approach can be better addressed. Targeting the binary classification task, we evaluated HD-Ensemble on 42 highly imbalanced datasets, which varied considerably in sample number (from 129 to 20,034), feature number (from 3 to 5,000), and imbalance ratio (from 9.08 to 970.6). Experimental results demonstrated the performance advantages of the proposed HD-Ensemble over ten other ensemble solutions.
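The idea of combining undersampling and oversampling at the data level can be illustrated with a minimal Python sketch. It is not the authors' method: the abstract does not specify the margin definition or the diversity criterion, so plain random undersampling and interpolation between random minority pairs are swapped in as stand-ins. The function names (`imbalance_ratio`, `hybrid_resample`), the 0/1 label convention, and the geometric-mean target size are all assumptions made for illustration.

```python
import random


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count (binary 0/1 labels)."""
    n_pos = sum(1 for label in labels if label == 1)
    n_neg = len(labels) - n_pos
    return max(n_pos, n_neg) / min(n_pos, n_neg)


def hybrid_resample(X, y, rng, minority=1):
    """Build one balanced training set by shrinking the majority class and
    growing the minority class until both reach the same size.

    Assumptions (not from the paper): labels are 0/1, there are at least two
    minority rows, majority rows are dropped at random, and synthetic minority
    rows are linear interpolations between two random minority rows.
    """
    minority_rows = [x for x, label in zip(X, y) if label == minority]
    majority_rows = [x for x, label in zip(X, y) if label != minority]

    # Hypothetical target size: geometric mean of the two class sizes,
    # clamped so that it never exceeds the available majority rows.
    target = round((len(minority_rows) * len(majority_rows)) ** 0.5)
    target = max(len(minority_rows), min(target, len(majority_rows)))

    # Undersampling stand-in: keep a random subset of the majority class.
    kept_majority = rng.sample(majority_rows, target)

    # Oversampling stand-in: interpolate between random minority pairs.
    synthetic = []
    while len(minority_rows) + len(synthetic) < target:
        a, b = rng.sample(minority_rows, 2)
        t = rng.random()
        synthetic.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])

    new_minority = minority_rows + synthetic
    X_bal = kept_majority + new_minority
    y_bal = [1 - minority] * len(kept_majority) + [minority] * len(new_minority)
    return X_bal, y_bal


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy data: 90 majority rows (label 0) and 10 minority rows (label 1).
    X = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(90)]
    X += [[rng.gauss(3.0, 1.0), rng.gauss(3.0, 1.0)] for _ in range(10)]
    y = [0] * 90 + [1] * 10
    X_bal, y_bal = hybrid_resample(X, y, rng)
    print(imbalance_ratio(y), imbalance_ratio(y_bal), len(X_bal))
```

In an ensemble setting, each base learner would typically be trained on a different resampled set, for example by calling `hybrid_resample` once per member with the same `rng`, so that every call draws a different subset and different synthetic rows.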
computer science, information systems
What problem does this paper attempt to address?