Classifying Imbalanced Data Streams via Dynamic Feature Group Weighting with Importance Sampling

Ke Wu,Andrea Edwards,Wei Fan,Jing Gao,Kun Zhang
DOI: https://doi.org/10.1137/1.9781611973440.83
2014-04-01
Abstract:Data stream classification and imbalanced data learning are two important areas of data mining research. Each has been well studied to date with many interesting algorithms developed. However, only a few approaches reported in literature address the intersection of these two fields due to their complex interplay. In this work, we proposed an importance sampling driven, dynamic feature group weighting framework (DFGW-IS) for classifying data streams of imbalanced distribution. Two components are tightly incorporated into the proposed approach to address the intrinsic characteristics of concept-drifting, imbalanced streaming data. Specifically, the ever-evolving concepts are tackled by a weighted ensemble trained on a set of feature groups with each sub-classifier (i.e. a single classifier or an ensemble) weighed by its discriminative power and stable level. The un-even class distribution, on the other hand, is typically battled by the sub-classifier built in a specific feature group with the underlying distribution rebalanced by the importance sampling technique. We derived the theoretical upper bound for the generalization error of the proposed algorithm. We also studied the empirical performance of our method on a set of benchmark synthetic and real world data, and significant improvement has been achieved over the competing algorithms in terms of standard evaluation metrics and parallel running time. Algorithm implementations and datasets are available upon request.
Medicine
What problem does this paper attempt to address?