A Two-step Information Accumulation Strategy for Learning from Highly Imbalanced Data.

Bin Liu,Min Zhang,Weizhi Ma,Xin Li,Yiqun Liu,Shaoping Ma
DOI: https://doi.org/10.1145/3132847.3132940
2017-01-01
Abstract:Highly imbalanced data is common in the real world and it is important but difficult to train an effective classifier. In this paper, Our major point is that the imbalance is the observed phenomenon but not the cause of the problem. The challenge is that useful information is been overshadowed in the large scale of data in both majority and minority classes. We propose a novel two-step strategy, Information Accumulation, which first selects the most discriminative data by the Zooming-in phase, and then leverages unlabeled data by pseudo active learning and self-training in the phase of Learning from Learned Results. Comparative experiments are conducted on large-scale highly imbalanced real customer service data on complaint detection task (where less than 2% of data is positive). The results on eight state-of-the-art classification algorithms show that significant improvements are observed on the performances of all algorithms with Information Accumulation(for example, the F-Measure score of Xgboost is increased by 197% from 0.115 to 0.347), which demonstrates the effectiveness and general applicability of the proposed strategy. This work explores a new idea on dealing with highly imbalanced data that we do not aim to balance the training examples as usual, but focus on finding the most discriminative information from labeled data and the learning results of unlabeled data.
What problem does this paper attempt to address?