A Cascade-based Classification Method for Class-imbalanced Data
Liu Xu-Ying, Wu Jian-Xin, Zhou Zhi-Hua
DOI: https://doi.org/10.3321/j.issn:0469-5097.2006.02.005
2006-01-01
Abstract: In machine learning and data mining, many factors can influence the performance of a learning system in real-world applications. Class imbalance is one of them: training examples in one class heavily outnumber those in another. Classifiers generally have difficulty learning the concept of the minority class. In many applications the minority class is more important than the majority class, so misclassifying minority examples can cause great loss. The face detection problem exhibits severe class imbalance, which greatly decreases detection speed, and the cascade structure was proposed to accelerate the learning process. A cascade is a classifier system consisting of a sequence of n node classifiers. At the beginning, all training examples are available to train the first node classifier. Then all positive examples and only a subset of the negative examples are passed to the next node; the negative examples correctly classified by the first node are dropped. This procedure repeats until all node classifiers are trained. A test example is passed to the next node if the current node recognizes it as positive; otherwise it is rejected immediately as negative. The learning goal of a cascade node classifier, however, differs from that of a usual classifier: every node aims for a high detection rate and only a moderate false alarm rate. Because a test example is labeled positive only if every node accepts it, moderate per-node false alarm rates compound into a low overall false alarm rate, while the overall detection rate stays high as long as each node's detection rate is high. Each time training examples are passed to the next node, some negatives are dropped, so the training set contains fewer negatives than at the previous node. From the perspective of class imbalance, this means a more balanced training set than at earlier nodes. In the early nodes of a cascade the learning goal is quite easy to achieve, i.e., to train a classifier with a high detection rate and only a moderate false alarm rate. It becomes harder in deeper nodes, however, since the negative examples there are false positives from previous nodes and are difficult to separate from the positive examples. There is another difference between face detection and general class imbalance problems: hundreds of thousands of features are available to classifiers in the former, but not in the latter. In general class imbalance problems, a classifier in a deeper node may therefore not easily achieve both a high detection rate and a moderate false alarm rate, so cascade-style testing may not be appropriate. Instead of testing new examples sequentially through the cascade, we combine all the node classifiers into an ensemble classifier and propose a cascade-based classification algorithm, BalanceCascade, to deal with class imbalance problems. In particular, BalanceCascade employs AdaBoost to train the classifier in each node, which is a weighted combination of several weak learners. The weak learners within all node classifiers are then collected to form the final ensemble without changing their original weights. Experimental results show that the method can effectively improve the classification performance on imbalanced data sets, especially when classification performance is heavily affected by class imbalance.
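The procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract fixes only the overall scheme: AdaBoost at each node, correctly classified negatives dropped between nodes, and all weak learners pooled with their original weights. Everything else here is an assumption made for illustration: the function names (fit_stump, adaboost, balance_cascade, predict), the use of decision stumps as weak learners, the choice of a node threshold that lets every training positive pass, and the use of the summed node thresholds as the ensemble's decision threshold.

```python
import math

def fit_stump(X, y, w):
    """Exhaustively pick the decision stump with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({x[j] for x in X}):
            for pol in (1, -1):
                err = sum(wi for x, yi, wi in zip(X, y, w)
                          if (pol if x[j] >= t else -pol) != yi)
                if best is None or err < best[0]:
                    best = (err, (j, t, pol))
    return best

def stump_vote(stump, x):
    """Vote +1 or -1 depending on which side of the threshold x falls."""
    j, t, pol = stump
    return pol if x[j] >= t else -pol

def adaboost(X, y, rounds):
    """Discrete AdaBoost over stumps; returns [(alpha, stump), ...]."""
    w = [1.0 / len(X)] * len(X)
    learners = []
    for _ in range(rounds):
        err, stump = fit_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) on perfect stumps
        alpha = 0.5 * math.log((1 - err) / err)
        learners.append((alpha, stump))
        # Re-weight: misclassified examples gain weight, then normalise.
        w = [wi * math.exp(-alpha * yi * stump_vote(stump, x))
             for x, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return learners

def node_score(learners, x):
    """Weighted vote of one node's weak learners."""
    return sum(alpha * stump_vote(stump, x) for alpha, stump in learners)

def balance_cascade(pos, neg, max_nodes=5, rounds=5):
    """Train nodes in cascade order; returns [(learners, theta), ...]."""
    nodes = []
    neg = list(neg)
    while neg and len(nodes) < max_nodes:
        X = pos + neg
        y = [1] * len(pos) + [-1] * len(neg)
        learners = adaboost(X, y, rounds)
        # Assumption: lower the node threshold until every training positive passes.
        theta = min(node_score(learners, x) for x in pos)
        nodes.append((learners, theta))
        # Drop negatives this node rejects; only its false positives move on.
        neg = [x for x in neg if node_score(learners, x) >= theta]
    return nodes

def predict(nodes, x):
    """Pooled vote of all weak learners; no sequential rejection at test time."""
    total, threshold = 0.0, 0.0
    for learners, theta in nodes:
        total += node_score(learners, x)  # original alphas are kept unchanged
        threshold += theta                # assumption: sum of node thresholds
    return 1 if total >= threshold else -1

if __name__ == "__main__":
    minority = [[4.0], [4.5], [5.0]]               # positive (minority) class
    majority = [[i * 0.1] for i in range(30)]      # negative (majority) class
    model = balance_cascade(minority, majority)
    print(predict(model, [4.2]), predict(model, [1.0]))
```

The point of the sketch is the contrast in predict: a cascade-style test would stop at the first node that rejects an example, whereas here every node's weak learners contribute to a single pooled vote.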