Combining Labeled And Unlabeled Data For Spam Classification

Zhen Yang, Jian Wang, Weiran Xu, Jun Guo
2007-01-01
Abstract: The considerable time and expense required to label data have prompted the development of algorithms that can build classifiers from both labeled and unlabeled data sets. In text classification, the Expectation Maximization (EM) algorithm is often used as a general framework for estimating the parameters of a probability model from labeled and unlabeled data. Unfortunately, EM suffers from the following problems: 1) the algorithm lacks robustness when the probability distribution of the labeled data is inconsistent with that of the unlabeled background data; 2) the algorithm is frequently trapped in a local optimum; 3) the stopping condition of existing algorithms is ambiguous or unworkable. In this paper, a robust improvement is proposed by casting the basic EM algorithm in a Bagging-theoretic framework. Specifically, an effective stopping condition is given using transduction. The effectiveness of the models and the feasibility of the proposed method are verified by experiments on spam detection.
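For orientation, the basic EM procedure that the abstract takes as its starting point can be sketched as follows. This is a minimal illustration, assuming a multinomial naive Bayes model with Laplace smoothing and documents given as token lists; it is not the authors' Bagging-based method. The function names (`m_step`, `e_step`, `semi_supervised_em`), the smoothing constant `alpha`, and the convergence threshold `tol` are all hypothetical choices made for this sketch, and the stopping rule is a plain placeholder, not the transduction-based condition the paper proposes.

```python
import math
from collections import defaultdict


def m_step(docs, resp, vocab, n_classes=2, alpha=1.0):
    """Re-estimate log priors and log word likelihoods from (soft) class weights."""
    class_weight = [alpha] * n_classes
    word_count = [defaultdict(float) for _ in range(n_classes)]
    total = [0.0] * n_classes
    for tokens, r in zip(docs, resp):
        for c in range(n_classes):
            class_weight[c] += r[c]
            for t in tokens:
                word_count[c][t] += r[c]
                total[c] += r[c]
    z = sum(class_weight)
    log_prior = [math.log(w / z) for w in class_weight]
    denom = [total[c] + alpha * len(vocab) for c in range(n_classes)]
    log_like = [
        {t: math.log((word_count[c][t] + alpha) / denom[c]) for t in vocab}
        for c in range(n_classes)
    ]
    return log_prior, log_like


def e_step(tokens, log_prior, log_like):
    """Return the posterior class distribution for one document."""
    # Tokens outside the training vocabulary are ignored.
    scores = [
        lp + sum(ll[t] for t in tokens if t in ll)
        for lp, ll in zip(log_prior, log_like)
    ]
    m = max(scores)  # subtract the max for numerical stability
    exp = [math.exp(s - m) for s in scores]
    z = sum(exp)
    return [e / z for e in exp]


def semi_supervised_em(labeled, labels, unlabeled, n_classes=2,
                       max_iter=50, tol=1e-4):
    """Basic EM over labeled (hard labels) and unlabeled (soft labels) documents."""
    vocab = {t for d in labeled + unlabeled for t in d}
    hard = [[1.0 if c == y else 0.0 for c in range(n_classes)] for y in labels]
    # Initialise the model from the labeled documents only.
    log_prior, log_like = m_step(labeled, hard, vocab, n_classes)
    prev = None
    for _ in range(max_iter):
        soft = [e_step(d, log_prior, log_like) for d in unlabeled]
        log_prior, log_like = m_step(labeled + unlabeled, hard + soft,
                                     vocab, n_classes)
        # Placeholder stopping rule (an assumption of this sketch):
        # stop once the soft labels change by less than tol.
        if prev is not None:
            change = max(
                (abs(a - b) for p, q in zip(prev, soft) for a, b in zip(p, q)),
                default=0.0,
            )
            if change < tol:
                break
        prev = soft
    return log_prior, log_like
```

A call such as `semi_supervised_em(labeled_docs, labels, unlabeled_docs)` returns the fitted parameters, which `e_step` can then apply to a new message. The sketch shows two of the weaknesses listed in the abstract: the result depends on the initial model fitted to the labeled set, and the stopping threshold is an arbitrary choice.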