Web Spam Detection by the Genetic Programming-based Ensemble Learning

NIU Xiaofei, MA Jun, MA Shaoping, ZHANG Dongmei
DOI: https://doi.org/10.3969/j.issn.1003-0077.2012.05.015
2012-01-01
Abstract: Web spam detection is a challenging problem for web search engines. This paper proposes a Genetic Programming-based ensemble learning approach (GPENL) to detect web spam. First, the method draws t different training sets from the original training set by under-sampling. Then, c different classification algorithms are trained on the t training sets to obtain t*c base classifiers. Finally, Genetic Programming is used to find a way of combining the t*c base classifiers. The method not only merges under-sampling with ensemble learning to improve classification performance on imbalanced datasets, but also makes it easy to integrate different types of base classifiers. Experiments on WEBSPAM-UK2006 show that the method improves classification performance whether or not the base classifiers are of the same type, and that in most cases heterogeneous classifier ensembles work better than homogeneous ones. GPENL also achieves a higher F-measure than AdaBoost, Bagging, RandomForest, Vote, the EDKC algorithm, and the method based on Prediction Spamicity.
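The three steps in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the two toy learners (`train_centroid`, `train_stump`), the operator set `OPS`, the 0.5 decision threshold, the synthetic data, and the elitist mutation-only search in `evolve` are all assumptions made here, since the abstract does not specify the classification algorithms or the Genetic Programming configuration.

```python
import random

def under_sample(data, t, rng):
    """Return t balanced sets: every minority example plus an equally sized
    random draw (without replacement) from the majority class."""
    pos = [d for d in data if d[1] == 1]
    neg = [d for d in data if d[1] == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    return [minority + rng.sample(majority, len(minority)) for _ in range(t)]

def _centroid(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def train_centroid(train):
    """Hypothetical toy learner 1: nearest class centroid over all features."""
    spam = _centroid([x for x, y in train if y == 1])
    normal = _centroid([x for x, y in train if y == 0])
    dist = lambda a, b: sum((p - q) ** 2 for p, q in zip(a, b))
    return lambda x: 1.0 if dist(x, spam) < dist(x, normal) else 0.0

def train_stump(train):
    """Hypothetical toy learner 2: threshold on feature 0, midway between class means."""
    spam = _centroid([x for x, y in train if y == 1])[0]
    normal = _centroid([x for x, y in train if y == 0])[0]
    mid = (spam + normal) / 2
    return lambda x: 1.0 if (x[0] > mid) == (spam > normal) else 0.0

def train_base(data, t, learners, rng):
    """t under-sampled sets times c learners gives t*c base classifiers."""
    return [learn(s) for s in under_sample(data, t, rng) for learn in learners]

# Assumed function set for the combiner trees (not taken from the paper).
OPS = {"avg": lambda a, b: (a + b) / 2, "max": max, "min": min}

def random_tree(n_inputs, depth, rng):
    """A GP individual: nested tuples of operators over classifier indices."""
    if depth == 0 or rng.random() < 0.3:
        return rng.randrange(n_inputs)
    op = rng.choice(sorted(OPS))
    return (op, random_tree(n_inputs, depth - 1, rng),
            random_tree(n_inputs, depth - 1, rng))

def evaluate(tree, outputs):
    """Evaluate a combiner tree on the list of base-classifier outputs."""
    if isinstance(tree, int):
        return outputs[tree]
    op, left, right = tree
    return OPS[op](evaluate(left, outputs), evaluate(right, outputs))

def f_measure(tree, classifiers, data):
    """F-measure of the spam class, predicting spam when the tree value >= 0.5."""
    tp = fp = fn = 0
    for x, y in data:
        pred = evaluate(tree, [c(x) for c in classifiers]) >= 0.5
        tp += pred and y == 1
        fp += pred and y == 0
        fn += (not pred) and y == 1
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def mutate(tree, n_inputs, rng):
    """Replace a randomly chosen subtree with a fresh random one."""
    if isinstance(tree, int) or rng.random() < 0.3:
        return random_tree(n_inputs, 2, rng)
    op, left, right = tree
    if rng.random() < 0.5:
        return (op, mutate(left, n_inputs, rng), right)
    return (op, left, mutate(right, n_inputs, rng))

def evolve(classifiers, data, rng, pop_size=20, generations=10):
    """Elitist, mutation-only search; a fuller GP would also use crossover."""
    n = len(classifiers)
    fitness = lambda tr: f_measure(tr, classifiers, data)
    pop = [random_tree(n, 3, rng) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        elite = pop[: pop_size // 2]
        pop = elite + [mutate(rng.choice(elite), n, rng)
                       for _ in range(pop_size - len(elite))]
    return max(pop, key=fitness)

rng = random.Random(0)
# Synthetic imbalanced data (assumption): 100 normal hosts, 10 spam hosts.
data = [([rng.gauss(0, 1), rng.gauss(0, 1)], 0) for _ in range(100)]
data += [([rng.gauss(3, 1), rng.gauss(3, 1)], 1) for _ in range(10)]
classifiers = train_base(data, t=3, learners=[train_centroid, train_stump], rng=rng)
best = evolve(classifiers, data, rng)
print(len(classifiers), best, f_measure(best, classifiers, data))
```

With t = 3 and c = 2 the sketch builds six base classifiers, and mixing the two learner types gives a heterogeneous ensemble in the abstract's sense. For brevity the fitness here is measured on the training data itself; a real evaluation would use held-out data.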