Less is More: Feature Choosing under Privacy-Preservation for Efficient Web Spam Detection

Yan Zhu
DOI: https://doi.org/10.1007/978-3-030-86475-0_1
2021-01-01
Abstract: Research on Web spam detection is in full swing. However, very high feature dimensionality and the leakage of sensitive information restrict the mining. This paper proposes a cascade feature selection method for spam mining, based on a Privacy Preservation (PP) method and a Genetic Algorithm (GA). Two criteria, privacy protection degree and maximum classification reliability, are used to pick representative features that form an optimal minimum feature subset. Discretization, data balancing, feature selection, and ensemble learning are integrated to detect Web spam. The approach not only greatly reduces the data dimension but also protects the sensitive features from detection. Good spam detection performance is achieved using only 22 features.
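The abstract gives no implementation details, so the following is only a minimal sketch of how a genetic algorithm might search for a small feature subset while avoiding sensitive features. Everything in it is an assumption made for illustration: the names `fitness` and `ga_select`, the per-feature `relevance` scores standing in for classification reliability, the boolean `sensitive` flags standing in for the privacy protection degree, and the penalty weights. It also folds both criteria into a single fitness function rather than reproducing the paper's cascade.

```python
import random


def fitness(mask, relevance, sensitive, size_penalty=0.05, privacy_penalty=1.0):
    """Toy score (an assumption, not the paper's criteria): reward relevant
    features, penalise every selected feature slightly, and penalise
    sensitive features heavily."""
    score = 0.0
    for bit, rel, sens in zip(mask, relevance, sensitive):
        if bit:
            score += rel - size_penalty
            if sens:
                score -= privacy_penalty
    return score


def ga_select(relevance, sensitive, pop_size=30, generations=60,
              mutation_rate=0.05, seed=0):
    """Return the sorted indices of the best feature subset found."""
    rng = random.Random(seed)  # local RNG so runs are reproducible
    n = len(relevance)         # needs at least two features for crossover

    def score(mask):
        return fitness(mask, relevance, sensitive)

    # Each individual is a bitmask: 1 means the feature is selected.
    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]

    for _ in range(generations):
        # Truncation selection: the better half survives unchanged.
        population.sort(key=score, reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Bit-flip mutation with a small per-bit probability.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        population = parents + children

    best = max(population, key=score)
    return [i for i, bit in enumerate(best) if bit]


if __name__ == "__main__":
    # Hypothetical scores and flags for eight features.
    relevance = [0.9, 0.8, 0.1, 0.7, 0.02, 0.6, 0.95, 0.3]
    sensitive = [False, False, False, True, False, False, True, False]
    print(ga_select(relevance, sensitive))
```

In a real pipeline the `relevance` term would be replaced by a measured quantity, such as the cross-validated performance of a classifier trained on the selected features, and the privacy term by whatever protection measure is adopted.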