Hierarchical Contaminated Web Page Classification Based on Meta Tag Denoising Disposal

Xiang Song, Yi Zhu, Xuemei Zeng, Xingshu Chen
DOI: https://doi.org/10.1155/2021/2470897
IF: 1.968
2021-01-01
Security and Communication Networks
Abstract: Web page classification is critical for information retrieval. Most web page classification methods have two faults: (1) they must analyze the entire web page, and (2) they pay too little attention to the noise inside the page. Both faults reduce efficiency and classification performance, especially when classifying contaminated web pages. To solve these problems, this paper proposes a denoising disposal algorithm. We choose the top-down method for hierarchical classification to improve prediction efficiency. The experimental results demonstrate that our method is about 7 times faster than the full-page method and achieves good classification results in most categories. The precision of all 7 parent categories is above 88% and is, on average, 24% higher than that of the other meta tag-based method.
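The general idea of classifying from meta tags with a top-down hierarchy can be illustrated with a minimal Python sketch. This is not the authors' denoising disposal algorithm, which the abstract does not specify. The taxonomy, the keyword lists, and the function names (`extract_meta_text`, `best_label`, `classify_top_down`) are all hypothetical, and simple keyword overlap stands in for whatever trained classifiers the paper uses.

```python
from html.parser import HTMLParser


class MetaTagExtractor(HTMLParser):
    """Collect <title> text and the content of description/keywords meta tags."""

    WANTED = {"description", "keywords"}

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in self.WANTED and attrs.get("content"):
                self.fields[name] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.fields["title"] = self.fields.get("title", "") + data


def extract_meta_text(html):
    """Return lowercased title/meta text only; the page body is ignored."""
    parser = MetaTagExtractor()
    parser.feed(html)
    return " ".join(parser.fields.values()).lower()


# Hypothetical two-level taxonomy with toy keyword lists (assumption, not from the paper).
PARENTS = {
    "sports": {"match", "league", "score"},
    "finance": {"stock", "bank", "loan"},
}
CHILDREN = {
    "sports": {"football": {"goal", "league"}, "tennis": {"serve", "racket"}},
    "finance": {"banking": {"bank", "loan"}, "investing": {"stock", "fund"}},
}


def best_label(tokens, candidates):
    """Pick the label whose keyword set overlaps the tokens the most."""
    return max(candidates, key=lambda label: len(tokens & candidates[label]))


def classify_top_down(html):
    """Choose a parent category first, then only among that parent's children."""
    tokens = set(extract_meta_text(html).replace(",", " ").split())
    parent = best_label(tokens, PARENTS)
    child = best_label(tokens, CHILDREN[parent])
    return parent, child


page = (
    "<html><head><title>Premier League scores</title>"
    '<meta name="keywords" content="football, goal, league">'
    '<meta name="description" content="Latest match results">'
    "</head><body><p>Buy stock bank loan fund now</p></body></html>"
)
print(classify_top_down(page))
```

In this toy page the body contains unrelated finance terms, a stand-in for contaminating content. Because only the title and meta tags are read, those terms never reach the classifier, and the second step compares only the children of the chosen parent rather than every leaf category.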