Statistical feature extraction for cross-language web content quality assessment.

Guanggang Geng, Xiaodong Li, Li-Ming Wang, Wei Wang, Shuo Shen
DOI: https://doi.org/10.1145/2009916.2010083
2011-01-01
Abstract: Web content quality assessment is a typical static ranking problem. Statistical systems built on heuristic content features and TF-IDF features have proven effective for this task, but these features are language dependent and therefore unsuitable for cross-language ranking. In this paper, we fuse a series of language-independent features, including hostname features, domain registration features, two-layer hyperlink analysis features, and third-party Web service features, to assess Web content quality. Experiments on the ECML/PKDD 2010 Discovery Challenge cross-language datasets show that the assessment is effective.
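To make the idea of a language-independent hostname feature concrete, the following is a minimal Python sketch. The abstract does not list the paper's actual hostname features, so the function name `hostname_features` and every statistic it computes (length, label count, digit count, hyphen count, top-level domain) are illustrative assumptions, not the authors' feature set. The point is only that such features depend on the URL alone and never on the language of the page text.

```python
from urllib.parse import urlsplit


def hostname_features(url: str) -> dict:
    """Hypothetical hostname statistics; not the paper's actual feature set."""
    # urlsplit only populates .hostname when the URL contains "//",
    # so prepend it for bare hostnames such as "example.com".
    host = urlsplit(url if "//" in url else "//" + url).hostname or ""
    labels = host.split(".") if host else []
    return {
        "length": len(host),                             # characters in the hostname
        "num_labels": len(labels),                       # dot-separated parts
        "num_digits": sum(ch.isdigit() for ch in host),  # digit characters
        "num_hyphens": host.count("-"),                  # hyphen characters
        "tld": labels[-1] if labels else "",             # last label, e.g. "org"
    }


if __name__ == "__main__":
    print(hostname_features("http://www.example-site2.org/page?id=1"))
```

In a pipeline of the kind the abstract describes, a vector like this would be concatenated with the other feature groups (domain registration, hyperlink analysis, third-party services) before being passed to a ranking or classification model; that fusion step is not shown here.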