Web Document Classification Techniques

Sun Jiantao, Shen Dou, Lu Yuchang, Shi Chunyi
DOI: https://doi.org/10.3321/j.issn:1000-0054.2004.01.017
2004-01-01
Abstract: Web document classification uses machine learning techniques to assign labels to web documents. A review of text classification techniques showed that the main difficulties in web document classification lie in the page representation methods and the classification algorithms. Techniques that go beyond plain text categorization are needed. Probabilistic algorithms and relational learning methods are both time-consuming. Support vector machine (SVM) classifiers are quite accurate, but automatic kernel selection and large-scale training remain key problems. Various measures were investigated to compare algorithm performance on sample datasets.