Abstract: Web pages can be classified precisely by evaluating their features, and the structural features of web pages are an effective complement to their textual features. Classifiers differ in their characteristics, so multiple classifiers can be combined to complement one another. This study proposes a web page classification method based on heterogeneous features and a combination of multiple classifiers. Rather than counting the frequency of HTML tags, we exploit the tree-like structure of HTML tags to characterize the structural features of a web page. Heterogeneous textual features and the proposed tree-like structural features are converted into vectors and fused. Confidence, computed as the classification accuracy on a set of samples, is proposed as a criterion for comparing the classification results of different classifiers. Multiple classifiers are combined based on confidence under different decision strategies, such as voting, confidence comparison, and direct output, to give the final classification result. Experimental results show that accuracy increases to 94.2%, 95.4%, and 95.7% on the Amazon, 7-web-genres, and DMOZ datasets, respectively. Fusing the textual features with the proposed structural features gives higher accuracy than using textual features alone. Combining multiple classifiers improves accuracy further, to a level higher than that of related web page classification algorithms.
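The confidence-based combination described in the abstract can be illustrated with a minimal Python sketch. The abstract names the three decision strategies but does not say when each applies, so the rules below are assumptions made for illustration only: direct output when all classifiers agree, voting when a strict majority agrees, and confidence comparison otherwise. The function names `confidence` and `combine` are hypothetical and are not taken from the paper.

```python
from collections import Counter


def confidence(predictions, labels):
    """Confidence of one classifier: its accuracy on a set of labeled samples."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal in length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def combine(votes, confidences):
    """Combine one predicted label per classifier into a final label.

    Assumed decision rules (not specified in the abstract):
      - "direct":     every classifier predicts the same label
      - "voting":     a strict majority of classifiers agree
      - "confidence": otherwise, trust the classifier with the highest confidence
    Returns (label, strategy_used).
    """
    if not votes or len(votes) != len(confidences):
        raise ValueError("votes and confidences must be non-empty and equal in length")
    top_label, top_count = Counter(votes).most_common(1)[0]
    if top_count == len(votes):
        return top_label, "direct"
    if top_count > len(votes) / 2:
        return top_label, "voting"
    best = max(range(len(votes)), key=lambda i: confidences[i])
    return votes[best], "confidence"


if __name__ == "__main__":
    # Hypothetical confidences for three classifiers, measured on held-out samples.
    conf = [0.6, 0.9, 0.7]
    print(combine(["news", "news", "news"], conf))
    print(combine(["news", "news", "shop"], conf))
    print(combine(["news", "shop", "blog"], conf))
```

In this sketch the confidence values would be estimated once on a held-out sample set and then reused for every page to be classified; the paper may compute or apply them differently.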

Study on Meaningful String Extraction Algorithm for Improving Webpage Classification

Web Page Classification Based on Heterogeneous Features and a Combination of Multiple Classifiers.

Exploiting Multi-Category Characteristics and Unified Framework to Extract Web Content

DEEP WEB DATA SOURCES CLASSIFICATION BASED ON TEXT VSM OF QUERY INTERFACE

Improving short text classification using public search engines

Sentiment Classification for Chinese Reviews: a Comparison Between SVM and Semantic Approaches

Domain-specific website recognition using hybrid vector space model

Combining Topic Models and String Kernel for Deep Web Categorization

Web Page Classification Based on Uncorrelated Semi-Supervised Intra-View and Inter-View Manifold Discriminant Feature Extraction

Keyword extraction using support vector machine

Semantic Term "Blurring" and Stochastic "Barcoding" for Improved Unsupervised Text Classification

PCCS: A FAST CLUSTERING AND CLASSIFICATION METHOD FOR WEB DOCUMENT

Learning to Cluster Web Search Results.

Combining classification with clustering for web person disambiguation.

An Efficient Information Extraction Mechanism with Page Ranking and a Classification Strategy based on Similarity Learning of Web Text Documents

Web Page Classification Based on SVM

A Method to Enhance Web Service Clustering by Integrating Label-Enhanced Functional Semantics and Service Collaboration

A Phrase-Based Method For Hierarchical Clustering Of Web Snippets

Hierarchically Classifying Chinese Web Documents Without Dictionary Support And Segmentation Procedure

Semi-automatic Algorithem Based on Web Service Classification

An Efficient Centroid Based Chinese Web Page Classifier