Abstract: Web pages can be classified precisely by evaluating their features, and the structural features of a page effectively complement its textual features. Different classifiers have different characteristics, so combining several lets them compensate for one another's weaknesses. This study proposes a web page classification method based on heterogeneous features and a combination of multiple classifiers. Rather than counting the frequency of HTML tags, we exploit the tree-like structure of the HTML tags to characterize the structural features of a web page. The heterogeneous textual features and the proposed tree-like structural features are converted into vectors and fused. Confidence, calculated as a classifier's classification accuracy on a set of samples, is proposed as a criterion for comparing the classification results of different classifiers. Multiple classifiers are then combined on the basis of confidence using different decision strategies, such as voting, confidence comparison, and direct output, to give the final classification result. Experimental results show that accuracy increases to 94.2%, 95.4%, and 95.7% on the Amazon dataset, the 7-web-genres dataset, and the DMOZ dataset, respectively. Fusing the textual features with the proposed structural features gives a more comprehensive representation and higher accuracy than using textual features alone. Combining multiple classifiers further improves accuracy, which is higher than that of the related web page classification algorithms.
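The confidence-based combination described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the three decision strategies are ordered or triggered, so the ordering below (direct output first, then voting, then confidence comparison on ties), the `direct_threshold` parameter, and the function names are assumptions made for illustration only, not the authors' algorithm.

```python
from collections import Counter


def confidence(predictions, labels):
    """Confidence of one classifier: its accuracy on a set of labelled samples."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def combine(votes, confidences, direct_threshold=0.95):
    """Combine the labels predicted for one page by several classifiers.

    votes:        dict mapping classifier name -> predicted label
    confidences:  dict mapping classifier name -> confidence in [0, 1]
    direct_threshold is a hypothetical cut-off, not taken from the paper.
    """
    best = max(confidences, key=confidences.get)

    # Direct output (assumed rule): trust a sufficiently confident classifier.
    if confidences[best] >= direct_threshold:
        return votes[best]

    # Voting: return the label with a strict majority over the runner-up.
    counts = Counter(votes.values()).most_common()
    if len(counts) == 1 or counts[0][1] > counts[1][1]:
        return counts[0][0]

    # Confidence comparison: on a tied vote, follow the most confident classifier.
    return votes[best]
```

For example, with three hypothetical classifiers named `svm`, `nb`, and `knn`, `combine({"svm": "shop", "nb": "news", "knn": "news"}, {"svm": 0.90, "nb": 0.80, "knn": 0.70})` falls through to voting because no confidence reaches the assumed threshold.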

Categorizing Web Information on Subject with Statistical Language Modeling

Web Page Classification Based on Heterogeneous Features and a Combination of Multiple Classifiers

Research on Automatic Text Classification Based on a Hybrid Language Model

Application Of The Character-Level Statistical Method In Text Categorization

Experimental Study On Representing Units In Chinese Text Categorization

Chinese Documents Categorization Based on N-gram Information

Hierarchically Classifying Chinese Web Documents Without Dictionary Support And Segmentation Procedure

An experimental study on large-scale web categorization

Hierarchical Classification of Chinese Documents Based on N-grams

Specific Website Subject Recognition Based on the Hybrid Vector Space Model

A Comparative Study on Semantic Orientation Classification of Chinese Text

Chinese Natural Language Processing: From Text Categorization to Machine Translation

Block-based language modeling approach towards web search

A High Performance Two-Class Chinese Text Categorization Method

A Hybrid Language Model Based on Statistics and Linguistic Rules

An Improved Text Categorization Algorithm Based on VSM

Toward a Unified Approach to Statistical Language Modeling for Chinese

Combining Topic Models and String Kernel for Deep Web Categorization

Study and System Implementation of Chinese Web-page Classification

Chinese Text Categorization Based On The Binary Weighting Model With Non-Binary Smoothing