Abstract: Precise web page classification depends on evaluating the features of web pages, and the structural features of a page effectively complement its textual features. Classifiers differ in their characteristics, so several can be combined to complement one another. This study proposes a web page classification method based on heterogeneous features and a combination of multiple classifiers. Rather than computing the frequency of HTML tags, we exploit the tree-like structure of HTML tags to characterize the structural features of a web page. Heterogeneous textual features and the proposed tree-like structural features are converted into vectors and fused. Confidence, calculated as the classification accuracy on a set of samples, is proposed as a criterion for comparing the classification results of different classifiers. Multiple classifiers are then combined based on confidence, using different decision strategies (voting, confidence comparison, and direct output) to give the final classification result. Experimental results show that accuracy increases to 94.2%, 95.4%, and 95.7% on the Amazon, 7-web-genres, and DMOZ datasets, respectively. Fusing the textual features with the proposed structural features yields higher accuracy than using textual features alone. Combining multiple classifiers further improves accuracy, which exceeds that of the related web page classification algorithms.
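
The confidence-based combination described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say when each decision strategy applies, so the rule used here is an assumption: output directly when all classifiers agree, fall back to voting when a strict majority exists, and otherwise compare confidences. The function names `confidence` and `combine` are hypothetical and are not taken from the paper.

```python
from collections import Counter


def confidence(predictions, truths):
    """Confidence of one classifier: its accuracy on a set of labeled samples."""
    if not truths or len(predictions) != len(truths):
        raise ValueError("predictions and truths must be non-empty and equal in length")
    correct = sum(p == t for p, t in zip(predictions, truths))
    return correct / len(truths)


def combine(predictions, confidences):
    """Combine one prediction per classifier into a final label.

    Returns (label, strategy). The order of the strategies is an
    assumption made for illustration, not the paper's stated rule.
    """
    if not predictions or len(predictions) != len(confidences):
        raise ValueError("need one confidence value per prediction")
    label, votes = Counter(predictions).most_common(1)[0]
    if votes == len(predictions):
        # Direct output: every classifier agrees.
        return label, "direct"
    if votes > len(predictions) / 2:
        # Voting: a strict majority of classifiers agrees.
        return label, "voting"
    # Confidence comparison: trust the classifier with the highest confidence.
    best = max(range(len(predictions)), key=lambda i: confidences[i])
    return predictions[best], "confidence"
```

For example, with three classifiers predicting `"shop"`, `"blog"`, and `"news"` and confidences 0.6, 0.9, and 0.7, no majority exists, so this sketch returns the second classifier's label.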

Site abstraction for rare category classification in large-scale web directory.

Exploiting Multi-Category Characteristics and Unified Framework to Extract Web Content

Web Page Classification Based on Heterogeneous Features and a Combination of Multiple Classifiers.

Extracting Web Content by Exploiting Multi-Category Characteristics

Analysis and Implementation of Extraction Algorithm of Web Hierarchy Structure

Cross-Domain Learning Based Traditional Chinese Medicine Medical Record Classification.

An experimental study on large-scale web categorization.

An Editor Labeling Model for Training Set Expansion in Web Categorization

From Web Directories to Ontologies: Natural Language Processing Challenges

An Integrated System for Building Enterprise Taxonomies

A Web Site Mining Algorithm Using the Multiscale Tree Representation Model

Leveraging World Knowledge in Chinese Text Classification

Webly-Supervised Fine-Grained Visual Categorization Via Deep Domain Adaptation.

New Automatic Categorization Algorithm for Chinese Homepages

ABCF: an Adaptive Balanced Multimodal Website Classification Framework

An approach for webpage classification based on kinship-relationship knowledge network

KACTL: knowware based automated construction of a treelike library from web documents

Two-phase Web Site Classification Based on Hidden Markov Tree Models.

Measuring Similarity Of Chinese Web Databases Based On Category Hierarchy

Exploiting Textual and Visual Features for Image Categorization

Research on Methods of Parsing and Classification of Internet Super Large-scale Texts