Web Information Extraction Based on Probabilistic Model

Jing WANG, Zhi-Jing LIU
DOI: https://doi.org/10.3969/j.issn.1003-6059.2010.06.016
2010-01-01
Abstract: A model named tree-structured hierarchical conditional random fields (TH-CRFs) is proposed, based on the structural and content features of web pages. First, a multi-feature vector space model is proposed to represent web pages in terms of both page structure and content. Second, a Boolean model and multiple rules are introduced to encode the features, giving a better representation of web objects. Third, an optimized web object information extraction based on TH-CRFs is performed to extract recruitment knowledge while improving training efficiency. Finally, the proposed model is compared with existing approaches to web object information extraction. The experimental results show that TH-CRFs significantly improve the accuracy of web object information extraction and reduce the time complexity.
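As a rough illustration of the feature-representation step only (not the TH-CRFs model itself), the sketch below encodes each node of a page tree as a Boolean vector that mixes structural and content rules. The `Node` class, the rule names, and the example page are all hypothetical and are not taken from the paper; they only show what a Boolean, rule-based, multi-feature representation might look like.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """A minimal stand-in for a DOM node (hypothetical, not from the paper)."""
    tag: str
    text: str = ""
    children: list["Node"] = field(default_factory=list)


# Hypothetical Boolean rules: the first three look at page structure,
# the last two look at content.
RULES = {
    "is_heading": lambda n, depth: n.tag in {"h1", "h2", "h3"},
    "is_table_cell": lambda n, depth: n.tag == "td",
    "is_shallow": lambda n, depth: depth <= 2,
    "has_digits": lambda n, depth: bool(re.search(r"\d", n.text)),
    "mentions_salary": lambda n, depth: "salary" in n.text.lower(),
}


def boolean_features(node, depth):
    """Evaluate every rule on one node and return a 0/1 vector."""
    return [int(rule(node, depth)) for rule in RULES.values()]


def featurize(node, depth=0):
    """Walk the tree in pre-order and collect (tag, feature vector) pairs."""
    rows = [(node.tag, boolean_features(node, depth))]
    for child in node.children:
        rows.extend(featurize(child, depth + 1))
    return rows


# A tiny made-up recruitment page fragment.
page = Node("div", children=[
    Node("h2", "Software Engineer"),
    Node("table", children=[Node("td", "Salary: 8000")]),
])

rows = featurize(page)
for tag, vector in rows:
    print(tag, vector)
```

In a sequence or tree labeling model such as a CRF, vectors like these would serve as the per-node observations; how the paper combines them with the hierarchical structure is not specified in the abstract.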