Web Information Extraction Based on Clustering GHMM

Yongxin Liu, Zhijing Liu
DOI: https://doi.org/10.1109/ISCID.2008.189
2008-01-01
Abstract: Web pages from different network sources differ in form and style, so it is difficult to obtain an optimal model by learning from a mixed set of training pages. To improve the accuracy of information extraction, a new approach based on a clustering generalized hidden Markov model (GHMM) is proposed. In this approach, a clustering algorithm is applied to web information extraction: the training pages are partitioned into a number of clusters using the simple agglomerative hierarchical K-Means clustering (SAHKC) algorithm, and a generalized hidden Markov model is trained on each cluster. Experimental results show that the new approach effectively improves extraction performance.
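The cluster-then-train pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the details of SAHKC or of GHMM training, so everything below is an assumption: pages are taken to be already converted to numeric feature vectors, agglomerative merging of nearest centroids is used only to seed a standard K-Means refinement, and `train_model` is a hypothetical callable standing in for GHMM training on one cluster.

```python
from math import dist


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def agglomerate(points, k):
    """Merge the two clusters with the closest centroids until k remain.

    Returns the k centroids, used here as K-Means seeds (an assumed
    reading of "agglomerative hierarchical K-Means").
    """
    clusters = [[p] for p in points]
    while len(clusters) > k:
        _, i, j = min(
            (dist(centroid(a), centroid(b)), i, j)
            for i, a in enumerate(clusters)
            for j, b in enumerate(clusters)
            if i < j
        )
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return [centroid(c) for c in clusters]


def assign(point, centers):
    """Index of the center nearest to point."""
    return min(range(len(centers)), key=lambda c: dist(point, centers[c]))


def kmeans(points, centers, iters=20):
    """Standard K-Means refinement starting from the given centers."""
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            groups[assign(p, centers)].append(p)
        # Keep the old center if a cluster ends up empty.
        updated = [centroid(g) if g else centers[i] for i, g in enumerate(groups)]
        if updated == centers:
            break
        centers = updated
    return centers


def train_per_cluster(pages, features, k, train_model):
    """Cluster pages by their features and train one model per cluster.

    train_model is a hypothetical stand-in for GHMM training.
    """
    centers = kmeans(features, agglomerate(features, k))
    buckets = {c: [] for c in range(k)}
    for page, f in zip(pages, features):
        buckets[assign(f, centers)].append(page)
    models = {c: train_model(b) for c, b in buckets.items() if b}
    return centers, models
```

Under these assumptions, a new page would be routed at extraction time with `assign(page_features, centers)` and decoded with the model trained on that cluster, so that each model only has to fit pages of similar form and style.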