Web content extraction method based on text feature value

Chuan MENG, Xiaonian WU
DOI: https://doi.org/10.3969/j.issn.1673-808X.2017.02.005
2017-01-01
Abstract: To address the poor generality and low accuracy of existing Web text extraction methods, a text extraction method based on text feature values is proposed. First, the source code of the Web page is preprocessed and converted into a DOM tree. The DOM tree is then traversed, and a text feature value is calculated for each node from the node's text length and punctuation weight; the standard deviation is used to eliminate as much noise as possible. A Gaussian function is then used to smooth the text feature values of the nodes, softening abrupt changes in those values and thereby reducing the loss of short text nodes. The experimental results show that the proposed method does not depend on specific tags, requires no training data, and offers good generality and high accuracy.
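The pipeline described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the actual formulas, so everything below is an assumption made for illustration: the scoring rule in `feature_value` (text length plus a weighted punctuation count), the `punct_weight`, `sigma` and `radius` parameters, and the use of the standard deviation of the smoothed values as the cut-off. DOM parsing is omitted; the input is assumed to be the text of each node in document order.

```python
import math
import statistics

# Assumed punctuation set (ASCII and full-width); not taken from the paper.
PUNCTUATION = set(",.;:!?，。；：！？")


def feature_value(text, punct_weight=5.0):
    """Hypothetical score: text length plus a weighted punctuation count."""
    text = text.strip()
    punct = sum(1 for ch in text if ch in PUNCTUATION)
    return len(text) + punct_weight * punct


def gaussian_smooth(values, sigma=1.0, radius=2):
    """Smooth a sequence with a Gaussian kernel, renormalized at the edges."""
    kernel = [math.exp(-(k * k) / (2 * sigma * sigma))
              for k in range(-radius, radius + 1)]
    smoothed = []
    for i in range(len(values)):
        total = 0.0
        weight = 0.0
        for k in range(-radius, radius + 1):
            j = i + k
            if 0 <= j < len(values):
                w = kernel[k + radius]
                total += w * values[j]
                weight += w
        smoothed.append(total / weight)
    return smoothed


def select_content(node_texts, sigma=1.0, radius=2):
    """Keep nodes whose smoothed score reaches the assumed threshold."""
    raw = [feature_value(t) for t in node_texts]
    smooth = gaussian_smooth(raw, sigma, radius)
    # Assumption: the population standard deviation serves as the cut-off.
    threshold = statistics.pstdev(smooth)
    return [t for t, s in zip(node_texts, smooth) if s >= threshold]
```

With this sketch, a short sentence that sits between two long paragraphs inherits part of its neighbors' scores and can survive the cut-off, which is the effect the abstract attributes to Gaussian smoothing. The same mechanism also lifts noise nodes directly adjacent to the content, so the choice of `sigma` and of the threshold matters in practice.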