Web Information Segmentation Method Based on DOM Structure Tree

ZHOU Jian,TANG Jin,LUO Bin
DOI: https://doi.org/10.3969/j.issn.1006-2475.2013.10.056
2013-01-01
Abstract: Correct extraction and segmentation of Web information is important for text information mining. This paper proposes and implements a method that extracts the informative content from a Web page while preserving the paragraph segmentation of the original text. The method first uses the page layout tags <table> and <div> to build a DOM structure tree. It then uses the nesting relations among the layout tags, as reflected in the DOM structure tree, to select the content blocks and extract the text correctly. Finally, it segments the body text by processing certain special tags. The experimental results show that the method is easy to implement and efficient, and that it automatically extracts the informative content and segments it accurately.
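
The three steps in the abstract (layout tree, content block selection, segmentation) can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several details are assumptions made for illustration only: the abstract does not name the "special tags", so <p> and <br> are assumed; it does not state the rule for choosing content blocks, so the block holding the most directly owned text is picked; and the names Block, LayoutTreeBuilder, and segment_main_text are hypothetical.

```python
from html.parser import HTMLParser

LAYOUT_TAGS = {"table", "div"}   # layout tags named in the abstract
BREAK_TAGS = {"p", "br"}         # ASSUMPTION: the unnamed "special tags"
SKIP_TAGS = {"script", "style"}  # ASSUMPTION: non-visible content is ignored


class Block:
    """One node of the layout tree: a <table> or <div>, or the page root."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.paragraphs = [[]]  # text fragments owned directly by this block

    def own_text_length(self):
        return sum(len(t) for para in self.paragraphs for t in para)


class LayoutTreeBuilder(HTMLParser):
    """Builds a tree from layout tags only; other tags do not create nodes."""

    def __init__(self):
        super().__init__()
        self.root = Block("root")
        self.current = self.root
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in LAYOUT_TAGS:
            child = Block(tag, self.current)
            self.current.children.append(child)
            self.current = child
        elif tag in BREAK_TAGS:
            self.current.paragraphs.append([])  # start a new segment

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in LAYOUT_TAGS:
            # Simplification: a mismatched closing tag is ignored.
            if self.current.tag == tag and self.current.parent is not None:
                self.current = self.current.parent
        elif tag == "p":
            self.current.paragraphs.append([])  # close the current segment

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and not self.skip_depth:
            self.current.paragraphs[-1].append(text)


def iter_blocks(block):
    """Depth-first walk over the layout tree."""
    yield block
    for child in block.children:
        yield from iter_blocks(child)


def segment_main_text(html):
    """Return the paragraphs of the block with the most directly owned text."""
    builder = LayoutTreeBuilder()
    builder.feed(html)
    builder.close()
    best = max(iter_blocks(builder.root), key=Block.own_text_length)
    return [" ".join(para) for para in best.paragraphs if para]
```

Calling segment_main_text on a page whose article sits in a nested <div>, next to short navigation and footer blocks, returns the article paragraphs as a list of strings, one per <p> or <br> boundary. Counting only directly owned text keeps an outer wrapper <div> from winning merely because it contains everything; a real system would likely combine this with other cues such as link density.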