Web Content Extraction Method Based on Logic Lines and Maximum Admitting Distances

张霞亮,陈家骏
DOI: https://doi.org/10.3778/j.issn.1002-8331.2009.25.038
2009-01-01
Computer Engineering and Applications Journal
Abstract: Content extraction from Web pages is a basic task underlying many Web applications and must be solved well. The mainstream methods are based on DOM trees and therefore need to parse out the DOM tree structure of each page. Because Web pages on the current Internet come from so many sources and their structures vary widely, DOM-tree-based methods can suffer from low extraction precision and poor performance. To address these problems, this paper proposes a new method for extracting the content of Web pages that does not rely on DOM trees. It applies heuristic rules derived from people's habits when writing Web pages, combined with relevant statistical regularities. The method takes logic lines as the basic processing units and uses maximum admitting distances to decide the final content of a page. Experiments show that the method extracts Web content quickly and accurately.
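The abstract does not define logic lines or the maximum admitting distance precisely, so the following Python sketch is only one plausible reading and not the authors' algorithm. It assumes that a logic line is the text between block-level tag boundaries, that the longest logic line is a reasonable seed for the main content, and that a neighbouring line is admitted when it is long enough and lies within `max_distance` logic lines of the last admitted one. The names `logic_lines`, `extract_content`, `min_len`, and `max_distance` are hypothetical.

```python
import re
from html import unescape

# Assumption: these tags delimit "logic lines"; the paper may use other rules.
BLOCK_BREAK = re.compile(
    r"</?(?:p|div|br|li|tr|td|h[1-6]|table|ul|ol)\b[^>]*>", re.I
)
DROP = re.compile(r"<(script|style)\b.*?</\1\s*>", re.I | re.S)
TAG = re.compile(r"<[^>]+>")


def logic_lines(html):
    """Split raw HTML into non-empty text lines without building a DOM tree."""
    html = DROP.sub("", html)  # discard script and style bodies
    lines = []
    for part in BLOCK_BREAK.split(html):
        text = unescape(TAG.sub("", part))  # strip remaining inline tags
        text = " ".join(text.split())       # normalise whitespace
        if text:
            lines.append(text)
    return lines


def extract_content(html, min_len=30, max_distance=2):
    """Grow a content region outward from the longest logic line.

    A line is admitted if it has at least min_len characters and is at most
    max_distance lines away from the last admitted line (both thresholds are
    illustrative assumptions, not values from the paper).
    """
    lines = logic_lines(html)
    if not lines:
        return ""
    seed = max(range(len(lines)), key=lambda i: len(lines[i]))
    if len(lines[seed]) < min_len:
        return ""
    keep = {seed}
    for step in (1, -1):  # scan downward, then upward
        last = seed
        i = seed + step
        while 0 <= i < len(lines) and abs(i - last) <= max_distance:
            if len(lines[i]) >= min_len:
                keep.add(i)
                last = i
            i += step
    return "\n".join(lines[i] for i in sorted(keep))
```

Under these assumptions, short navigation entries never qualify, and a long footer line is dropped when it lies more than `max_distance` lines from the article body. The text is scanned with regular expressions only, which reflects the abstract's point that no DOM tree needs to be parsed.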