Web Information Extraction Based on Repeated Pattern

GAO Qiang, ZHANG Jing-Zhi, GENG Hua, PAN Jin-Gui
DOI: https://doi.org/10.3969/j.issn.1002-137X.2007.04.057
2007-01-01
Computer Science
Abstract: In a data-rich, multiple-record Web page, the “useful and relevant” information items are usually arranged regularly and compactly, with similar patterns of HTML tags and a consistent presentation style. In other words, a semi-structured Web document often has structured features of its own. Based on this observation, this paper presents an automatic approach to extracting this kind of information. We use a suffix tree to find the repeated patterns in the HTML tags of the target page, and then apply a series of heuristic rules to choose the appropriate patterns. The information items are then extracted from the instances of these patterns. Experimental results indicate that our approach works effectively in most cases.
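
The idea of mining repeated tag patterns can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not specify the suffix tree construction or the heuristic rules, so this sketch swaps in brute-force enumeration of tag n-grams for the suffix tree and uses a single assumed heuristic (prefer the pattern whose non-overlapping instances cover the most tags). All names here (`TagCollector`, `tag_sequence`, `count_non_overlapping`, `best_repeated_pattern`, `SAMPLE_HTML`) are hypothetical.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records the sequence of start and end tags, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.tags.append(f"</{tag}>")


def tag_sequence(html):
    """Reduce an HTML string to its list of tags, in document order."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def count_non_overlapping(tags, pattern):
    """Count left-to-right, non-overlapping occurrences of pattern in tags."""
    n, i, count = len(pattern), 0, 0
    while i + n <= len(tags):
        if tuple(tags[i:i + n]) == pattern:
            count += 1
            i += n
        else:
            i += 1
    return count


def best_repeated_pattern(tags, min_len=2, min_count=2):
    """Return (pattern, count) for the repeated pattern covering the most tags.

    Brute-force stand-in for a suffix tree: every contiguous tag n-gram is a
    candidate. The coverage score (length * count) is an assumed heuristic,
    not one of the rules from the paper.
    """
    candidates = {
        tuple(tags[i:i + n])
        for n in range(min_len, len(tags) // min_count + 1)
        for i in range(len(tags) - n + 1)
    }
    scored = []
    for pattern in candidates:
        count = count_non_overlapping(tags, pattern)
        if count >= min_count:
            scored.append((len(pattern) * count, count, pattern))
    if not scored:
        return None
    _, count, pattern = max(scored)
    return pattern, count


# A toy multiple-record page: three list items with the same tag layout.
SAMPLE_HTML = (
    "<ul>"
    "<li><a>A</a><span>1</span></li>"
    "<li><a>B</a><span>2</span></li>"
    "<li><a>C</a><span>3</span></li>"
    "</ul>"
)

if __name__ == "__main__":
    print(best_repeated_pattern(tag_sequence(SAMPLE_HTML)))
```

On the toy page, the selected pattern is the six-tag record layout from `<li>` to `</li>`, found three times. The enumeration is cubic in the number of tags, which is one reason a suffix tree is the more practical choice on real pages.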