Ontology-Based Two-Phase Semi-Automatic Web Extracting

高军,王腾蛟,杨冬青,唐世渭
DOI: https://doi.org/10.3321/j.issn:0254-4164.2004.03.004
2004-01-01
Chinese Journal of Computers
Abstract: The massive amount of information on the Web has become an important information source. How to extract information from semi-structured or unstructured HTML pages has received much attention. However, web pages are originally intended to be browsed by humans, not processed automatically by applications, so it is difficult to design a precise web data wrapper with high applicability. Roughly, existing methods can be classified into interaction-based wrapper generation and automatic wrapper generation; the former lacks applicability, while the latter lacks extraction precision. This paper proposes a novel two-phase, semi-automatic method for precise web extraction. The method tries to reduce the interactive work in the wrapper generation process as much as possible while maintaining the precision of the extraction results. In addition, as the number of extracted web pages increases, the degree of automation in the process also improves. Compared with existing methods, the proposed method takes both the precision of query results and the applicability of the wrapper into account. The method has been validated in the authors' prototype, which has successfully extracted 1.2 million web pages.