The research and implementation of web information extraction technology based on multi-level pages

Hengyu Lai, Yifei Wei, Yali Wang, Mei Song, Xiaojun Wang
DOI: https://doi.org/10.1049/cp.2014.0701
2013-01-01
Abstract: With the development of the Internet, online information has become increasingly rich and complex, and how to extract target information from multi-level web pages and reconstruct it as structured data is worth investigating. This paper puts forward two methods of web information extraction. The first is a width-priority (breadth-first) analysis method based on regular expressions, which is more flexible and applicable to any data with a regular pattern. The second is a depth-priority (depth-first) analysis method based on the DOM tree, which is easier to implement and applicable to HTML-structured data. Both methods are implemented, and their performance is tested by extracting TV program information from the Yahoo website.
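As an illustration only, the two strategies described in the abstract might be sketched as follows. Nothing here is taken from the paper's implementation: the in-memory `PAGES` dictionary (standing in for fetched pages), the regular expressions, the `Node`/`TreeBuilder` helpers, and the function names are all hypothetical, and the markup is assumed to be well formed with no void elements and no link cycles beyond what the `seen` sets guard against.

```python
import re
from collections import deque
from html.parser import HTMLParser

# Hypothetical two-level site: an index page linking to per-channel listings.
PAGES = {
    "/index": '<ul><li><a href="/ch1">Channel One</a></li>'
              '<li><a href="/ch2">Channel Two</a></li></ul>',
    "/ch1": '<table><tr><td class="time">19:00</td><td class="title">News</td></tr>'
            '<tr><td class="time">20:00</td><td class="title">Drama</td></tr></table>',
    "/ch2": '<table><tr><td class="time">18:30</td><td class="title">Sports</td></tr></table>',
}

LINK_RE = re.compile(r'<a href="([^"]+)">([^<]+)</a>')
ROW_RE = re.compile(r'<td class="time">([^<]+)</td><td class="title">([^<]+)</td>')


def extract_breadth_first(start):
    """Regex-based sketch: visit every page of one level before the next."""
    records, seen = [], {start}
    queue = deque([(start, None)])
    while queue:
        url, channel = queue.popleft()
        html = PAGES[url]
        for when, title in ROW_RE.findall(html):
            records.append({"channel": channel, "time": when, "title": title})
        for href, text in LINK_RE.findall(html):
            if href in PAGES and href not in seen:
                seen.add(href)
                queue.append((href, text))
    return records


class Node:
    def __init__(self, tag, attrs):
        self.tag, self.attrs = tag, dict(attrs)
        self.children, self.text = [], ""


class TreeBuilder(HTMLParser):
    """Builds a minimal DOM-like tree; assumes every tag is properly closed."""

    def __init__(self):
        super().__init__()
        self.root = Node("root", [])
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        self.stack[-1].text += data


def find_all(node, tag):
    """Depth-first, pre-order search of the tree for elements named `tag`."""
    for child in node.children:
        if child.tag == tag:
            yield child
        yield from find_all(child, tag)


def extract_depth_first(url, channel=None, seen=None):
    """DOM-based sketch: follow each link to the bottom before the next link."""
    seen = {url} if seen is None else seen
    builder = TreeBuilder()
    builder.feed(PAGES[url])
    builder.close()
    records = []
    for row in find_all(builder.root, "tr"):
        cells = {c.attrs.get("class"): c.text for c in find_all(row, "td")}
        records.append({"channel": channel, "time": cells["time"], "title": cells["title"]})
    for link in find_all(builder.root, "a"):
        href = link.attrs.get("href")
        if href in PAGES and href not in seen:
            seen.add(href)
            records.extend(extract_depth_first(href, link.text, seen))
    return records
```

Calling `extract_breadth_first("/index")` or `extract_depth_first("/index")` on this toy site yields the same three structured records. The regex version depends only on textual patterns, which is why it would also work on non-HTML data, while the tree version relies on the markup structure instead of exact string layout.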