Adaptively Extracting Structured Data from Web Pages

Yingnan Guo, Jiajun Zhang, Xing Chen
DOI: https://doi.org/10.1109/ISPA-BDCloud-SustainCom-SocialCom48970.2019.00221
2019-01-01
Abstract: Web pages contain a large amount of valuable information and resources, but they may be updated at any time. Current Web data extraction algorithms, however, generally target a specific web page structure. When a page is updated, the resulting structural changes can cause extraction to fail or to return wrong information. To solve this problem, this paper proposes a new method that extracts the feature values of each area in the web page through page rendering and then combines them with the DOM tree structure of the page, semantic similarity, and other information, so that the target data can still be extracted correctly after the structure of the web page changes.
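The abstract gives no implementation details, so the following is only a minimal sketch of the general idea of combining several signals to re-locate a target area after a page changes. It is not the authors' algorithm. The `Region` fields, the weights, and every function name are hypothetical, the rendered features are reduced to a bounding-box size, and lexical string similarity from `difflib` stands in for the semantic similarity the paper mentions.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Region:
    """A rendered page area. All fields are illustrative assumptions."""
    dom_path: str   # tag path from the root, e.g. "html/body/div/span"
    text: str       # visible text of the area
    box: tuple      # rendered (x, y, width, height) in pixels


def visual_similarity(a: Region, b: Region) -> float:
    """Compare rendered width and height only; returns a value in [0, 1]."""
    _, _, aw, ah = a.box
    _, _, bw, bh = b.box
    if min(aw, ah, bw, bh) <= 0:
        return 0.0  # an unrendered or empty area carries no visual evidence
    width_ratio = min(aw, bw) / max(aw, bw)
    height_ratio = min(ah, bh) / max(ah, bh)
    return (width_ratio + height_ratio) / 2


def string_similarity(a: str, b: str) -> float:
    """Lexical similarity in [0, 1]; a stand-in for a real semantic measure."""
    return SequenceMatcher(None, a, b).ratio()


def score(old: Region, new: Region, weights=(0.4, 0.3, 0.3)) -> float:
    """Weighted sum of visual, DOM-path, and text similarity (weights assumed)."""
    w_vis, w_dom, w_txt = weights
    return (
        w_vis * visual_similarity(old, new)
        + w_dom * string_similarity(old.dom_path, new.dom_path)
        + w_txt * string_similarity(old.text, new.text)
    )


def relocate(old: Region, candidates: list) -> Region:
    """Pick the candidate on the updated page that best matches the old target."""
    return max(candidates, key=lambda c: score(old, c))
```

In this sketch, a target recorded on the old version of a page is matched against every area of the updated page, and the highest-scoring area is taken as the new location. Because no single signal decides the match, a changed DOM path alone does not cause the lookup to fail.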