Extraction Rule Language for Web Information Extraction and Integration

Wu Wei, Shengsheng Shi, Yulong Liu, Haitao Wang, Chunfeng Yuan, Yihua Huang
DOI: https://doi.org/10.1109/WISA.2013.21
2013-01-01
Abstract: The Web is the largest data source and contains a great deal of valuable information of interest to users and applications. However, how to automatically navigate web pages and extract useful data from them remains an important research problem. A number of studies have addressed this area, but most do not give enough consideration to the complete process of web information extraction, and they lack rules expressive enough to describe navigation, extraction, and integration. In this paper, we propose a new web information extraction rule language aimed at a general model for web information extraction and integration. We first introduce source data objects to extract different types of web data records. We then adopt XML to define the target data entity structure and use scripts to perform target data record integration. The results show that our extraction rule language provides a powerful and flexible way to describe extraction logic and achieves accurate extraction of web data records from complex web pages.