Template-Independent Web Object Extraction

Zaiqing Nie, Fei Wu, Ji-Rong Wen, Wei-Ying Ma
2006-01-01
Abstract: There are various kinds of objects embedded in static Web pages and online Web databases. Extracting and integrating these objects from the Web is of great significance for Web data management. The existing Web information extraction (IE) techniques cannot provide a satisfactory solution to the Web object extraction task, since objects of the same type are distributed across diverse Web sources whose structures are highly heterogeneous. The classic IE methods, which are designed for processing plain text documents, also fail to meet our requirements. In this paper, we propose a novel approach called Object-Level Information Extraction (OLIE) to extract Web objects. This approach extends a classic IE algorithm, Conditional Random Fields (CRF), by adding Web-specific information. It is essentially a combination of Web IE and classic IE. Specifically, visual information on the Web pages is used to select appropriate atomic elements for extraction and also to distinguish attributes, and structured information from external Web databases is applied to assist the extraction process. The experimental results show that OLIE can significantly improve Web object extraction accuracy.