Deep Neural Networks for Web Page Information Extraction

Tomas Gogar, Ondrej Hubacek, Jan Sedivy
DOI: https://doi.org/10.1007/978-3-319-44944-9_14
2016-01-01
Abstract: Web wrappers are systems for extracting structured information from web pages. Currently, a wrapper must be adapted to a particular website template before it can start extracting. In this work we present a new method that uses convolutional neural networks to learn a wrapper capable of extracting information from previously unseen templates. Such a wrapper therefore needs no site-specific initialization and can extract information from a single web page. We also propose a method for spatial text encoding, which allows us to encode both the visual and the textual content of a web page into a single neural net. The first experiments with product information extraction showed very promising results and suggest that this approach can lead to a general site-independent web wrapper.