Extracting Content from Web Pages Using the Sliding Window

Liu Yang, Chunping Li, Ming Gu
DOI: https://doi.org/10.1109/csa.2009.5404289
2009-01-01
Abstract: Content extraction is an important technology for accessing and processing web information. In this paper, we propose a content extraction algorithm based on a sliding window, which relies on a statistical heuristic. Experiments show that our algorithm is capable of extracting most of the main content from web pages. Because the heuristic is simple and effective, the sliding-window-based algorithm is applicable to most kinds of web pages.