Extracting Content for News Web Pages Based on DOM

Hua Geng, Qiang Gao, Jingui Pan
2007-01-01
Abstract: RSS has become a popular feature of Web applications, and many well-known Web sites now offer RSS feeds to their users. However, creating RSS files by hand is tedious, and most sites still do not provide feeds. This paper describes the design, implementation, and evaluation of HTML2RSS, a system that extracts content from HTML Web pages based on their DOM structure and automatically generates RSS files from the extracted content. We introduce two algorithms for extracting information from semi-structured Web data. The goal of HTML2RSS is to provide users with RSS files as a substitute for the HTML pages.
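The abstract does not spell out the two extraction algorithms, so the following is only a minimal sketch of the general idea: build a DOM-like tree from an HTML page, pick a content node, and wrap the result in an RSS 2.0 file. The heuristic used here (choose the element whose direct `<p>` children hold the most text) is an assumption made for illustration and is not claimed to be the method used in HTML2RSS. The names `DomBuilder`, `extract`, and `to_rss` are hypothetical, and the sketch assumes reasonably well-formed HTML with explicit closing tags.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}  # elements with no end tag
SKIP_TAGS = {"script", "style"}                            # never treated as content


class Node:
    """A DOM-like element; children holds Node objects and text strings in order."""
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class DomBuilder(HTMLParser):
    """Builds a simple element tree from the parser's start/end/data events."""
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data.strip())


def text_of(node):
    """Concatenate all text under a node, skipping script and style content."""
    if node.tag in SKIP_TAGS:
        return ""
    parts = []
    for child in node.children:
        piece = child if isinstance(child, str) else text_of(child)
        if piece:
            parts.append(piece)
    return " ".join(parts)


def walk(node):
    yield node
    for child in node.children:
        if isinstance(child, Node):
            yield from walk(child)


def paragraphs_of(node):
    return [text_of(c) for c in node.children if isinstance(c, Node) and c.tag == "p"]


def extract(html):
    """Return (title, body) using the assumed 'most paragraph text' heuristic."""
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    nodes = list(walk(builder.root))
    title = next((text_of(n) for n in nodes if n.tag == "title"), "")
    best = max(nodes, key=lambda n: sum(len(p) for p in paragraphs_of(n)))
    return title, "\n".join(p for p in paragraphs_of(best) if p)


def to_rss(title, link, description):
    """Wrap one extracted article in a single-item RSS 2.0 document."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = "Generated from " + link
    item = ET.SubElement(channel, "item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "link").text = link
    ET.SubElement(item, "description").text = description
    return ET.tostring(rss, encoding="unicode")
```

A caller would pass the page source to `extract` and hand the resulting title and body, together with the page URL, to `to_rss`. Navigation bars and footers tend to score low under this heuristic because they contain little paragraph text, but real news pages would likely need a more robust rule.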