An Efficient Valid Page Crawling Approach for Websites with Dynamic Scripts

Bing Xia, Jun Gao, Tengjiao Wang, Dongqing Yang
2009-01-01
Abstract: In the Web 2.0 era, more and more websites use dynamic scripts for user interaction. Transitions between pages are no longer all driven by hyperlink tags, and a URL no longer uniquely identifies a Web page. Traditional Web crawlers cannot handle pages that contain dynamic scripts; as a result, search engines such as Google skip these pages. Research on crawling websites with dynamic scripts is still at an early stage. This paper proposes an efficient approach to crawling valid pages on websites with dynamic scripts. First, a training phase identifies the events, together with the Web elements that trigger them, that lead users to the desired pages. Next, the approach generates XPath patterns for these elements and records the events that need to be triggered. During crawling, only these event and element combinations are considered, which accelerates the crawl. Extensive experimental evaluation demonstrates the efficiency and effectiveness of the approach.
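The train-then-crawl idea in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`generalize`, `learn_patterns`, `candidate_actions`), the training-log format, and the rule of generalizing an XPath by dropping positional indexes are all assumptions made for illustration. The sketch also uses only the limited XPath subset in Python's standard library on a static page, whereas a real crawler would fire the events in a browser engine.

```python
import re
import xml.etree.ElementTree as ET


def generalize(xpath):
    """Drop positional predicates such as [2] so one pattern covers sibling elements.

    Assumed generalization rule; the paper may derive patterns differently.
    """
    return re.sub(r"\[\d+\]", "", xpath)


def learn_patterns(training_log):
    """Build (XPath pattern, event) pairs from a hypothetical training log.

    training_log is an iterable of (xpath, event) pairs recorded while a
    person navigated to pages considered valid.
    """
    return {(generalize(xpath), event) for xpath, event in training_log}


def candidate_actions(page_root, patterns):
    """Return only the (element, event) pairs that match a learned pattern."""
    actions = []
    for pattern, event in sorted(patterns):
        for element in page_root.findall(pattern):
            actions.append((element, event))
    return actions


# Hypothetical page: three list items that open detail pages, plus a footer.
PAGE = """<html><body><div>
<ul>
<li><span id="a">Item A</span></li>
<li><span id="b">Item B</span></li>
<li><span id="c">Item C</span></li>
</ul>
<p id="footer">About</p>
</div></body></html>"""

# Hypothetical training log: the user clicked the first two items.
TRAINING_LOG = [
    ("./body/div/ul/li[1]/span", "click"),
    ("./body/div/ul/li[2]/span", "click"),
]

if __name__ == "__main__":
    patterns = learn_patterns(TRAINING_LOG)
    root = ET.fromstring(PAGE)
    for element, event in candidate_actions(root, patterns):
        print(event, element.get("id"))
```

In this sketch the two recorded clicks collapse into a single pattern that also covers the third list item, while the footer paragraph is never considered because no learned pattern matches it. Restricting the crawler to learned combinations in this way is what avoids triggering every possible event on every element.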