An Analysis of URLs Generated from JavaScript Code

Jingyu Zhou, Yu Ding
DOI: https://doi.org/10.1109/icis.2012.28
2012-01-01
Abstract: Search engines use a crawling system to recursively download web pages, analyze HTML pages, and generate a new list of URLs to crawl. As web pages become more dynamic, JavaScript is heavily used, which poses a great challenge for the crawling system because many URLs are now embedded in JavaScript code and are invisible to the crawler. Worse, there is no study on the usage patterns of these URLs, and the impact of JavaScript-generated URLs is unknown. We propose a browser emulation method to study the usage of URLs from JavaScript code. To find these URLs, we instrument a browser core to output all URLs inside a web page, including those generated from JavaScript. We then classify these URLs into a number of types and study the reasons web developers put them in JavaScript. We analyze top Internet sites and popular web pages. The results show that more than half of them contain URLs generated from JavaScript, which account for about 6-19% of all URLs. Among them, 26-41% refer to potentially important content that should be indexed by search engine crawlers, and about 26-35% are advertising URLs.
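
The measurement described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the instrumented browser core is not shown in the source, so its output is assumed here to be a plain list of URLs (`observed_urls`), and the names `StaticLinkParser` and `js_generated_urls` are hypothetical. The sketch collects the URLs visible in static markup, as an HTML-only crawler would, and treats any observed URL outside that set as JavaScript-generated.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class StaticLinkParser(HTMLParser):
    """Collects URLs visible in static markup (href/src attributes).

    Text inside <script> elements is handled as raw data, so URLs that
    only appear in JavaScript strings are not collected here.
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative references against the page URL.
                self.urls.add(urljoin(self.base_url, value))


def js_generated_urls(html, base_url, observed_urls):
    """Return (URLs not present in static markup, their share of all observed URLs).

    `observed_urls` is an assumption: it stands in for the URL list that an
    instrumented browser would emit after executing the page's JavaScript.
    """
    parser = StaticLinkParser(base_url)
    parser.feed(html)
    observed = {urljoin(base_url, url) for url in observed_urls}
    js_only = observed - parser.urls
    share = len(js_only) / len(observed) if observed else 0.0
    return js_only, share
```

Calling `js_generated_urls(page_html, page_url, urls_from_browser)` on one page yields the set of URLs a static crawler would miss and the fraction they represent, which is the per-page quantity behind percentages like those reported in the abstract.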