Abstract: Automatic discovery of isolated land cover web map services (LCWMSs) can potentially help in sharing land cover data. Various search engine-based and crawler-based approaches have been developed for finding services dispersed throughout the surface web. However, with the prevalence of geospatial web applications, a considerable number of LCWMSs are hidden in JavaScript code, which belongs to the deep web, and discovering LCWMSs from JavaScript code remains an open challenge. This paper addresses this challenge by proposing a focused deep web crawler for finding more LCWMSs from deep web JavaScript code and the surface web. First, the names of a group of JavaScript links are abstracted as initial judgements. Through name matching, these judgements are used to decide whether the fetched webpages contain predefined JavaScript links that may prompt JavaScript code to invoke WMSs. Second, some JavaScript invocation functions and URL formats for WMSs are summarized as JavaScript invocation rules from prior knowledge of how WMSs are employed and coded in JavaScript. These invocation rules are used to identify the JavaScript code from which candidate WMSs are extracted through rule matching. These two operations are incorporated into a traditional focused crawling strategy, situated between the tasks of fetching webpages and parsing webpages. Third, LCWMSs are selected by matching services with a set of land cover keywords. Moreover, a search engine for LCWMSs is implemented that uses the focused deep web crawler to retrieve and integrate the LCWMSs it discovers. In the first experiment, eight online geospatial web applications serve as seed URLs (Uniform Resource Locators) and crawling scopes; the proposed crawler addresses only the JavaScript code in these eight applications. All 32 available WMSs hidden in JavaScript code were found using the proposed crawler, while not one WMS was discovered through the focused crawler-based approach. This result shows that the proposed crawler is able to discover WMSs hidden in JavaScript code. The second experiment uses 4842 seed URLs updated daily. The crawler found a total of 17,874 available WMSs, of which 11,901 were LCWMSs. Our approach discovered a greater number of services than those found using previous approaches. This indicates that the proposed crawler has a large advantage in discovering LCWMSs from the surface web and from JavaScript code. Furthermore, a simple case study demonstrates that the designed LCWMS search engine represents an important step towards realizing land cover information integration for global mapping and monitoring purposes.
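The three matching steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the JavaScript link names, the single URL rule, the keyword list, and all function names below are hypothetical stand-ins chosen for illustration, and service titles are passed in as a plain dictionary instead of being fetched from each service.

```python
import re
from typing import Dict, List

# Assumed "initial judgements": name fragments of JavaScript links that
# suggest a page may invoke WMSs. These values are illustrative only.
JS_LINK_NAMES = ("openlayers", "ol.js", "leaflet", "arcgis")

# Assumed invocation rule: a quoted URL that looks like a WMS endpoint.
WMS_URL_RULE = re.compile(
    r"""["'](https?://[^"'\s]+?(?:service=wms|/wms)[^"'\s]*)["']""",
    re.IGNORECASE,
)

# Assumed land cover keywords used to select LCWMSs among candidate WMSs.
LAND_COVER_KEYWORDS = ("land cover", "landcover", "land use", "forest")

SCRIPT_SRC = re.compile(r"""<script[^>]+src=["']([^"']+)["']""", re.IGNORECASE)


def has_predefined_js_link(html: str) -> bool:
    """Step 1: name matching on the script links of a fetched webpage."""
    for src in SCRIPT_SRC.findall(html):
        if any(name in src.lower() for name in JS_LINK_NAMES):
            return True
    return False


def extract_candidate_wms(js_code: str) -> List[str]:
    """Step 2: rule matching on JavaScript code, keeping first-seen order."""
    found: List[str] = []
    for url in WMS_URL_RULE.findall(js_code):
        if url not in found:
            found.append(url)
    return found


def is_land_cover(service_text: str) -> bool:
    """Step 3: keyword matching on a service's descriptive text."""
    text = service_text.lower()
    return any(keyword in text for keyword in LAND_COVER_KEYWORDS)


def discover_lcwms(html: str, js_code: str, titles: Dict[str, str]) -> List[str]:
    """Chain the three steps; `titles` maps a WMS URL to its title text."""
    if not has_predefined_js_link(html):
        return []
    return [
        url
        for url in extract_candidate_wms(js_code)
        if is_land_cover(titles.get(url, ""))
    ]
```

In a real crawler, the titles would presumably come from each candidate service's capabilities document, and the rule set would cover many more invocation functions and URL formats than the single pattern assumed here.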

An Analysis of URLs Generated from JavaScript Code

Extracting URLs from JavaScript via program analysis

An Efficient Valid Page Crawling Approach for Websites with Dynamic Scripts

An analysis of the dynamic behavior of JavaScript programs

System to Identify and Elide Superfluous JavaScript Code for Faster Webpage Loads

The Ever-Changing Labyrinth: A Large-Scale Analysis of Wildcard DNS Powered Blackhat SEO

Automatically Crawling Dynamic Web Applications Via Proxy-Based JavaScript Injection and Runtime Analysis

Abusing Browser Address Bar for Fun and Profit - An Empirical Investigation of Add-On Cross Site Scripting Attacks

Looking for Criminal Intents in JavaScript Obfuscated Code

Malicious JavaScript Code Detection Based on Hybrid Analysis

Crawling web pages with application in online advertises monitoring system

Detection and analysis of malicious JavaScript code based on pre-filter

Discovering Land Cover Web Map Services from the Deep Web with JavaScript Invocation Rules

User browsing behavior-driven web crawling

Web Search Engine: Characteristics of User Behaviors and Their Implication

A Brief History of Web Crawlers

The Evolution of Link-Attributes for Pages and Its Implications on Web Crawling

Unveiling Web Fingerprinting in the Wild Via Code Mining and Machine Learning

Classifier-Guided Topical Crawler: A Novel Method of Automatically Labeling the Positive URLs

Statically Detecting JavaScript Obfuscation and Minification Techniques in the Wild