Abstract: In the era of big data, the vast majority of the data are not from the surface Web, the Web that is interconnected by hyperlinks and indexed by most general-purpose search engines. Instead, the trove of valuable data often resides in the deep Web, the Web that is hidden behind query interfaces. Since numerous applications, such as data integration and vertical portals, require deep Web data, various crawling methods have been developed for exhaustively harvesting a deep Web data source at minimal (or near-minimal) cost. Most existing crawling methods assume that all the documents matched by a query are returned. In practice, data sources often return only the top k matches. This makes exhaustive data harvesting difficult: highly ranked documents are returned multiple times, while low-ranked documents have little chance of being returned. In this paper, we decompose this problem into two orthogonal sub-problems, i.e., the query bias and ranking bias problems, and propose a document-frequency-based crawling method to overcome the ranking bias problem. The rationale of our method is to use queries whose document frequencies fall within a specified range, which avoids the combined effect of search ranking and the return limit and significantly reduces the difficulty of crawling ranked data sources. The method is extensively tested on a variety of datasets and compared with two existing methods. The experimental results demonstrate that our method outperforms the two existing methods by 58% and 90% on average, respectively.
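The rationale stated in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the toy corpus, the `rank` table, the function names, and the assumption that document frequencies are known exactly (rather than estimated) are all hypothetical choices made for illustration. The sketch simulates a source that returns only the top k matches and shows that queries whose document frequency does not exceed k are unaffected by ranking, because all of their matches fit within the return limit.

```python
from collections import defaultdict

def build_postings(docs):
    """Map each term to the set of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            postings[term].add(doc_id)
    return postings

def select_queries(postings, low, high):
    """Keep terms whose document frequency lies in [low, high]."""
    return sorted(t for t, ids in postings.items() if low <= len(ids) <= high)

def top_k_search(postings, rank, term, k):
    """Simulated ranked source: return only the k best-ranked matches."""
    matches = sorted(postings.get(term, ()), key=lambda d: rank[d])
    return matches[:k]

def crawl(postings, rank, queries, k):
    """Issue every query and collect the distinct documents returned."""
    harvested = set()
    for q in queries:
        harvested.update(top_k_search(postings, rank, q, k))
    return harvested

# Hypothetical toy corpus; a lower rank value means a higher-ranked document.
docs = {
    1: "deep web crawler",
    2: "deep web query",
    3: "deep ranking bias",
    4: "deep hidden data",
    5: "query interface data",
}
rank = {1: 0, 2: 1, 3: 2, 4: 3, 5: 4}
k = 2  # return limit of the simulated source

postings = build_postings(docs)
queries = select_queries(postings, low=1, high=k)  # excludes "deep" (df = 4)
harvested = crawl(postings, rank, queries, k)
```

In this toy setting, the high-frequency query `deep` matches four documents but only the two highest-ranked ones are ever returned, whereas the selected low-frequency queries together reach every document. How the frequency range is chosen, and how frequencies are obtained in practice, is left to the paper itself.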

On the Research and Design of Deep Web Crawler

SmartCrawler: A Two-Stage Crawler for Efficiently Harvesting Deep-Web Interfaces

Deep Web Sources Focused Crawling

Query Selection in Deep Web Crawling

Learning to Crawl Deep Web

Domain-Specific Deep Web Sources Discovery

Research and Design of Topical Crawl Module Based on Deep Web Search Technology

A New Architecture of an Intelligent Agent-Based Crawler for Domain-Specific Deep Web Databases

Crawling ranked deep Web data sources

Efficient Deep Web Crawling Using Reinforcement Learning

Advanced Deep Web Crawler Based on Dom

A survey of search technologies in Deep Web

Design of Web Crawler for Deep Web Based on ID3 Algorithm

DeepSearcher: A One-Time Searcher for Deep Web

A Deep Web Data Integration System for Book Searching Domain

An Approach to Deep Web Crawling by Sampling

Research and Realization of Intelligent Focused Web Crawler

Research on WatiJ-based Spider for Deep Web

High Performance Parallel Crawler

Learning Deep Web Crawling with Diverse Features

Research of a Traffic Advisory System Based on Deep Web