Abstract: A key problem in retrieving, integrating, and mining rich, high-quality information from the massive number of Deep Web Databases (WDBs) online is how to automatically and effectively discover and recognize the entry points of domain-specific WDBs, i.e., their forms, on the Web. This is challenging because the forms of domain-specific WDBs are dynamic and heterogeneous and are very sparsely distributed over several trillion Web pages. Although significant effort has gone into this problem and its special cases, existing solutions do not yet achieve a satisfactory harvest rate and coverage rate of domain-specific WDB forms simultaneously. This paper proposes an Enhanced Form-Focused Crawler for domain-specific WDBs (E-FFC), a novel framework that addresses the limitations of existing solutions. Based on a divide-and-conquer strategy, the E-FFC employs a series of novel strategies and algorithms, including a two-step page classifier, a link scoring strategy, classifiers for advanced searchable and domain-specific forms, and crawling stopping criteria, with the aim of optimizing the harvest rate and coverage rate of domain-specific WDB forms simultaneously. Experiments on a number of real Web pages in a set of representative domains show that the E-FFC outperforms existing domain-specific Deep Web form-focused crawlers in terms of harvest rate, coverage rate, and crawling robustness.
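
To make the interplay of link scoring, stopping criteria, and the two evaluation metrics concrete, the following is a minimal sketch, not the E-FFC implementation. It assumes a toy in-memory link graph (`LINKS`), a known set of form-bearing pages (`FORMS`), and precomputed link scores (`SCORE`); all of these names and the `patience` stopping rule are hypothetical. It also assumes the common definitions of harvest rate (forms found divided by pages visited) and coverage rate (forms found divided by all forms in the graph), which the abstract itself does not spell out.

```python
import heapq

def crawl(seed, links, forms, score, patience=3):
    """Best-first crawl over an in-memory link graph (illustrative only).

    links:    dict mapping a page to its outgoing pages (toy stand-in for the Web)
    forms:    set of pages assumed to contain a domain-specific form
    score:    dict mapping a page to a link score; higher is visited first
    patience: stop after this many consecutive visited pages without a form
    """
    frontier = [(-score.get(seed, 0.0), seed)]  # max-heap via negated score
    seen = {seed}
    found, visited, dry = [], 0, 0
    while frontier and dry < patience:
        _, page = heapq.heappop(frontier)
        visited += 1
        if page in forms:
            found.append(page)
            dry = 0            # a hit resets the stopping counter
        else:
            dry += 1
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-score.get(nxt, 0.0), nxt))
    harvest = len(found) / visited if visited else 0.0   # assumed definition
    coverage = len(found) / len(forms) if forms else 0.0  # assumed definition
    return found, harvest, coverage

# Hypothetical toy site: two pages carry forms, the rest are noise.
LINKS = {
    "home": ["search", "about", "blog"],
    "search": ["advanced"],
    "blog": ["post1", "post2"],
}
FORMS = {"search", "advanced"}
SCORE = {"home": 1.0, "search": 0.9, "advanced": 0.8,
         "about": 0.2, "blog": 0.1, "post1": 0.05, "post2": 0.05}

found, harvest, coverage = crawl("home", LINKS, FORMS, SCORE)
print(found, round(harvest, 3), coverage)
```

In this toy setting a smaller `patience` stops the crawl earlier, which raises the harvest rate without lowering coverage because the highly scored links are visited first. On a real crawl the same knob trades coverage for harvest rate, which is the tension the abstract describes.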

E-FFC: an enhanced form-focused crawler for domain-specific deep web databases