Abstract: A key problem in retrieving, integrating, and mining rich, high-quality information from the massive number of online Deep Web Databases (WDBs) is how to automatically and effectively discover and recognize the entry points of domain-specific WDBs, i.e., searchable forms, on the Web. This is a challenging task because the forms of domain-specific WDBs are dynamic and heterogeneous, and they are very sparsely distributed over several trillion Web pages. Although significant efforts have been made to address the problem and its special cases, more intelligent and effective solutions remain to be explored. This paper proposes a new architecture for an intelligent agent-based crawler (iCrawler) for domain-specific Deep Web databases that addresses the limitations of existing methods. The iCrawler is based on intelligent learning agents and a domain ontology, and it uses a series of novel strategies, including a two-step page classifier and a link scoring strategy, to improve on the performance of existing methods. Experiments with the iCrawler on a number of real Web pages in a set of representative domains show that it outperforms existing domain-specific Deep Web Form-Focused Crawlers (FFCs) in terms of harvest rate, coverage rate, and time performance.

Using Classifiers to Find Domain-Specific Online Databases Automatically

DEEP WEB DATA SOURCES CLASSIFICATION BASED ON TEXT VSM OF QUERY INTERFACE

Research on Deep Web Classification Based on Domain Feature Text

DeepSearcher: A One-Time Searcher for Deep Web

Automatic Classification of Deep Web Sources Based on Search Interface Schemas

Domain-Specific Deep Web Sources Discovery

Object-Extraction-Based Hidden Web Information Retrieval

Focused Deep Web Entrance Crawling by Form Feature Classification

Automatic Classification Of Deep Web Databases With Simple Query Interface

Identifying query interfaces of deep web entries automatically

Efficient Selection and Integration of Hidden Web Database

On the Research and Design of Deep Web Crawler

A New Architecture of an Intelligent Agent-Based Crawler for Domain-Specific Deep Web Databases

Research on Automatic Classification System to Build Domain-Specific Search Engines

Classification of Deep Web Databases Based on the Domain Sample Query

A Method to Automatically Discover and Classify Deep Web Data Source Using Multi-Classifier

Domain-independent Classification for Deep Web Interfaces

Understanding the Search Interfaces of the Deep Web Based on Domain Model

Design and Implementation of Domain based Semantic Hidden Web Crawler

Automatic Judgment of Deep Web Query Interfaces

Optimal Algorithms for Crawling a Hidden Database in the Web