Abstract: A key problem in retrieving, integrating and mining rich, high-quality information from the massive number of Deep Web Databases (WDBs) online is how to automatically and effectively discover and recognize the entry points of domain-specific WDBs, i.e., forms, on the Web. This is a challenging task because the forms of domain-specific WDBs are dynamic, heterogeneous, and very sparsely distributed over several trillion Web pages. Although significant efforts have been made to address the problem and its special cases, more effective solutions are still needed to achieve a satisfactory harvest rate and coverage rate of domain-specific WDB forms simultaneously. In this paper, an Enhanced Form-Focused Crawler for domain-specific WDBs (E-FFC) is proposed as a novel framework to address the limitations of existing solutions. Based on a divide-and-conquer strategy, the E-FFC employs a series of novel and effective strategies and algorithms, including a two-step page classifier, a link scoring strategy, classifiers for advanced searchable and domain-specific forms, and crawling stopping criteria, with the aim of optimizing the harvest rate and coverage rate of domain-specific WDB forms simultaneously. Experiments with the E-FFC were conducted on a number of real Web pages in a set of representative domains, and the results show that the E-FFC outperforms existing domain-specific Deep Web form-focused crawlers in terms of harvest rate, coverage rate and crawling robustness.
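
The interplay between link scoring, a stopping criterion, and the two reported metrics can be illustrated with a minimal sketch. This is not the E-FFC implementation: the toy graph, the keyword-based `score_link`, the `patience` stopping rule, and the metric definitions (harvest rate as relevant forms found per page visited, coverage rate as relevant forms found out of all relevant forms in the graph) are assumptions chosen for illustration, and the paper may define them differently.

```python
import heapq

# Hypothetical toy web graph: page -> outgoing links.
GRAPH = {
    "home": ["search/a", "about", "news"],
    "about": ["team"],
    "news": ["search/b"],
    "search/a": [],
    "search/b": [],
    "team": [],
}
# Pages assumed to contain a relevant domain-specific form.
FORMS = {"search/a", "search/b"}


def score_link(url):
    """Hypothetical link scorer: favor URLs that look like search pages."""
    return 1.0 if "search" in url else 0.0


def crawl(graph, forms, seeds, score, patience):
    """Best-first crawl that stops after `patience` consecutive pages
    without a newly found form (an assumed stopping criterion)."""
    frontier = [(-score(s), s) for s in seeds]  # max-priority via negation
    heapq.heapify(frontier)
    seen = set(seeds)
    visited, found, dry = [], set(), 0
    while frontier and dry < patience:
        _, page = heapq.heappop(frontier)
        visited.append(page)
        if page in forms:
            found.add(page)
            dry = 0
        else:
            dry += 1
        for link in graph.get(page, []):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score(link), link))
    return visited, found


def harvest_rate(visited, found):
    """Relevant forms found per page visited."""
    return len(found) / len(visited) if visited else 0.0


def coverage_rate(found, forms):
    """Share of all relevant forms that the crawl found."""
    return len(found) / len(forms) if forms else 0.0


for patience in (2, 3):
    visited, found = crawl(GRAPH, FORMS, ["home"], score_link, patience)
    print(patience, harvest_rate(visited, found), coverage_rate(found, FORMS))
```

On this toy graph, stopping earlier (`patience=2`) visits fewer pages but misses one form, while `patience=3` finds both forms at the cost of visiting every page, which is the kind of harvest/coverage trade-off the abstract refers to.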

A Vision-Based Approach for Deep Web Form Extraction

Effective Approach to Deep Web Entries Identification

Deep Web Data Extraction Based on Visual Information Processing

ViDE: A Vision-Based Approach for Deep Web Data Extraction

Vision-based Deep Web result schema automatic extraction

Vision-based Deep Web query interfaces automatic extraction

Attributes extraction of Deep Web query interface based on DOM

Vision-based Web Data Records Extraction

Web Form Entrance Detection and Automatic Form Filling

Advanced Deep Web Crawler Based on DOM

Combining Vision Information and Tag Information to Extract Deep Web Result Pages Content

Automatic Filling Forms of Deep Web Entries Based on Ontology

Focused Deep Web Entrance Crawling by Form Feature Classification

E-FFC: an enhanced form-focused crawler for domain-specific deep web databases

Ltde: A Layout Tree Based Approach For Deep Page Data Extraction

Understanding the Search Interfaces of the Deep Web Based on Domain Model

A Deep Web Query Interface Discovery Method

Web Page Content Extraction Based on Multi-feature Fusion

FreeDOM: A Transferable Neural Architecture for Structured Information Extraction on Web Documents

Research on Discovering Deep Web Entries

Deep Visual Template-Free Form Parsing