Abstract: In many fields, collecting topic-relevant Web resources is crucial. As a vertical search method, the focused crawler has received great attention in recent years. Currently, most focused crawlers consider multiple evaluation factors for hyperlinks and use a weighted-sum approach to compute the priorities of unvisited hyperlinks. However, proper weight coefficients are hard to determine, and unsuitable values may even cause the crawler to deviate seriously from the topic. To overcome this issue, this article builds a multi-objective optimization model based on Web text and link structure and designs a crawler framework called Web space evolution (WSE), in which a hyperlink bank with a gradually increasing radius is introduced to extend the search scope of crawlers in Web space. To improve the uniformity and diversity of hyperlinks, a nearest and farthest candidate solution method is combined with fast non-dominated sorting to choose Pareto-optimal solutions (hyperlinks). A domain ontology based on formal concept analysis is applied to establish the topic model. By incorporating the WSE and the domain ontology into focused crawling, a novel focused crawler called FCWSEO is proposed to collect topic-relevant webpages. Experimental results on the rainstorm disaster domain show that FCWSEO outperforms other focused crawler strategies in terms of both the quantity and the quality of retrieved relevant webpages.
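The contrast between weighted-sum ranking and Pareto-based selection can be illustrated with a minimal sketch. It is not the FCWSEO implementation: the two objectives (`text_score`, `link_score`), the example URLs, and the function names are assumptions made for illustration, and only the first non-dominated front is extracted rather than the full fast non-dominated sorting combined with the nearest and farthest candidate solution method.

```python
from typing import List, Tuple

# Hypothetical candidate: (url, text_score, link_score); both scores are maximized.
Link = Tuple[str, float, float]

def dominates(a: Link, b: Link) -> bool:
    """True if a is no worse than b on both objectives and strictly better on one."""
    return a[1] >= b[1] and a[2] >= b[2] and (a[1] > b[1] or a[2] > b[2])

def pareto_front(links: List[Link]) -> List[Link]:
    """Return the hyperlinks that no other candidate dominates (first front only)."""
    return [p for p in links if not any(dominates(q, p) for q in links)]

def weighted_sum_best(links: List[Link], w_text: float, w_link: float) -> Link:
    """Pick the single best hyperlink under fixed, hand-chosen weights."""
    return max(links, key=lambda l: w_text * l[1] + w_link * l[2])

# Made-up scores, purely for illustration.
candidates: List[Link] = [
    ("http://example.org/a", 0.9, 0.2),
    ("http://example.org/b", 0.5, 0.6),
    ("http://example.org/c", 0.2, 0.9),
    ("http://example.org/d", 0.4, 0.5),  # dominated by /b on both objectives
]

front = pareto_front(candidates)
front_urls = [link[0] for link in front]

# The weighted-sum choice flips with the weights, which is the sensitivity
# the abstract describes; the Pareto front needs no weights at all.
best_text_heavy = weighted_sum_best(candidates, 0.9, 0.1)
best_link_heavy = weighted_sum_best(candidates, 0.1, 0.9)
```

In this toy data the front keeps three mutually non-dominated hyperlinks, whereas the weighted-sum choice depends entirely on which weights were picked.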

A novel focused crawler combining Web space evolution and domain ontology