Abstract: Simplifying the key tasks of search engine users by directly returning structured knowledge that matches their query intents has attracted much attention from both industry and academia. The challenge lies in automatically extracting structured knowledge from noisy, complex websites at web scale. Although various automatic wrapper induction algorithms have been proposed, many of them prove ineffective or inefficient when applied at web scale. In this paper, we propose an unsupervised automatic wrapper induction algorithm, named SKES, to efficiently extract knowledge from semi-structured websites. SKES induces the wrapper in a divide-and-conquer mode, dividing the general wrapper into sub-wrappers that can learn from data independently, which makes it efficient and easy to implement in parallel. Moreover, by employing techniques such as the tag path representation of web pages, SKES can dramatically reduce the number of tags and naturally differentiate their roles. The proposed solution was applied to and evaluated on a large number of real websites, and compared with the two existing methods most closely related to it. The proposed method was much more efficient than the existing methods and provided high extraction accuracy. We extracted 2.5 million entities and 29 million data fields from over 10,000 high-traffic websites, which demonstrates the applicability of the method. Furthermore, based on the automatically extracted data, we built a prototype that serves structured knowledge to simplify the key search tasks of end users. The feedback received for the prototype was highly positive.
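
To illustrate the general idea of a tag path representation, the following is a minimal sketch and not the SKES implementation: the abstract does not define the representation in detail, so the sketch assumes that a tag path is simply the root-to-node sequence of tag names for each text node. The names `TagPathCollector`, `tag_paths`, and the sample `page` are hypothetical. It shows how many text nodes on a template-generated page collapse onto a small number of distinct paths.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Groups the text nodes of a page by their root-to-node tag path."""

    def __init__(self):
        super().__init__()
        self.stack = []                       # currently open tags
        self.texts_by_path = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:                              # skip whitespace-only nodes
            self.texts_by_path["/".join(self.stack)].append(text)


def tag_paths(html):
    """Return a dict mapping each tag path to the texts found under it."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return dict(collector.texts_by_path)


# Hypothetical page with two records rendered by the same template.
page = """
<html><body>
<div><h1>Laptop A</h1><ul><li>Price</li><li>$999</li></ul></div>
<div><h1>Laptop B</h1><ul><li>Price</li><li>$1299</li></ul></div>
</body></html>
"""
paths = tag_paths(page)
for path, texts in paths.items():
    print(path, texts)
```

In this sketch the six text nodes fall under only two distinct paths, one for record titles and one for list items, which hints at how a path-based view can shrink the set of units a wrapper has to reason about. How SKES assigns roles to paths is not specified in the abstract and is not modelled here.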

Scalable and Noise Tolerant Web Knowledge Extraction for Search Task Simplification.
