Abstract: Most Web information extraction systems use DOM tree-based structured extraction rules to extract data records from Web pages; however, some data items of these records, or even whole records, are often in semi-structured or unstructured text form. Text data extraction rules are therefore needed to further extract the fine-grained data elements from these coarse-grained text items or records. Generating such rules is challenging, whether done manually or automatically. In this paper, we propose an unsupervised learning approach that automatically deduces text data extraction rules from a small sample of text records. First, to prepare for extraction rule template deduction, we propose an iterative center core multiple sequence alignment method to align text columns in the sample text records. Then, we propose an information entropy model based on the statistical features of text columns to identify each column as either a template column or a data column. From the identified template and data columns, plus some additional processing, we can quickly deduce the template, that is, the text data extraction rule. Finally, we use the text data extraction rule to perform automated text data extraction from test text records. This unsupervised learning approach requires no manual labeling and automates both the generation of text data extraction rules and the text data extraction process. It is the first study toward an unsupervised small sample learning approach for automated text data extraction rule generation. The experimental results show that our approach achieves high accuracy.
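The entropy step described above can be illustrated with a minimal sketch. It is not the paper's model: it assumes the records are already aligned (here simply by whitespace tokenization into equal-length rows, standing in for the iterative center core alignment), uses plain Shannon entropy over the tokens of each column, and relies on a hypothetical `threshold` parameter. The function names `column_entropy`, `classify_columns`, and `deduce_rule` are illustrative only.

```python
import math
import re
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the token distribution in one aligned column."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def classify_columns(aligned_records, threshold=0.5):
    """Label each column 'template' (low entropy) or 'data' (high entropy).

    `threshold` is an assumed cut-off, not a value taken from the paper.
    """
    columns = list(zip(*aligned_records))
    return [
        "template" if column_entropy(col) <= threshold else "data"
        for col in columns
    ]


def deduce_rule(aligned_records, threshold=0.5):
    """Build a regular expression: template columns become literals,
    data columns become capture groups."""
    labels = classify_columns(aligned_records, threshold)
    parts = []
    for col, label in zip(zip(*aligned_records), labels):
        if label == "template":
            # Use the most frequent token as the literal for this column.
            literal = Counter(col).most_common(1)[0][0]
            parts.append(re.escape(literal))
        else:
            parts.append(r"(\S+)")
    return re.compile(r"\s+".join(parts))


# Small sample of text records (made-up illustration data).
sample = ["Price: 12 USD", "Price: 7 USD", "Price: 30 USD"]
aligned = [record.split() for record in sample]

labels = classify_columns(aligned)
rule = deduce_rule(aligned)

# Apply the deduced rule to a test record that was not in the sample.
extracted = rule.fullmatch("Price: 99 USD").groups()
```

In this toy case the first and third columns are constant across the sample, so their entropy is zero and they are treated as template text, while the varying second column becomes the extracted data field. Real records would need a proper alignment step with gap handling before the column statistics are meaningful.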

Automated Text Data Extraction Based on Unsupervised Small Sample Learning