Abstract: Most Web information extraction systems use DOM tree-based structured extraction rules to extract data records from Web pages; however, some data items of these records, or even whole records, are often in semi-structured or unstructured text form. Thus, text data extraction rules are needed to further extract the fine-grained data elements from those coarse-grained text items or records. However, generating text data extraction rules is challenging, whether done manually or automatically. In this paper, we propose an unsupervised learning approach that automatically deduces text data extraction rules from a small sample of text records. First, to prepare for extraction rule template deduction, we propose an iterative center core multiple sequence alignment method to align text columns in sample text records. Then, we propose an information entropy model based on the statistical features of text columns to identify each column as either a template column or a data column. From the identified template and data columns, plus some additional processing, we can quickly deduce the template, that is, the text data extraction rule. Finally, we use the text data extraction rule to perform automated text data extraction from test text records. This unsupervised learning approach needs no manual labeling and automates both the generation of text data extraction rules and the extraction process itself. It is the first study of an unsupervised small-sample learning approach to automated text data extraction rule generation. The experimental results show that our approach achieves high accuracy.
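The entropy-based column labeling described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the sample records are already aligned (here, simply by whitespace tokenization with equal token counts, in place of the iterative center core alignment), and the function names, the 0.5-bit threshold, and the sample records are all hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter


def column_entropy(values):
    """Shannon entropy (in bits) of the token distribution in one aligned column."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def deduce_template(aligned_records, threshold=0.5):
    """Build a regex from aligned records (assumed rule, not the paper's model).

    Columns with entropy at or below `threshold` are treated as template
    columns and become literals; all other columns are treated as data
    columns and become capture groups.
    """
    parts = []
    for col in zip(*aligned_records):
        if column_entropy(col) <= threshold:
            # Template column: use its most frequent token as a literal.
            literal = Counter(col).most_common(1)[0][0]
            parts.append(re.escape(literal))
        else:
            # Data column: capture any run of non-whitespace characters.
            parts.append(r"(\S+)")
    return re.compile(r"\s+".join(parts))


# Hypothetical sample records; whitespace splitting stands in for alignment.
samples = [
    "Price: 12 USD Stock: 4",
    "Price: 30 USD Stock: 9",
    "Price: 7 USD Stock: 15",
]
aligned = [s.split() for s in samples]
template = deduce_template(aligned)

# Apply the deduced rule to an unseen test record.
match = template.fullmatch("Price: 99 USD Stock: 2")
fields = match.groups() if match else ()
```

In this sketch a column whose tokens never vary has zero entropy and is kept as fixed template text, while a column whose tokens differ across records has high entropy and is extracted as data. Real records would need a genuine alignment step to handle missing or variable-length fields.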

Novel semi-supervised text entity information extraction method

Semi-Supervised Mesh Segmentation and Labeling

Label-Free Distant Supervision for Relation Extraction via Knowledge Graph Embedding

Entity Relationship Extraction Based on Bi-LSTM and Attention Mechanism

Entity relationship extraction method, entity relationship learning model acquisition method and equipment

Semi-supervised Label Enhancement Via Structured Semantic Extraction

A Supervised Learning Approach to Entity Search

A Novel Chinese Entity Relationship Extraction Method Based on the Bidirectional Maximum Entropy Markov Model

Relation Extraction Method Combining Clause Level Distant Supervision and Semi-supervised Ensemble Learning

Multi-language entity relationship extraction method and system based on adversarial training mechanism

Chinese Medical Entity Annotation Based on Autonomous Learning

An Entity-Relation Joint Extraction Method Based on Two Independent Sub-Modules From Unstructured Text

Automated Text Data Extraction Based on Unsupervised Small Sample Learning

A Web Semantic-Based Text Analysis Approach for Enhancing Named Entity Recognition Using PU-Learning and Negative Sampling

Shatter and Gather: Learning Referring Image Segmentation with Text Supervision

Tell me your position: Distantly supervised biomedical entity relation extraction using entity position marker

A Novel Document-Level Relation Extraction Method Based on BERT and Entity Information

Distantly Supervised Named Entity Recognition Using Positive-Unlabeled Learning

Semi-Supervised Text Classification Using Positive and Unlabeled Data

Utilizing Entity-Based Gated Convolution and Multilevel Sentence Attention to Improve Distantly Supervised Relation Extraction

Self-Teaching Semantic Annotation Method for Knowledge Discovery from Text