Abstract: Dictionary-based entity extraction is an important task in many data analysis applications, such as academic search, document classification, and code auto-debugging. To improve the effectiveness of extraction, many previous studies have focused on approximate dictionary-based entity extraction, which aims to find all substrings in documents that are similar to predefined entities in a reference entity dictionary. However, these studies consider only syntactic similarity metrics, such as Jaccard and edit distance. In real-world scenarios, syntactically different strings often express the same meaning. Existing approximate entity extraction methods fail to identify this kind of semantic similarity and therefore suffer from low recall. In this paper, we introduce the new problem of approximate dictionary-based entity extraction with synonyms and propose an end-to-end framework, Aeetes, to solve it. We propose a new similarity measure, Asymmetric Rule-based Jaccard (JaccAR), which combines synonym rules with syntactic similarity metrics to capture the semantic similarity expressed in the synonyms. We devise and implement a filter-and-verification strategy to improve overall efficiency: we propose several pruning techniques to reduce the filter cost and develop novel strategies to improve verification performance. Experimental results on three real-world datasets demonstrate the superior effectiveness and efficiency of Aeetes.
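
To make the idea of combining synonym rules with a syntactic metric concrete, the following is a minimal sketch and not the JaccAR definition or the Aeetes implementation, neither of which is given in the abstract. It assumes token-level Jaccard, assumes that rules are applied only to the dictionary entity (one plausible reading of "asymmetric"), and assumes that at most one rule is applied per variant. The function names (`jaccard`, `apply_rule`, `synonym_aware_jaccard`, `extract`), the `max_len` parameter, and the sample data are all hypothetical. The extraction loop is a brute-force baseline that checks every window, with none of the pruning a filter-and-verification strategy would add.

```python
def jaccard(a, b):
    """Token-set Jaccard similarity between two token lists."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def apply_rule(tokens, lhs, rhs):
    """Replace the first contiguous occurrence of lhs in tokens with rhs.

    Returns the rewritten token list, or None if lhs does not occur.
    """
    n = len(lhs)
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) == tuple(lhs):
            return tokens[:i] + list(rhs) + tokens[i + n:]
    return None


def synonym_aware_jaccard(entity, substring, rules):
    """Best Jaccard score over the entity and its single-rule variants.

    Assumption: rules rewrite the entity side only, and each variant
    applies at most one rule. This is a simplification for illustration.
    """
    variants = [list(entity)]
    for lhs, rhs in rules:
        rewritten = apply_rule(list(entity), lhs, rhs)
        if rewritten is not None:
            variants.append(rewritten)
    return max(jaccard(v, substring) for v in variants)


def extract(doc_tokens, dictionary, rules, threshold, max_len=4):
    """Brute-force extraction: test every window of up to max_len tokens."""
    matches = []
    for start in range(len(doc_tokens)):
        last = min(start + max_len, len(doc_tokens))
        for end in range(start + 1, last + 1):
            window = doc_tokens[start:end]
            for entity in dictionary:
                if synonym_aware_jaccard(entity, window, rules) >= threshold:
                    matches.append((start, end, tuple(entity)))
    return matches


# Hypothetical data: one dictionary entity and one synonym rule.
entity = ["univ", "of", "washington"]
rules = [(("univ",), ("university",))]
doc = "she joined the university of washington in 2019".split()

plain_score = jaccard(entity, doc[3:6])
rule_score = synonym_aware_jaccard(entity, doc[3:6], rules)
found = extract(doc, [entity], rules, threshold=0.9)
```

In this toy case the plain Jaccard score between the entity and the window "university of washington" is 0.5, because "univ" and "university" are different tokens, while the rule-aware score is 1.0. With a threshold of 0.9, a purely syntactic matcher would miss the mention, which is the kind of recall loss the abstract describes.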

A Technical Report: Entity Extraction Using Both Character-based and Token-based Similarity

2ED: An Efficient Entity Extraction Algorithm Using Two-Level Edit-Distance

Research of Entity Matching Based on Multiple Heterogeneous Data

A Unified Framework for Approximate Dictionary-Based Entity Extraction

Faerie: efficient filtering algorithms for approximate dictionary-based entity extraction

An Efficient Trie-based Method for Approximate Entity Extraction with Edit-Distance Constraints

Boosting approximate dictionary-based entity extraction with synonyms

Document similarity measure based on named entity

Entity matching: how similar is similar

Extending Dictionary-Based Entity Extraction to Tolerate Errors

Entity Extraction with Knowledge from Web Scale Corpora

Entity Matching Across Heterogeneous Sources

Crowd-Guided Entity Matching with Consolidated Textual Data

CTextEM: Using Consolidated Textual Data for Entity Matching

Multi-Context Attention for Entity Matching

CTextEM: Employing Compound Textual Information in Entity Matching

Entity disambiguation with context awareness in user-generated short texts

Statistical Entity Extraction From the Web

Efficient Entity Translation Mining

A New Entity Extraction Method Based on Machine Reading Comprehension

A Pattern-Based Method for Medical Entity Recognition From Chinese Diagnostic Imaging Text