Extending Dictionary-Based Entity Extraction to Tolerate Errors.

Guoliang Li, Dong Deng, Jianhua Feng
DOI: https://doi.org/10.1145/1871437.1871616
2010-01-01
Abstract: Entity extraction (also known as entity recognition) extracts entities (e.g., person names, locations, companies) from text. Approximate (dictionary-based) entity extraction, a recent trend for improving extraction quality, extracts substrings of the text that approximately match predefined entities in a given dictionary. In this paper, we study the problem of approximate entity extraction with edit-distance constraints. A straightforward method first extracts all substrings from the text and then, for each substring, identifies its similar entities in the dictionary using existing methods for approximate string search. However, many substrings of the text overlap, which gives us an opportunity to share computation across the overlaps and avoid unnecessary duplicate work. To this end, we propose a heap-based framework to extract entities efficiently. We have implemented our techniques, and the experimental results show that our method achieves high performance and significantly outperforms existing methods.
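To make the straightforward method concrete, here is a minimal Python sketch of the baseline the abstract describes: enumerate substrings of the text and compare each one against every dictionary entity under an edit-distance threshold. It is not the heap-based framework proposed in the paper, whose details the abstract does not give. The names `edit_distance`, `naive_extract`, and the threshold parameter `tau` are assumptions chosen for illustration, and plain Levenshtein distance stands in for whatever approximate string search method an actual system would use.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a rolling one-row DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def naive_extract(text: str, dictionary: list[str], tau: int) -> list[tuple[int, str, str]]:
    """Return (start, substring, entity) for each substring within tau edits of an entity.

    Hypothetical baseline: every candidate substring is verified independently,
    so work on overlapping substrings is repeated rather than shared.
    """
    matches = []
    for entity in dictionary:
        # A substring can only be within tau edits if its length differs by at most tau.
        min_len = max(1, len(entity) - tau)
        max_len = len(entity) + tau
        for start in range(len(text)):
            for length in range(min_len, max_len + 1):
                end = start + length
                if end > len(text):
                    break
                candidate = text[start:end]
                if edit_distance(candidate, entity) <= tau:
                    matches.append((start, candidate, entity))
    return matches
```

Under these assumptions, calling `naive_extract("visit new yrok today", ["new york"], 2)` reports the misspelled substring `"new yrok"` among its matches, together with neighbouring overlapping substrings that also fall within the threshold. Each of those overlapping candidates is verified from scratch, which is the duplicated computation the abstract points to.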