Boosting approximate dictionary-based entity extraction with synonyms

Jin Wang, Chunbin Lin, Mingda Li, Carlo Zaniolo
DOI: https://doi.org/10.1016/j.ins.2020.04.025
IF: 8.1
2020-08-01
Information Sciences
Abstract: Dictionary-based entity extraction is an important task in many data analysis applications, such as academic search, document classification, and code auto-debugging. To improve the effectiveness of extraction, many previous studies focused on the problem of approximate dictionary-based entity extraction, which aims at finding all substrings in documents that are similar to pre-defined entities in a reference entity dictionary. However, these studies only consider syntactic similarity metrics, such as Jaccard and edit distance. In real-world scenarios, there are many cases where syntactically different strings express the same meaning. Existing approximate entity extraction work fails to identify this kind of semantic similarity and therefore suffers from low recall.
In this paper, we introduce the new problem of approximate dictionary-based entity extraction with synonyms and propose an end-to-end framework, Aeetes, to solve it. We propose a new similarity measure, Asymmetric Rule-based Jaccard (JaccAR), which combines synonym rules with syntactic similarity metrics to capture the semantic similarity expressed in the synonyms. We devise and implement a filter-and-verification strategy to improve overall efficiency. To this end, we propose several pruning techniques to reduce the filter cost and develop novel strategies to improve verification performance. Experimental results on three real-world datasets demonstrate the superior effectiveness and efficiency of Aeetes.
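The idea of combining synonym rules with Jaccard can be illustrated with a small brute-force sketch. The abstract does not give the definition of JaccAR, so the code below rests on stated assumptions: rules are applied only to the dictionary entity (one reading of "asymmetric"), and the score is the best plain Jaccard over all rewritten variants of the entity. The function names, the rule format (one token mapped to replacement token lists), and the window-length bound are hypothetical choices for illustration, and no filter or pruning step from Aeetes is reproduced here.

```python
from itertools import product


def jaccard(a, b):
    """Jaccard similarity between two token collections, compared as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def derived_variants(entity_tokens, rules):
    """Yield entity variants obtained by optionally rewriting single tokens.

    `rules` maps a token to a list of replacement token sequences
    (a hypothetical format, chosen only for this sketch).
    """
    options = []
    for tok in entity_tokens:
        choices = [(tok,)] + [tuple(rhs) for rhs in rules.get(tok, [])]
        options.append(choices)
    for combo in product(*options):
        yield tuple(t for part in combo for t in part)


def rule_based_jaccard(entity_tokens, substring_tokens, rules):
    """Assumed score: best Jaccard over all entity-side variants."""
    return max(
        jaccard(variant, substring_tokens)
        for variant in derived_variants(entity_tokens, rules)
    )


def extract(document, dictionary, rules, threshold):
    """Brute-force extraction: score every token window against every entity."""
    tokens = document.lower().split()
    matches = []
    for entity in dictionary:
        e_tokens = entity.lower().split()
        longest = max(len(v) for v in derived_variants(e_tokens, rules))
        # A window longer than longest / threshold cannot reach the threshold.
        max_window = int(longest / threshold)
        for start in range(len(tokens)):
            for length in range(1, max_window + 1):
                if start + length > len(tokens):
                    break
                window = tokens[start:start + length]
                score = rule_based_jaccard(e_tokens, window, rules)
                if score >= threshold:
                    matches.append((entity, " ".join(window), score))
    return matches


if __name__ == "__main__":
    rules = {"uc": [["university", "of", "california"]]}
    doc = "she studied at university of california berkeley last year"
    print(extract(doc, ["uc berkeley"], rules, threshold=0.9))
```

In this toy run, plain Jaccard between "uc berkeley" and "university of california berkeley" is only 0.2, so a purely syntactic matcher would miss the mention, while the rule-aware score finds it. Scoring every window against every variant is far too slow for real documents, which is the cost a filter-and-verification strategy is meant to reduce.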
computer science, information systems
What problem does this paper attempt to address?