CTextEM: Employing Compound Textual Information in Entity Matching

Qiang Yang, Zhixu Li, Binbin Gu, An Liu, Pengpeng Zhao, Guanfeng Liu, Lei Zhao
2015-01-01
Abstract: Entity Matching (EM) identifies records referring to the same entity within or across databases. Existing EM methods, which measure the similarities between structured attribute values (such as numeric, date, or short string values), may fail when the structured information is not enough to reflect the matching relationships between records. More and more data sets now have an unstructured textual attribute containing extra compound textual information (which we call CText) about the record, but little work has been done on using this information for EM. Conventional string similarity metrics such as edit distance or bag-of-words are unsuitable for measuring the similarities between CTexts, since each CText contains hundreds or thousands of words, and existing topic models do not work well either, since there are no obvious gaps between the various topics in a CText. In this paper, we work on employing CText in EM. We not only propose a novel co-occurrence-based topic model to identify the various topics of each CText, so that the similarity between CTexts can be measured on multiple topic dimensions, but also find ways to decrease the high cost of employing CText in EM from O(n+ e 2 ) to O(n+ e 2 ). Our empirical study shows that our method outperforms several previous methods and baselines, reaching higher EM precision and recall, and can improve EM efficiency by more than 60% on several real data collections.