Deep Sequence-to-Sequence Entity Matching for Heterogeneous Entity Resolution
Hao Nie, Xianpei Han, Ben He, Le Sun, Bo Chen, Wei Zhang, Suhui Wu, Hao Kong
DOI: https://doi.org/10.1145/3357384.3358018
2019-11-03
Abstract: Entity Resolution (ER) identifies records from different data sources that refer to the same real-world entity. Conventional ER approaches usually employ a structure matching mechanism, in which attributes are aligned, compared, and aggregated to reach an ER decision. Unfortunately, structure matching approaches often struggle with heterogeneous and dirty ER problems: entities from different data sources are described using different schemas, and attribute values may be misplaced, missing, or noisy. In this paper, we propose a deep sequence-to-sequence entity matching model, denoted Seq2SeqMatcher, which addresses the heterogeneous and dirty problems by modeling ER as a token-level sequence-to-sequence matching task. Specifically, we propose an align-compare-aggregate neural network for Seq2Seq entity matching, which learns the representations of tokens, captures the semantic relevance between tokens, and aggregates matching evidence for accurate ER decisions in an end-to-end manner. Experimental results show that, by comparing entity records at the token level and learning all components end to end, our Seq2Seq entity matching model achieves remarkable performance improvements on 9 standard entity resolution benchmarks.
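The token-level align-compare-aggregate idea described in the abstract can be illustrated with a small, non-neural sketch. This is not the Seq2SeqMatcher architecture: it assumes character-trigram Jaccard similarity as a stand-in for learned token representations, a fixed softmax temperature for the alignment step, and a plain mean as the aggregation step. All function names (`tokenize`, `align`, `compare`, `match_score`) and the sample records are hypothetical.

```python
import math


def tokenize(record):
    """Flatten a record (attribute -> value) into one token sequence.

    The schema is ignored on purpose, so misplaced or missing attribute
    values do not block the comparison.
    """
    tokens = []
    for value in record.values():
        if value:  # skip missing values
            tokens.extend(str(value).lower().split())
    return tokens


def trigrams(token):
    """Character trigrams of a padded token (stand-in for an embedding)."""
    padded = f"##{token}#"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def token_sim(a, b):
    """Jaccard similarity between the trigram sets of two tokens."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def align(token, other_tokens, temperature=0.1):
    """Align: softmax attention of one token over the other record's tokens."""
    sims = [token_sim(token, other) for other in other_tokens]
    exps = [math.exp(s / temperature) for s in sims]
    total = sum(exps)
    return sims, [e / total for e in exps]


def compare(sims, weights):
    """Compare: attention-weighted similarity of a token to its alignment."""
    return sum(s * w for s, w in zip(sims, weights))


def match_score(rec_a, rec_b):
    """Aggregate: average the per-token comparison scores in both directions."""
    tokens_a, tokens_b = tokenize(rec_a), tokenize(rec_b)
    if not tokens_a or not tokens_b:
        return 0.0
    a_to_b = [compare(*align(t, tokens_b)) for t in tokens_a]
    b_to_a = [compare(*align(t, tokens_a)) for t in tokens_b]
    return 0.5 * (sum(a_to_b) / len(a_to_b) + sum(b_to_a) / len(b_to_a))


# Records with different schemas and a missing value (illustrative data only).
rec_a = {"title": "iphone 11 pro", "brand": "apple"}
rec_b = {"name": "apple iphone 11 pro", "maker": None}
rec_c = {"name": "galaxy s10", "maker": "samsung"}

score_same = match_score(rec_a, rec_b)
score_diff = match_score(rec_a, rec_c)
print(score_same > score_diff)
```

Because the records are compared as token sequences, the brand name appearing under `brand` in one record and inside `name` in the other does not prevent a match. In the model the abstract describes, the similarity function and the aggregation would be learned jointly rather than fixed by hand as they are here.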