A Duplicate Web Entity Identification Approach Based on Iterative Training

刘伟,肖建国
DOI: https://doi.org/10.3778/j.issn.1673-9418.2010.07.003
2010-01-01
Abstract: The large number of Web data sources accessible online makes it convenient for users to obtain the information they want. As a necessary step in Web data integration, duplicate Web entities with differing presentations must be identified accurately across Web data sources. To the best of our knowledge, previous work addresses this problem only between two data sources, and the large quantity of Web data sources makes such approaches impractical. To this end, an effective iterative-training-based approach is proposed for duplicate Web entity identification; it can be applied to multiple Web data sources using a small training set. Extensive experiments on the book and computer domains validate the effectiveness of the proposed approach.