Abstract: The quality of master data is crucial for the accurate functioning of the various modules of an enterprise resource planning (ERP) system. This study addresses specific data problems arising from the generation of approximately duplicate material records in ERP databases. Such problems are mainly due to the firm's lack of unique and global identifiers for the material records, and to the arbitrary assignment of alternative names for the same material by various users. Traditional duplicate detection methods are ineffective in identifying such approximately duplicate material records because these methods typically rely on string comparisons of each field. To address this problem, a machine learning-based framework is developed to recognise semantic similarity between strings and to further identify and reunify approximately duplicate material records - a process referred to as de-duplication in this article. First, the keywords of the material records are extracted to form vectors of discriminating words. Second, a machine learning method using a probabilistic neural network is applied to determine the semantic similarity between these material records. The approach was evaluated using data from a real case study. The test results indicate that the proposed method outperforms traditional algorithms in identifying approximately duplicate material records.
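The two steps named in the abstract (keyword extraction, then a probabilistic neural network) can be illustrated with a minimal sketch. It is not the authors' implementation: the stop-word list, the two pair features (keyword overlap and keyword-count ratio), the smoothing parameter `sigma`, and the labelled training vectors are all assumptions made up for illustration. The classifier is the standard Parzen-window form of a probabilistic neural network, in which each class is scored by the average Gaussian kernel response of its training patterns.

```python
import math
import re

# Hypothetical stop list; a real system would use a domain-specific one.
STOP_WORDS = {"the", "of", "for", "and", "with"}


def keywords(description):
    """Lower-case a material description and return its set of keywords."""
    tokens = re.findall(r"[a-z0-9]+", description.lower())
    return {t for t in tokens if t not in STOP_WORDS}


def pair_features(a, b):
    """Describe a pair of records by two illustrative features in [0, 1]."""
    ka, kb = keywords(a), keywords(b)
    union = ka | kb
    if not union:
        return (0.0, 0.0)
    jaccard = len(ka & kb) / len(union)                      # keyword overlap
    size_ratio = min(len(ka), len(kb)) / max(len(ka), len(kb))
    return (jaccard, size_ratio)


def pnn_classify(x, training, sigma=0.2):
    """Return the label whose training patterns give the highest
    average Gaussian kernel response at x (Parzen-window PNN)."""
    scores = {}
    for label, patterns in training.items():
        total = 0.0
        for p in patterns:
            sq_dist = sum((xi - pi) ** 2 for xi, pi in zip(x, p))
            total += math.exp(-sq_dist / (2 * sigma ** 2))
        scores[label] = total / len(patterns)
    return max(scores, key=scores.get)


# Made-up labelled feature vectors standing in for manually reviewed pairs.
TRAINING = {
    "duplicate": [(0.8, 1.0), (0.6, 0.8), (0.75, 0.9)],
    "distinct": [(0.0, 0.5), (0.1, 1.0), (0.2, 0.7)],
}


def is_duplicate(a, b):
    """Classify a pair of material descriptions as approximate duplicates."""
    return pnn_classify(pair_features(a, b), TRAINING) == "duplicate"


if __name__ == "__main__":
    print(is_duplicate("Hex bolt M8 steel", "Steel bolt hex M8 zinc"))
    print(is_duplicate("Hex bolt M8 steel", "Rubber gasket 50 mm"))
```

Keyword sets make the comparison insensitive to word order, which a field-by-field string comparison is not. Note that this sketch captures only lexical overlap; recognising true synonyms, as the abstract describes, would require richer features than the two assumed here.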

A Duplicate Records Identification Model for Deep Web Data Sources

Duplicate Identification in Deep Web Data Integration

Automatic Web-based duplicate attribute resolution method

A Holistic Solution for Duplicate Entity Identification in Deep Web Data Integration

Attributes extraction of Deep Web query interface based on DOM

Domain-oriented Deep Web Data Sources' Discovery and Identification

Deep Web Data Sources Classification Based on Text VSM of Query Interface

Effective Approach to Deep Web Entries Identification

An Effective Schema Extraction Algorithm On The Deep Web

Duplicate Record Detection For Data Integration

Understanding the Search Interfaces of the Deep Web Based on Domain Model

A Method of Deepweb Schema Matching Based on Data Mining

Effective Schema Extraction of Query Interfaces on the Deep Web

An approach for deep web interface schema extraction based on hierarchical semantic annotation

A Duplicate Web Entity Identification Approach Based on Iterative Training

Ontology-Based Annotation for Deep Web Data

Hybrid Schema Matching For Deep Web

Data source selection with similar theme in Deep Web integrated system

Identification of Approximately Duplicate Material Records in ERP Systems

A Survey of Approaches to Large-Scale Schema Matching of Deep Web

Towards building the domain model for applications of deep web