Toward a view-based data cleaning architecture

Toshiyuki Shimizu, Hiroki Omori, Masatoshi Yoshikawa
DOI: https://doi.org/10.48550/arXiv.1910.11040
2019-10-24
Abstract: Big data analysis has become an active area of study with the growth of machine learning techniques. To properly analyze data, it is important to maintain high-quality data. Thus, research on data cleaning is also important. It is difficult to automatically detect and correct inconsistent values for data requiring expert knowledge or data created by many contributors, such as integrated data from heterogeneous data sources. An example of such data is metadata for scientific datasets, which should be confirmed by data managers while handling the data. To support the efficient cleaning of data by data managers, we propose a data cleaning architecture in which data managers interactively browse and correct portions of data through views. In this paper, we explain our view-based data cleaning architecture and discuss some remaining issues.
Databases
What problem does this paper attempt to address?
This paper addresses how to efficiently clean data that requires expert knowledge or is created by many contributors, in the context of big data analysis. Such data often contains inconsistent values that are difficult to detect and correct with automated rules or machine learning methods. Specifically, the paper focuses on cleaning metadata for scientific datasets. This metadata describes the details of a dataset, such as its name, category, and data provider, and is usually filled in by the person in charge of the dataset. Because different people fill in the entries, inconsistencies are likely to occur. The paper proposes a view-based data cleaning architecture that supports data managers in improving data quality by interactively browsing and correcting portions of the data through views.
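As a rough illustration of what browsing and correcting data through a view might look like, the following minimal sketch uses Python's built-in sqlite3 module. It is not the authors' implementation: the table layout, the view name `provider_view`, the sample rows, and the helper functions `build_demo`, `browse`, and `correct` are all hypothetical and chosen only for illustration.

```python
import sqlite3


def build_demo():
    """Create an in-memory metadata table and a view over it (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE metadata ("
        "id INTEGER PRIMARY KEY, name TEXT, category TEXT, provider TEXT)"
    )
    # Made-up sample rows; the same provider is spelled two different ways.
    conn.executemany(
        "INSERT INTO metadata (name, category, provider) VALUES (?, ?, ?)",
        [
            ("Sea surface temperature", "Ocean", "Example Univ."),
            ("Rainfall 2018", "Climate", "Example University"),
            ("Soil moisture", "Land", "Example Univ."),
            ("Wind speed", "Climate", "Agency X"),
        ],
    )
    # The view exposes one portion of the data: distinct provider values
    # together with how many records use each spelling.
    conn.execute(
        "CREATE VIEW provider_view AS "
        "SELECT provider, COUNT(*) AS n FROM metadata GROUP BY provider"
    )
    return conn


def browse(conn):
    """Return the rows a data manager would see when browsing the view."""
    return conn.execute(
        "SELECT provider, n FROM provider_view ORDER BY provider"
    ).fetchall()


def correct(conn, old_value, new_value):
    """Apply a correction chosen in the view to the underlying table.

    Returns the number of base-table rows that were updated.
    """
    cur = conn.execute(
        "UPDATE metadata SET provider = ? WHERE provider = ?",
        (new_value, old_value),
    )
    conn.commit()
    return cur.rowcount
```

In this sketch, a data manager would call `browse(conn)` to spot the two spellings of the same provider, then call `correct(conn, "Example Univ.", "Example University")` to unify them. Because the view is defined over the base table, browsing again reflects the correction. How the paper's architecture actually maps corrections made in a view back to the source data is not specified in the text above, so the direct `UPDATE` here is only an assumption.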