Abstract: In recent years, the trend toward open science has prompted many efforts to share research data on the internet. Promoting research data sharing requires data curation, which makes the data interpretable and reusable. In research fields such as the life sciences, earth sciences, and social sciences, tasks and procedures have already been developed to implement efficient data curation that meets the needs and customs of each field. However, promoting open science requires not only data sharing within research fields but also interdisciplinary data sharing. For this purpose, this paper surveys, analyzes, and organizes knowledge of data curation across research fields as an ontology. In the survey, existing vocabularies and procedures were collected and compared, and data curators at research institutes in different fields were interviewed, to clarify the commonalities and differences in data curation across fields. The survey showed that the granularity of the tasks and procedures that constitute the building blocks of data curation is not formalized. Without a method to overcome this gap, it will be challenging to promote interdisciplinary reuse of research data. Based on this analysis, an ontology for the data curation process is proposed to describe data curation processes in different fields universally. The ontology is described in OWL and shown to be valid and logically consistent. It successfully represents the data curation activities acquired through the interviews as processes in the different fields. It also helps to identify the functions of systems that support the data curation process. This study contributes to building a knowledge framework for an interdisciplinary understanding of data curation activities in different fields.

Toward a view-based data cleaning architecture

An Ontology-Based Approach to Data Cleaning

Human-Centric Data Cleaning [Vision]

Distance-based Data Cleaning: A Survey (Technical Report)

A View-based Programmable Architecture for Controlling and Integrating Decentralized Data

An Open Data Cleaning Framework Based on Semantic Rules for Continuous Auditing

A Survey on Data Cleaning Methods for Improved Machine Learning Model Performance

VisClean

Exploring Artificial Intelligence Architecture in Data Cleaning Based on Bayesian Networks

A study on formalizing the knowledge of data curation activities across different fields

Data Cleaning for Accurate, Fair, and Robust Models: A Big Data - AI Integration Approach

Batchwise Probabilistic Incremental Data Cleaning

Making View Update Strategies Programmable - Toward Controlling and Sharing Distributed Data -

Cleaning Relations Using Knowledge Bases

AI-Driven Frameworks for Enhancing Data Quality in Big Data Ecosystems: Error Detection, Correction, and Metadata Integration

A Primer on the Data Cleaning Pipeline

Data Cleaning and Machine Learning: A Systematic Literature Review

Semantic-based Intelligent Data Clean Framework for Big Data

Data Cleaning Model for XML Datasets using Conditional Dependencies

Horizon: Scalable Dependency-driven Data Cleaning