Enhancing data preparation: insights from a time series case study

Sancricca, Camilla,Cappiello, Cinzia
DOI: https://doi.org/10.1007/s10844-024-00867-8
2024-07-26
Journal of Intelligent Information Systems
Abstract:Data play a key role in AI systems that support decision-making processes. Data-centric AI highlights the importance of having high-quality input data to obtain reliable results. However, well-preparing data for machine learning is becoming difficult due to the variety of data quality issues and available data preparation tasks. For this reason, approaches that help users in performing this demanding phase are needed. This work proposes DIANA, a framework for data-centric AI to support data exploration and preparation, suggesting suitable cleaning tasks to obtain valuable analysis results. We design an adaptive self-service environment that can handle the analysis and preparation of different types of sources, i.e., tabular, and streaming data. The central component of our framework is a knowledge base that collects evidence related to the effectiveness of the data preparation actions along with the type of input data and the considered machine learning model. In this paper, we first describe the framework, the knowledge base model, and its enrichment process. Then, we show the experiments conducted to enrich the knowledge base in a particular case study: time series data streams.
computer science, information systems, artificial intelligence
What problem does this paper attempt to address?