A survey of open-source data quality tools: shedding light on the materialization of data quality dimensions in practice

Vasileios Papastergios, Anastasios Gounaris
2024-07-26
Abstract: Data Quality (DQ) describes the degree to which data characteristics meet requirements and are fit for use by humans and/or systems. There are several aspects in which DQ can be measured, called DQ dimensions (e.g. accuracy, completeness, consistency), also referred to as characteristics in the literature. The ISO/IEC 25012 standard defines a data quality model with fifteen such dimensions, setting the requirements a data product should meet. In this short report, we aim to bridge the gap between lower-level functionalities offered by DQ tools and higher-level dimensions in a systematic manner, revealing the many-to-many relationships between them. To this end, we examine 6 open-source DQ tools and place emphasis on providing a mapping between the functionalities they offer and the DQ dimensions, as defined by the ISO standard. Wherever applicable, we also provide insights into the software engineering details that the tools leverage in order to address DQ challenges.
Databases
What problem does this paper attempt to address?
The problem this paper attempts to solve is how data quality (DQ) tools connect, in practice, with the theoretical data quality dimensions. Specifically, the authors aim to bridge the gap between low-level functions and high-level data quality dimensions and to reveal the many-to-many relationships between them.

### Detailed Explanation

1. **Background Problem**:
   - Data quality (DQ) has become crucial in the modern data-driven era. High-quality data is essential for ensuring the integrity of analysis, improving machine-learning models, and supporting business intelligence work.
   - Although many data quality tools have been developed to meet these challenges, they differ significantly in the terminology they use and in how they relate to data quality dimensions. For example, different tools may describe the same function with different terms, or the same term may have different meanings in different tools.
2. **Research Objectives**:
   - **Unifying Terminology**: Provide a comprehensive, unified list of the functions offered by six widely used open-source data quality tools, unifying the terminology as far as possible.
   - **Mapping Functions and Dimensions**: Connect existing solutions with the data quality dimensions defined by the ISO/IEC 25012 standard and present the mapping between low-level functions and standard dimensions.
3. **Methodology**:
   - Select and investigate six widely used open-source data quality tools: Deequ, dbt Core, MobyDQ, Great Expectations (GX), Soda Core, and Apache Griffin.
   - Study the source code and official documentation of each tool in depth, and identify its core functions through local experiments to avoid confusion caused by terminological diversity.
   - Based on the recorded functions, reveal the relationships between these functions and the data quality dimensions defined by the ISO/IEC 25012 standard.
4. **Main Contributions**:
   - Provide a detailed table (such as Table 1) listing the low-level functions offered by each tool and their relationships with the data quality dimensions defined by the ISO standard.
   - Reveal the engineering details and technical implementation methods these tools use to address data quality dimensions.

### Summary

The core of this paper is a systematic investigation of existing open-source data quality tools that reveals the relationships between the low-level functions they provide and the corresponding data quality dimensions, offering theoretical support and practical guidance for understanding and improving data quality. This helps to standardize the functional descriptions of data quality tools and to improve their effectiveness and consistency in practical applications.
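
The many-to-many relationship between tool functions and DQ dimensions can be pictured as a small lookup structure. The sketch below is illustrative only: the function names (`missing_values_check`, `value_range_check`, `duplicate_rows_check`) and their dimension assignments are hypothetical placeholders, not entries from the paper's Table 1 and not the API of any surveyed tool. Only the dimension names (accuracy, completeness, consistency) come from the abstract.

```python
from collections import defaultdict

# Hypothetical function-to-dimension assignments, for illustration only.
# They are NOT taken from the paper's Table 1 or from any surveyed tool.
FUNCTION_TO_DIMENSIONS = {
    "missing_values_check": {"completeness"},
    "value_range_check": {"accuracy", "consistency"},
    "duplicate_rows_check": {"consistency"},
}


def invert(mapping):
    """Build the reverse view: dimension -> set of functions."""
    inverted = defaultdict(set)
    for function, dimensions in mapping.items():
        for dimension in dimensions:
            inverted[dimension].add(function)
    return dict(inverted)


DIMENSION_TO_FUNCTIONS = invert(FUNCTION_TO_DIMENSIONS)

if __name__ == "__main__":
    # One function can serve several dimensions, and one dimension
    # can be served by several functions.
    for dimension in sorted(DIMENSION_TO_FUNCTIONS):
        print(dimension, sorted(DIMENSION_TO_FUNCTIONS[dimension]))
```

Keeping both directions of the mapping makes it possible to ask either "which dimensions does this check support?" or "which checks cover this dimension?", which is the kind of question the mapping in the paper is meant to answer.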