Wrangling Data Issues to be Wrangled: Literature Review, Taxonomy, and Industry Case Study

Qiaolin Qin, Heng Li, Ettore Merlo
2024-05-25
Abstract: Data quality is vital for user experience in products reliant on data. As solutions for data quality problems, researchers have developed various taxonomies for different types of issues. However, although some of the existing taxonomies are near-comprehensive, the over-complexity has limited their actionability in data issue solution development. Hence, recent researchers issued new sets of data issue categories that are more concise for better usability. Although more concise, modern data issue labeling's over-catering to the solution systems may sometimes cause the taxonomy to be not mutually exclusive. Consequently, different categories sometimes overlap in determining the issue types, or the same categories share different definitions across research. This hinders solution development and confounds issue detection. Therefore, based on observations from a literature review and feedback from our industry partner, we propose a comprehensive taxonomy of data quality issues from two distinct dimensions: the attribute dimension represents the intrinsic characteristics and the outcome dimension that indicates the manifestation of the issues. With the categories redefined, we labeled the reported data issues in our industry partner's data warehouse. The labeled issues provide us with a general idea of the distributions of each type of problem and which types of issues require the most effort and care to deal with. Our work aims to address a widely generalizable taxonomy rule in modern data quality issue engineering and helps practitioners and researchers understand their data issues and estimate the efforts required for issue fixing.
Databases, Information Theory
What problem does this paper attempt to address?
This paper addresses the complexity and inconsistency in how data quality problems are classified and handled in modern data quality management. Specifically, it points out that although existing taxonomies of data quality problems are nearly comprehensive, their complexity limits their usability in practice. More recent, more concise label sets are tailored too closely to particular solution systems, which sometimes makes the categories non-mutually-exclusive: different categories overlap, or the same category is defined differently across studies. This hinders the development of solutions and confounds issue detection.

To solve these problems, the authors propose a new, comprehensive, and mutually exclusive classification system for data quality problems. It classifies problems along two dimensions: the attribute dimension, which describes the intrinsic characteristics of a problem, and the outcome dimension, which indicates how the problem manifests. By redefining these categories and applying them to the data issues reported in their industry partner's data warehouse, the authors aim to provide a general, actionable classification rule that helps practitioners and researchers understand their data quality problems and estimate the effort required to fix them.

### Main contributions:

1. **Literature review**: Reviewed the research on data quality problems from the past 10 years and compared the data quality problem categories identified in these studies.
2. **New classification system**: Proposed a new, hierarchical, mutually exclusive classification system for data quality problems, covering problems that undermine integrity as well as data smells.
3. **Multidimensional classification**: Classified data quality problems along the attribute dimension and the outcome dimension, making the classification clearer and more practical.
4. **Practical application analysis**: Classified and analyzed the data problems in a real, large-scale data warehouse, providing insight into how these data quality problems are distributed and how difficult they are to handle.
5. **Development guidance**: Analyzed the issue ticket information in Jira in detail to identify which types of data problems are the most difficult for developers to fix, providing guidance for managing and prioritizing data quality problems.

Through this work, the paper not only addresses the deficiencies of existing classification systems but also provides a foundation for future research and practice.
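The two-dimensional labeling and the per-category analysis described above can be illustrated with a minimal Python sketch. It is not the authors' tooling. The label names (`attr_A`, `outcome_X`, and so on), the ticket IDs, and the `days_to_fix` effort proxy are hypothetical placeholders, since the summary names neither the concrete categories nor the effort measure used in the study.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class LabeledIssue:
    ticket_id: str      # hypothetical ticket key
    attribute: str      # exactly one label on the attribute dimension
    outcome: str        # exactly one label on the outcome dimension
    days_to_fix: float  # hypothetical proxy for fixing effort


def label_distribution(issues):
    """Count issues for each (attribute, outcome) label pair."""
    return Counter((issue.attribute, issue.outcome) for issue in issues)


def mean_effort_by_attribute(issues):
    """Average the effort proxy for each attribute-dimension label."""
    grouped = defaultdict(list)
    for issue in issues:
        grouped[issue.attribute].append(issue.days_to_fix)
    return {label: mean(values) for label, values in grouped.items()}


# Placeholder data; these are not the paper's categories or results.
issues = [
    LabeledIssue("T-1", "attr_A", "outcome_X", 2.0),
    LabeledIssue("T-2", "attr_A", "outcome_Y", 4.0),
    LabeledIssue("T-3", "attr_B", "outcome_X", 1.0),
    LabeledIssue("T-4", "attr_A", "outcome_X", 6.0),
]

distribution = label_distribution(issues)
effort = mean_effort_by_attribute(issues)
hardest = max(effort, key=effort.get)  # attribute label with the highest mean effort
```

Storing one label per dimension on each issue reflects the mutual-exclusivity requirement: an issue cannot fall into two categories of the same dimension, so the counts in `distribution` sum to the number of issues.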