Wrangling Data Issues to be Wrangled: Literature Review, Taxonomy, and Industry Case Study

Qiaolin Qin, Heng Li, Ettore Merlo
2024-05-25
Abstract: Data quality is vital for user experience in products reliant on data. To address data quality problems, researchers have developed various taxonomies for different types of issues. However, although some existing taxonomies are near-comprehensive, their complexity has limited their actionability in developing solutions to data issues. Hence, recent researchers have proposed new, more concise sets of data issue categories for better usability. Although more concise, modern data issue labeling caters too closely to specific solution systems, which can cause the categories not to be mutually exclusive. Consequently, different categories sometimes overlap when determining issue types, or the same category carries different definitions across studies. This hinders solution development and confounds issue detection. Therefore, based on observations from a literature review and feedback from our industry partner, we propose a comprehensive taxonomy of data quality issues along two distinct dimensions: the attribute dimension, which represents the intrinsic characteristics of the issues, and the outcome dimension, which indicates how the issues manifest. With the categories redefined, we labeled the reported data issues in our industry partner's data warehouse. The labeled issues give us a general picture of the distribution of each type of problem and of which types require the most effort and care to resolve. Our work aims to establish a widely generalizable taxonomy for modern data quality issue engineering and to help practitioners and researchers understand their data issues and estimate the effort required to fix them.
Databases, Information Theory