ECS -- an Interactive Tool for Data Quality Assurance

Christian Sieberichs, Simon Geerkens, Alexander Braun, Thomas Waschulzik
2023-07-17
Abstract: With the increasing capabilities of machine learning systems and their potential use in safety-critical systems, ensuring high-quality data is becoming increasingly important. In this paper we present a novel approach for the assurance of data quality. For this purpose, the mathematical basics are first discussed and the approach is presented using multiple examples. This results in the detection of data points with potentially harmful properties for the use in safety-critical systems.
Machine Learning, Artificial Intelligence, Systems and Control
What problem does this paper attempt to address?
This paper addresses how to ensure data quality in machine learning systems, especially those deployed in high-risk areas such as safety-critical systems. Specifically, the authors propose a new method, ECS (Equivalent Classes Sets), to detect data points that may be harmful in such systems and to help ensure that training, validation, and test data sets are relevant, representative, free of errors, and complete.

### Background and Motivation

As the performance of machine learning systems continues to improve, their use has expanded across research, industry, and daily life. In high-risk areas such as medicine and self-driving vehicles, however, wrong decisions made by a machine learning system can have serious consequences, so the quality of the data these systems rely on is crucial. The European AI Act and research projects such as "KI-Absicherung" and "safetrAIn" also emphasize the importance of data quality in high-risk areas.

### Main Problems

1. **Data Quality Problems**: In high-risk areas, data must meet requirements such as relevance, representativeness, freedom from errors, and completeness. Traditional data quality assurance methods often rely on predefined rules and assumptions, which require a large amount of prior knowledge.
2. **Lack of a Universal Definition**: Although data quality and quality assurance are widely studied, there is currently no generally accepted definition. Different sources define data quality differently, usually dividing it into multiple attributes such as accuracy, consistency, and completeness.
3. **Limitations of Existing Methods**: Existing data quality assurance methods either focus on a single dimension (such as outlier detection) or rely on complex assumptions and rules, which limits their effectiveness on complex data sets.
### Solutions

To address these problems, the authors propose the ECS (Equivalent Classes Sets) method. Its main features are:

- **No Need for Predefined Rules**: ECS does not require the user to specify any assumptions or rules; it analyzes the relationships between data points instead.
- **Multi-dimensional Analysis**: ECS can analyze multiple data quality attributes simultaneously, not just a single dimension.
- **Interactive Tool**: Users interact with the data directly through a visual interface, which simplifies and accelerates the quality assurance process.

### Method Overview

The core idea of ECS is to divide the data set into input data and output data, and to identify relationships between data points by calculating distances in the input space and the output space. The specific steps are as follows:

1. **Define Distance Metrics**: Select distance measures under which semantically similar data points have a small distance in the input space and the output space, while dissimilar data points have a large distance.
2. **Calculate Distances**: Calculate the input distance \( d_{RI} \) and the output distance \( d_{RO} \) for all pairs of data points.
3. **Set Thresholds**: Set the input distance threshold \( \delta_{in} \) and the output distance threshold \( \delta_{out} \) according to the data quality attributes and data types.
4. **Classify Pairs of Data Points**: Assign each pair of data points \( dc \in D^2 \) to one of four categories by comparing its distances with the thresholds:
   - \( ECS_{EE}(D) = \{ dc \mid dc \in D^2 \wedge d_{RI}(dc) \leq \delta_{in} \wedge d_{RO}(dc) \leq \delta_{out} \} \)
   - \( ECS_{EU}(D) = \{ dc \mid dc \in D^2 \wedge d_{RI}(dc) \leq \delta_{in} \wedge d_{RO}(dc) > \delta_{out} \} \)
   - \( ECS_{UE}(D) = \{ dc \mid dc \in D^2 \wedge d_{RI}(dc) > \delta_{in} \wedge d_{RO}(dc) \leq \delta_{out} \} \)
   - \( ECS_{UU} \)
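
The four steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `classify_pairs`, the Euclidean distance, the toy data, and the threshold values are all hypothetical choices. The fourth category is not spelled out in the text above; the sketch assumes it is the remaining case (both distances above their thresholds). For brevity it enumerates unordered pairs of distinct points rather than the full set \( D^2 \).

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equally sized coordinate tuples."""
    return math.dist(a, b)


def classify_pairs(inputs, outputs, delta_in, delta_out,
                   d_in=euclidean, d_out=euclidean):
    """Sort index pairs (i, j), i < j, into the four ECS categories.

    First letter: E if the input distance is <= delta_in, else U.
    Second letter: E if the output distance is <= delta_out, else U.
    The "UU" case is an assumption inferred from the other three.
    """
    ecs = {"EE": [], "EU": [], "UE": [], "UU": []}
    for i, j in combinations(range(len(inputs)), 2):
        in_close = d_in(inputs[i], inputs[j]) <= delta_in
        out_close = d_out(outputs[i], outputs[j]) <= delta_out
        key = ("E" if in_close else "U") + ("E" if out_close else "U")
        ecs[key].append((i, j))
    return ecs


# Toy data (hypothetical): 2-D inputs with 1-D outputs.
inputs = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (0.0, 0.1)]
outputs = [(0.0,), (0.0,), (1.0,), (1.0,)]

result = classify_pairs(inputs, outputs, delta_in=0.5, delta_out=0.5)
for name, pairs in result.items():
    print(name, pairs)
```

In this toy data, point 3 lies close to points 0 and 1 in the input space but has a different output, so the pairs `(0, 3)` and `(1, 3)` fall into the EU category. Passing `d_in` and `d_out` as parameters reflects step 1: the distance measures are chosen per data set, and Euclidean distance is only a placeholder here.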