Mining Incomplete Data Using Global and Saturated Probabilistic Approximations Based on Characteristic Sets and Maximal Consistent Blocks

Patrick G. Clark, Jerzy W. Grzymala-Busse, Zdzislaw S. Hippe, Teresa Mroczek
DOI: https://doi.org/10.1016/j.ins.2024.120287
IF: 8.1
2024-02-05
Information Sciences
Abstract: In this paper, we discuss a rough set approach to missing attribute values. Among the many ways of interpreting missing values, we focus on two: lost values and "do not care" conditions. Using these interpretations, global and saturated probabilistic approximations are constructed with two types of granules: characteristic sets and maximal consistent blocks. We compare eight approaches, combining two interpretations of missing attribute values, two types of probabilistic approximations, and two types of granules, using an error rate computed by ten-fold cross-validation. At the 5% level of statistical significance, we present the experimental results for these eight approaches, showing statistically significant differences between all approaches to mining incomplete data. The results also show that no single approach is best for every data set, so all eight approaches should be attempted. The final section of the paper presents the idea of concept-compatible data sets. We show that for these types of data sets, global and saturated probabilistic approximations for a concept are identical to the concept. We also show that an incomplete data set with no duplicate rows is concept-compatible under the lost-value interpretation of missing attribute values.
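To make the granules mentioned in the abstract more concrete, here is a minimal Python sketch of characteristic sets under the two interpretations. It assumes the definitions commonly used in the rough-set literature on incomplete data, which the abstract itself does not spell out: a lost value (written `?` here) keeps a case out of every block of that attribute, a "do not care" condition (written `*` here) puts the case into every block of that attribute, and the characteristic set of a case is the intersection of the blocks for its specified values. The table, attribute names, and function names are hypothetical illustrations. The sketch does not implement the paper's global or saturated probabilistic approximations; it only computes the conditional probability of a concept given a granule, the quantity that probabilistic approximations compare against a threshold.

```python
from typing import Dict, Hashable, List, Set

LOST = "?"        # assumed marker for a lost value
DONT_CARE = "*"   # assumed marker for a "do not care" condition

Table = Dict[Hashable, Dict[str, str]]


def attribute_value_blocks(table: Table, attribute: str) -> Dict[str, Set[Hashable]]:
    """Map each specified value v of `attribute` to the block [(attribute, v)]."""
    values = {row[attribute] for row in table.values()} - {LOST, DONT_CARE}
    blocks: Dict[str, Set[Hashable]] = {v: set() for v in values}
    for case, row in table.items():
        cell = row[attribute]
        if cell == DONT_CARE:
            # A "do not care" case belongs to every block of this attribute.
            for v in values:
                blocks[v].add(case)
        elif cell != LOST:
            # A specified value belongs to exactly one block.
            blocks[cell].add(case)
        # A lost value belongs to no block of this attribute.
    return blocks


def characteristic_set(table: Table, attributes: List[str], case: Hashable) -> Set[Hashable]:
    """Intersect the blocks of the case's specified values; missing values impose no constraint."""
    result: Set[Hashable] = set(table)
    for a in attributes:
        cell = table[case][a]
        if cell not in (LOST, DONT_CARE):
            result &= attribute_value_blocks(table, a)[cell]
    return result


def conditional_probability(concept: Set[Hashable], granule: Set[Hashable]) -> float:
    """Pr(concept | granule) = |concept & granule| / |granule| for a non-empty granule."""
    return len(concept & granule) / len(granule)


# Hypothetical four-case table with one lost value and one "do not care" condition.
table: Table = {
    1: {"temperature": "high", "headache": "yes"},
    2: {"temperature": LOST, "headache": "yes"},
    3: {"temperature": DONT_CARE, "headache": "no"},
    4: {"temperature": "normal", "headache": "no"},
}
attributes = ["temperature", "headache"]
characteristic_sets = {c: characteristic_set(table, attributes, c) for c in table}
```

In this toy table, case 3 (a "do not care" temperature) lands in both temperature blocks, while case 2 (a lost temperature) lands in neither, so the two interpretations produce different granules from otherwise similar rows.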
computer science, information systems
What problem does this paper attempt to address?