ReliefSD: Selecting Numerical Features for Fast Subgroup Discovery

Zhenfeng He, Yin Zhang
DOI: https://doi.org/10.23919/ccc55666.2022.9902132
2022-01-01
Abstract: Subgroup discovery (SD) identifies disproportionately distributed subsets of a dataset according to a target concept. Numerical features are often discretized before SD to avoid generating too many interval-based patterns and aggravating the “pattern flooding” problem. However, early discretization greatly reduces the quality of subgroups. Adding even a few features, especially numerical ones, often sharply prolongs the running time of SD, so removing irrelevant features may be a better choice. FSSD, a recently proposed non-discretization SD approach for numerical features, uses an empirical method to select a subset of features. That method ignores the labelling information, so it cannot remove irrelevant features effectively. This paper analyses Relief-based feature selection for SD and suggests using interval-based local subgroups to evaluate the discrimination ability of a feature. It presents ReliefSD, a novel feature selection method for SD obtained by modifying ReliefF. Because interesting subgroups contain many positive instances, ReliefSD samples only positive instances. For each feature, ReliefSD then constructs a single-feature local subgroup whose boundary is defined by the randomly selected instance and its neighbouring positive instances. By evaluating the purity of these subgroups, ReliefSD iteratively estimates the importance of features. Experimental results on 10 UCI datasets suggest that ReliefSD selects better feature subsets for FSSD than either the empirical method or ReliefF.
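The weighting loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the distance measure, the purity formula, or the update rule, so the sketch assumes min-max scaling, Manhattan distance, purity measured as the share of positives inside the interval, and a plain average over iterations. The function name `reliefsd_sketch` and the parameters `n_iter`, `k`, and `seed` are hypothetical.

```python
import numpy as np

def reliefsd_sketch(X, y, n_iter=50, k=5, seed=0):
    """Score features by the purity of single-feature local subgroups.

    Illustrative only: scaling, distance, purity and averaging are
    assumptions, not details taken from the paper.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y).astype(bool)
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(y)
    if pos_idx.size < 2:
        raise ValueError("need at least two positive instances")
    k = min(k, pos_idx.size - 1)

    # Assumption: min-max scale so features contribute equally to distances.
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    Xs = (X - X.min(axis=0)) / span

    weights = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # Only positive instances are sampled.
        i = rng.choice(pos_idx)
        others = pos_idx[pos_idx != i]
        # Assumption: Manhattan distance to find the k nearest positives.
        dist = np.abs(Xs[others] - Xs[i]).sum(axis=1)
        members = np.append(others[np.argsort(dist)[:k]], i)
        for f in range(X.shape[1]):
            # Interval on feature f bounded by the instance and its neighbours.
            lo, hi = Xs[members, f].min(), Xs[members, f].max()
            covered = (Xs[:, f] >= lo) & (Xs[:, f] <= hi)
            # Assumption: purity is the fraction of covered instances
            # that are positive.
            weights[f] += y[covered].mean()
    return weights / n_iter
```

Under these assumptions, a feature whose positive instances occupy a range free of negatives receives a weight near 1, while a noise feature drifts toward the overall positive rate. The top-ranked features would then be passed to the SD algorithm, FSSD in the paper's experiments.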