A Distributed Integrated Feature Selection Scheme for Column Subset Selection

Zheng Xiao, PengCheng Wei, Anthony Theodore Chronopoulos, Anne C. C. Elster
DOI: https://doi.org/10.1109/tkde.2021.3108146
IF: 9.235
2021-01-01
IEEE Transactions on Knowledge and Data Engineering
Abstract: Emerging computer applications often encounter huge volumes of data that must be stored and processed in a distributed way. Most existing distributed feature selection schemes neglect the quality of the subsets mapped to the computational nodes, which wastes time and hardware resources. In this paper, we propose a distributed integrated feature selection scheme (DIFS) with Subset Quality Evaluation (SQE). SQE studies the relationship between the quality of a subset and the number of features selected from it, which helps shorten the feature selection time. The feature selection algorithms used in our method and the evaluation metric used in SQE are integrable. We then give an implementation of our scheme for the Column Subset Selection (CSS) problem. More specifically, we integrate a CSS algorithm into DIFS and use information entropy as the SQE metric. Theoretically, we prove that the speedup of DIFS can reach $m^3$ compared to the centralized algorithm in ideal situations, where $m$ is the number of computational nodes, and we give a well-bounded approximation guarantee for the solution generated by the scheme for the CSS problem. Extensive experiments on eight data sets verify the performance of the scheme. Experimental results demonstrate the effectiveness of SQE and the impressive speedup DIFS can achieve, although the reconstruction error increases slightly in some situations. Additional experiments on classification tasks reveal that DIFS performs better than existing state-of-the-art distributed algorithms.
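As a rough illustration of the entropy metric named in the abstract, the sketch below computes the Shannon entropy of a single feature column from its empirical value distribution. The abstract does not say how entropy is aggregated over a subset of columns or how continuous features are handled, so the function name `shannon_entropy` and the assumption of discrete (or pre-discretized) values are illustrative only and are not taken from the paper.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`.

    Assumption: `values` is a non-empty sequence of discrete, hashable items
    (for example, a feature column that has already been discretized).
    """
    counts = Counter(values)
    n = len(values)
    # H = -sum_i p_i * log2(p_i), with p_i the relative frequency of each value
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A constant column carries no information; a balanced binary column carries 1 bit.
constant_column = [1, 1, 1, 1]
balanced_column = [0, 0, 1, 1]
```

Under this reading, a column whose values are spread more evenly gets a higher score. Whether a higher-entropy subset should receive a larger share of the selected features is a design decision of the scheme that the abstract does not spell out.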