Abstract: Feature selection is an important topic in data mining and machine learning; it aims to select an optimal feature subset for building effective and explainable prediction models. This article introduces the Rough Hypercuboid based Distributed Online Feature Selection (RHDOFS) method to tackle two critical challenges associated with Big Data: Volume and Velocity. By exploring the class separability in the boundary region of the rough hypercuboid approach, a novel integrated feature evaluation criterion is proposed that examines not only the explicit patterns contained in the positive region but also the useful implicit patterns derived from the boundary region. An efficient online feature selection method for the streaming feature scenario is developed to identify relevant and nonredundant features in an incremental, iterative fashion. Furthermore, a parallel optimization mechanism that combines data and computational independence is employed to accelerate the original sequential implementation. An efficient distributed online feature selection algorithm is presented and implemented on the Apache Spark platform to scale to massive amounts of data by exploiting the computational capabilities of multicore clusters. Extensive experiments indicate that the proposed algorithm outperforms relevant and representative online feature selection algorithms. Empirical tests on scalability and extensibility also demonstrate that our distributed implementation significantly reduces computation time while maintaining prediction accuracy, and that it scales well with both the volume of data and the number of computing nodes.
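
To make the streaming feature scenario concrete, the following is a minimal single-machine sketch of an accept-then-prune loop. It is not the RHDOFS algorithm: it assumes discrete features and swaps in the classic rough-set positive-region dependency as the evaluation criterion, so it ignores the boundary-region patterns and the Spark parallelization described in the abstract. The function names `dependency` and `stream_select` are hypothetical.

```python
from collections import defaultdict


def dependency(columns, labels):
    """Classic rough-set dependency (an assumption, not the paper's criterion):
    the fraction of samples whose feature-value tuple maps to a single class."""
    if not columns:
        # With no features, all samples fall into one equivalence class.
        return 1.0 if len(set(labels)) <= 1 else 0.0
    classes_seen = defaultdict(set)
    counts = defaultdict(int)
    for row, y in zip(zip(*columns), labels):
        classes_seen[row].add(y)
        counts[row] += 1
    positive = sum(counts[r] for r, ys in classes_seen.items() if len(ys) == 1)
    return positive / len(labels)


def stream_select(feature_stream, labels):
    """Process (name, column) pairs one at a time, as features arrive."""
    selected = {}  # feature name -> column of discrete values
    current = dependency([], labels)
    for name, column in feature_stream:
        candidate = dependency(list(selected.values()) + [column], labels)
        if candidate <= current:
            continue  # relevance check: the new feature adds nothing, discard it
        selected[name] = column
        current = candidate
        # Redundancy check: drop earlier features that are no longer needed.
        for old in list(selected):
            if old == name:
                continue
            rest = [c for n, c in selected.items() if n != old]
            if dependency(rest, labels) >= current:
                del selected[old]
    return list(selected), current
```

Calling `stream_select` with an iterable of `(name, column)` pairs and a label list returns the surviving feature names together with the dependency they achieve. Each arriving feature is evaluated once against the current subset, which is the incremental behavior the abstract refers to.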

Unsupervised Feature Selection on Data Streams

A Survey on Online Feature Selection with Streaming Features

Online Scalable Streaming Feature Selection Via Dynamic Decision

K-Means Clustering with Feature Selection for Stream Data

RHDOFS: A Distributed Online Algorithm Towards Scalable Streaming Feature Selection

Unsupervised Multiview Feature Selection

Online Unsupervised Multi-view Feature Selection

Feature Interaction for Streaming Feature Selection

Online Group Feature Selection from Feature Streams

Feature Selection on Data Stream Via Multi-Cluster Structure Preservation

Streaming Feature Selection Via Graph Diffusion

Group Feature Selection With Streaming Features

Online feature selection for multi-source streaming features

Online early terminated streaming feature selection based on Rough Set theory

Online Feature Selection with Capricious Streaming Features: A General Framework

Online Feature Selection for Streaming Features with High Redundancy Using Sliding-Window Sampling

A Streaming Feature Selection Method Based on Dynamic Feature Clustering and Particle Swarm Optimization

Semi-supervised Incremental Feature Extraction Algorithm for Large-Scale Data Stream

Online Heterogeneous Streaming Feature Selection without Feature Type Information

Online Feature Selection for High-Dimensional Class-Imbalanced Data

Large-Scale Online Feature Selection for Ultra-High Dimensional Sparse Data