Abstract: The rapid growth of emerging information technologies and application patterns in modern society, e.g., the Internet, the Internet of Things, Cloud Computing and Tri-network Convergence, has ushered in the era of big data. Big data holds great value; however, mining knowledge from it is a tremendously challenging task because of data uncertainty and inconsistency. Attribute reduction (also known as feature selection) not only serves as an effective preprocessing step but also exploits data redundancy to reduce uncertainty. However, existing solutions are designed either 1) for a single machine, which means that the entire data must fit in main memory and parallelism is limited; or 2) for the Hadoop platform, which means that the data have to be loaded into distributed memory repeatedly, making them inefficient. In this paper, we overcome these shortcomings to maximize efficiency and propose a unified framework for Parallel Large-scale Attribute Reduction, termed PLAR, for big data analysis. PLAR consists of three components: 1) Granular Computing (GrC)-based initialization: it converts a decision table (i.e., the original data representation) into a granularity representation, which reduces the space required and can therefore be easily cached in distributed memory; 2) model-parallelism: it evaluates all feature candidates simultaneously, making attribute reduction highly parallelizable; 3) data-parallelism: it computes the significance of an attribute in parallel in a MapReduce style. We implement PLAR with four representative heuristic feature selection algorithms on Spark and evaluate them on various huge datasets, including UCI and astronomical datasets; the results show the advantages of our method over existing solutions.
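The three components named in the abstract can be illustrated with a small single-machine sketch. It is not the PLAR implementation: it runs in plain Python rather than on Spark, and it assumes a positive-region dependency measure as the attribute significance, whereas the abstract does not name the four heuristics used. The function names (`to_granules`, `positive_region_size`, `greedy_reduct`) and the toy decision table are hypothetical.

```python
from collections import Counter, defaultdict

def to_granules(rows):
    """GrC-style initialization (sketch): collapse identical rows of the
    decision table into (condition values, decision) -> object count."""
    return Counter((tuple(r[:-1]), r[-1]) for r in rows)

def positive_region_size(granules, attrs):
    """Significance measure assumed here: number of objects whose
    equivalence class under `attrs` carries a single decision value."""
    classes = defaultdict(Counter)
    # "Map": project each granule onto the chosen attributes.
    # "Reduce": merge decision counts that share the same projected key.
    for (cond, dec), n in granules.items():
        key = tuple(cond[a] for a in attrs)
        classes[key][dec] += n
    return sum(sum(c.values()) for c in classes.values() if len(c) == 1)

def greedy_reduct(granules, n_attrs):
    """Forward greedy selection until the selected attributes preserve
    the positive region of the full attribute set."""
    target = positive_region_size(granules, list(range(n_attrs)))
    selected = []
    while positive_region_size(granules, selected) < target:
        candidates = [a for a in range(n_attrs) if a not in selected]
        # Each candidate is scored independently of the others; this is
        # the step a model-parallel system could evaluate simultaneously.
        scores = {a: positive_region_size(granules, selected + [a])
                  for a in candidates}
        # Ties are broken in favour of the lowest attribute index.
        best = max(candidates, key=lambda a: (scores[a], -a))
        selected.append(best)
    return selected

# Toy decision table: three condition attributes, last column is the decision.
rows = [
    (0, 0, 1, "y"),
    (0, 0, 1, "y"),
    (0, 1, 0, "n"),
    (1, 0, 0, "n"),
    (1, 1, 1, "y"),
]
granules = to_granules(rows)
reduct = greedy_reduct(granules, 3)
```

In this toy table the two identical rows merge into one granule with count 2, and the third attribute alone determines the decision, so `reduct` is `[2]`. In a distributed setting the loop inside `positive_region_size` would correspond to a map followed by a reduce-by-key over cached granules, which is the part the abstract describes as data-parallelism.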

A Parallel Attribute Reduction Algorithm Based on Affinity Propagation Clustering

Distributed Affinity Propagation Clustering Based on MapReduce

Attribute Granulation Based on Attribute Discernibility and AP Algorithm

A Fast Parallel Attribute Reduction Algorithm Using Apache Spark

Parallel incremental efficient attribute reduction algorithm based on attribute tree

A Parallel Attribute Reduction Method Based on Classification

Adjustable Preference Affinity Propagation Clustering

Feature Clustering Dimensionality Reduction Based on Affinity Propagation

Parallel Large-Scale Attribute Reduction on Cloud Systems

Fast Clustering by Affinity Propagation Based on Density Peaks

A Density-Adaptive Affinity Propagation Clustering Algorithm Based on Spectral Dimension Reduction

A parallel rough set attribute reduction algorithm based on attribute frequency

Grouping Attributes: an Accelerator for Attribute Reduction Based on Similarity

A New Feature Weighted Affinity Propagation Clustering Algorithm

A Complete Attribute Reduction Algorithm Based on Improved FP Tree

K-AP Clustering Algorithm for Large Scale Dataset

Summary of Affinity Propagation

An improved affinity propagation clustering algorithm based on principal component analysis and variation coefficient

Fast Affinity Propagation Clustering Based on Incomplete Similarity Matrix

Affinity Propagation Clustering Algorithm Based on Large-Scale Data-Set

BiFuG2-Spark: Bi-directional Fuzzy Granular-Cabin Parallel Attribute Reduction Accelerator with Granular-Group Collaboration