Abstract: For massive data sets, efficient computation commonly relies on distributed algorithms that store and process subsets of the data on different machines, minimizing communication costs. Our focus is on regression and classification problems involving many features. A variety of distributed algorithms have been proposed in this context, but challenges arise in defining an algorithm with low communication, theoretical guarantees and excellent practical performance in general settings. We propose a MEdian Selection Subset AGgregation Estimator (message) algorithm, which attempts to solve these problems. The algorithm applies feature selection in parallel for each subset using Lasso or another method, calculates the 'median' feature inclusion index, estimates coefficients for the selected features in parallel for each subset, and then averages these estimates. The algorithm is simple, involves minimal communication, scales efficiently in both sample and feature size, and has theoretical guarantees. In particular, we show model selection consistency and coefficient estimation efficiency. Extensive experiments show excellent performance in variable selection, estimation, prediction, and computation time relative to usual competitors.
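
The four steps in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes a least-squares model on centered data with no intercept, and it uses a small hand-written coordinate-descent Lasso as the selector. The names `lasso_cd` and `message_fit`, the penalty `lam`, and the tie rule for an even number of subsets are all assumptions made for this example. The subsets are processed in a serial loop here, whereas in a real deployment each subset would live on its own machine.

```python
import numpy as np


def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate-descent Lasso for (1/2n)||y - Xb||^2 + lam * ||b||_1.

    Illustrative stand-in for "Lasso or another method"; any selector
    that returns a sparse coefficient vector could be used instead.
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = np.asarray(y, dtype=float).copy()
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the partial residual.
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            # Soft-thresholding update for coordinate j.
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta


def message_fit(X, y, n_subsets=5, lam=0.1):
    """Sketch of median selection followed by subset-averaged estimation."""
    n, p = X.shape
    y = np.asarray(y, dtype=float)
    blocks = np.array_split(np.arange(n), n_subsets)

    # Step 1: feature selection on each subset (parallelizable).
    inclusion = np.array(
        [lasso_cd(X[idx], y[idx], lam) != 0 for idx in blocks]
    )

    # Step 2: median inclusion index across subsets.
    # Assumption: with an even number of subsets, a 0.5 median counts as selected.
    selected = np.median(inclusion.astype(float), axis=0) >= 0.5

    beta = np.zeros(p)
    if not selected.any():
        return selected, beta

    # Step 3: least-squares fit on the selected features, per subset.
    coefs = []
    for idx in blocks:
        b, *_ = np.linalg.lstsq(X[idx][:, selected], y[idx], rcond=None)
        coefs.append(b)

    # Step 4: average the per-subset estimates.
    beta[selected] = np.mean(coefs, axis=0)
    return selected, beta
```

Only two small objects cross machine boundaries in this sketch: one binary inclusion vector per subset for the median step, and one coefficient vector per subset for the averaging step.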

What problem does this paper attempt to address?

This paper addresses efficient computation on large-scale datasets, particularly regression and classification problems involving many features. Specifically, it asks how to design a distributed algorithm that reduces communication costs while maintaining or improving statistical accuracy, so that performance is comparable to, or even better than, applying the same method to the entire dataset at once.

### Main Problems in the Paper

1. **Efficient Computation on Large-Scale Datasets**:
   - Processing large-scale datasets usually depends on distributed algorithms, which divide the data into subsets, store them on different machines, and process them in parallel while minimizing communication costs.
   - Existing distributed algorithms struggle to combine low communication costs, theoretical guarantees, and good performance in general settings.
2. **Feature Selection and Parameter Estimation**:
   - Regression and classification problems usually involve two important components: feature selection and parameter estimation.
   - Current methods for combining subset results perform poorly on feature selection and parameter estimation, so a new method is needed that handles both tasks well.

### Solution

The paper proposes the **MEdian Selection Subset AGgregation Estimator (MESSAGE)** algorithm to solve the above problems. The specific steps are as follows:

1. **Parallel Feature Selection**:
   - Apply Lasso or another feature selection method to each subset in parallel, then calculate the "median" feature inclusion index across subsets.
2. **Parallel Parameter Estimation**:
   - Estimate the coefficients for the selected features on each subset in parallel, then average these estimates.

### Advantages of the Algorithm

- **Low Communication Costs**: The MESSAGE algorithm involves very little communication; subsets exchange information only to combine their feature inclusion indices and to average their coefficient estimates.
- **Theoretical Guarantees**: The paper proves model selection consistency and coefficient estimation efficiency.
- **Practical Performance**: Extensive experiments show that MESSAGE performs well in variable selection, estimation, prediction, and computation time relative to the usual competitors.

### Experimental Verification

The paper reports extensive experiments on synthetic datasets and real-world datasets (such as power consumption data and HIGGS classification data) to verify the performance of the MESSAGE algorithm in different scenarios.

### Conclusion

When processing large-scale datasets, the MESSAGE algorithm can significantly reduce computation time while matching or exceeding the performance of full-dataset analysis in feature selection and parameter estimation. This makes MESSAGE an efficient and practical distributed computing method.

Median Selection Subset Aggregation for Parallel Inference
