Median Selection Subset Aggregation for Parallel Inference

Xiangyu Wang, Peichao Peng, David Dunson
DOI: https://doi.org/10.48550/arXiv.1410.6604
2014-10-24
Abstract: For massive data sets, efficient computation commonly relies on distributed algorithms that store and process subsets of the data on different machines, minimizing communication costs. Our focus is on regression and classification problems involving many features. A variety of distributed algorithms have been proposed in this context, but challenges arise in defining an algorithm with low communication, theoretical guarantees and excellent practical performance in general settings. We propose a MEdian Selection Subset AGgregation Estimator (message) algorithm, which attempts to solve these problems. The algorithm applies feature selection in parallel for each subset using Lasso or another method, calculates the 'median' feature inclusion index, estimates coefficients for the selected features in parallel for each subset, and then averages these estimates. The algorithm is simple, involves very minimal communication, scales efficiently in both sample and feature size, and has theoretical guarantees. In particular, we show model selection consistency and coefficient estimation efficiency. Extensive experiments show excellent performance in variable selection, estimation, prediction, and computation time relative to usual competitors.
Machine Learning; Distributed, Parallel, and Cluster Computing; Computation; Methodology
What problem does this paper attempt to address?
This paper addresses efficient computation on large-scale datasets, particularly for regression and classification problems involving many features. Specifically, it asks how to design a distributed algorithm that keeps communication costs low while maintaining or improving statistical accuracy, so that its performance is comparable to, or even better than, fitting the model to the entire dataset at once.

### Main Problems in the Paper

1. **Efficient Computation on Large-Scale Datasets**:
   - Processing large-scale datasets usually relies on distributed algorithms, which split the data into subsets, store them on different machines, and process them in parallel while minimizing communication costs.
   - Existing distributed algorithms struggle to combine low communication, theoretical guarantees, and good performance in general settings.
2. **Feature Selection and Parameter Estimation**:
   - Regression and classification problems usually involve two components: feature selection and parameter estimation.
   - Current methods for combining subset results perform poorly on both tasks, so a new method is needed that handles them together.

### Solution

The paper proposes the **MEdian Selection Subset AGgregation Estimator (MESSAGE)** algorithm to address these problems. It proceeds in two stages:

1. **Parallel Feature Selection**:
   - Apply Lasso or another feature selection method to each subset in parallel, then compute the "median" feature inclusion index across subsets.
2. **Parallel Parameter Estimation**:
   - Estimate the coefficients of the selected features on each subset in parallel, then average these estimates.

### Advantages of the Algorithm

- **Low Communication Costs**: MESSAGE involves very little communication. Machines exchange information only when the feature inclusion indices are combined and when the coefficient estimates are averaged.
- **Theoretical Guarantees**: The paper proves model selection consistency and coefficient estimation efficiency.
- **Practical Performance**: Extensive experiments show that MESSAGE performs well in variable selection, estimation, prediction, and computation time relative to the usual competitors.

### Experimental Verification

The paper reports extensive experiments on synthetic datasets and real-world datasets (such as power consumption data and HIGGS classification data) to assess the performance of MESSAGE in different scenarios.

### Conclusion

When processing large-scale datasets, MESSAGE can significantly reduce computation time while matching or exceeding the performance of a full-dataset analysis in feature selection and parameter estimation. This makes it an efficient and practical method for distributed inference.
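
The steps of the algorithm (select features on each subset, take the median inclusion index, refit on each subset, average) can be sketched in Python with NumPy. This is a minimal single-machine illustration and not the authors' implementation. Several choices are assumptions made only for the example: the helper names `lasso_cd` and `message`, the hand-written coordinate-descent Lasso, the least-squares refit, the rule that a feature is kept when its median inclusion indicator is at least 0.5, the penalty `lam=0.2`, and the simulated data.

```python
import numpy as np


def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate-descent Lasso for (1/2n)||y - Xb||^2 + lam * ||b||_1 (illustrative helper)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float).copy()  # residual y - X @ beta, kept up to date
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of feature j with the partial residual (feature j excluded).
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta


def message(X, y, n_subsets, lam):
    """Sketch of the four steps: select, take the median, refit, average."""
    parts = np.array_split(np.arange(X.shape[0]), n_subsets)
    # Step 1: feature selection on each subset (each fit could run on its own machine).
    inclusion = np.array([lasso_cd(X[i], y[i], lam) != 0 for i in parts])
    # Step 2: median inclusion index; the >= 0.5 tie rule is an assumption.
    selected = np.flatnonzero(np.median(inclusion.astype(float), axis=0) >= 0.5)
    # Step 3: refit the selected features on each subset (least squares assumed here).
    # Step 4: average the subset estimates.
    beta = np.zeros(X.shape[1])
    if selected.size:
        fits = [np.linalg.lstsq(X[i][:, selected], y[i], rcond=None)[0] for i in parts]
        beta[selected] = np.mean(fits, axis=0)
    return selected, beta


# Simulated data (hypothetical): 3 active features out of 20, split into 5 subsets.
rng = np.random.default_rng(0)
n, p = 2000, 20
X = rng.standard_normal((n, p))
true_beta = np.zeros(p)
true_beta[:3] = [2.0, -1.5, 1.0]
y = X @ true_beta + 0.5 * rng.standard_normal(n)

selected, beta_hat = message(X, y, n_subsets=5, lam=0.2)
print("selected features:", selected)
print("averaged estimates:", np.round(beta_hat[selected], 3))
```

In a real deployment each subset would live on a separate machine, and only the binary inclusion vectors and the fitted coefficients would be sent to the coordinator, which is where the low communication cost comes from. An odd number of subsets avoids ties in the median.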