Abstract: This paper introduces a new data analysis method for big data using a newly defined regression model named multiple-model linear regression (MMLR), which separates input datasets into subsets and constructs local linear regression models of them. The proposed data analysis method is shown to be more efficient and flexible than other regression-based methods. This paper also proposes an approximate algorithm to construct MMLR models based on an $(\epsilon,\delta)$-estimator, and gives mathematical proofs of the correctness and efficiency of the MMLR algorithm, whose time complexity is linear with respect to the size of input datasets. This paper also empirically implements the method on both synthetic and real-world datasets. The algorithm is shown to have comparable performance to existing regression methods in many cases, while it takes almost the shortest time to provide a high prediction accuracy.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to address a key issue faced by linear regression models in big data analysis: how to efficiently construct multiple linear regression models to handle large datasets with different predictor-response variable relationships (DPRVR). Specifically, the paper proposes a new method called Multiple-Model Linear Regression (MMLR), which divides the input dataset into multiple subsets and builds local linear regression models on each subset. Compared to existing single-model or piecewise regression methods, the MMLR method not only improves prediction accuracy but also significantly reduces computational time complexity.

### Background and Challenges

1. **Diversity of Big Data**: Large datasets typically contain multiple different subsets, each of which may be suitable for different regression models. This phenomenon is known as diverse predictor-response variable relationships (DPRVR).
2. **Limitations of Existing Methods**:
   - **High Time Complexity**: Existing multi-model regression algorithms (such as piecewise linear regression) have high time complexity, making them difficult to apply to large-scale datasets.
   - **Subset Shape Restrictions**: Existing methods require subsets to be hypercubes or generated by hyperplanes, which limits their applicability.
   - **Need for Prior Knowledge**: Some methods require prior knowledge that is difficult to obtain.

### Solution

1. **MMLR Algorithm**: The paper proposes a new multi-model linear regression algorithm (MMLR), which is implemented through the following steps:
   - **Preprocessing**: Perform initial linear regression modeling on the entire dataset. If the model is sufficiently accurate, return the result directly.
   - **Pre-modeling**: Select a small region, sample data points from it, and build a local linear regression model.
   - **Testing**: Calculate the model's fit boundary and check if all data points not yet assigned to existing models conform to this model.
   - **Iteration**: Repeat the above steps until the number of remaining data points is less than a certain threshold or the maximum number of models is reached.
2. **Time Complexity**: The time complexity of the MMLR algorithm is $O(m(n + (k/\epsilon)^2 + k^3))$, where $m$ is the number of models, $n$ is the number of data points, $k$ is the feature dimension, and $\epsilon$ is the user-specified maximum error limit. This is significantly lower than the time complexity $O(k^2 n^5)$ of existing methods.
3. **Mathematical Proof**: The paper provides mathematical proof of the correctness and efficiency of the MMLR algorithm, including error bounds and time complexity analysis.

### Experimental Validation

The paper conducts experiments on synthetic and real-world datasets, showing that the MMLR algorithm has comparable predictive performance to existing regression methods in many cases, but with significantly reduced computation time.

### Conclusion

The MMLR method offers high interpretability, high prediction accuracy, and high model construction efficiency when dealing with large datasets. Particularly in low-dimensional cases, its time complexity is lower than that of existing piecewise regression methods. Future work directions include exploring other parametric models, improving subset selection algorithms, and methods for handling high-dimensional datasets.
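### Illustrative Sketch

The four steps listed under Solution (preprocessing, pre-modeling, testing, iteration) can be made concrete with a small Python sketch. This is not the paper's algorithm: it assumes a single feature ($k = 1$), replaces the $(\epsilon,\delta)$-estimator and fit boundary with a fixed residual tolerance `tol`, takes the "small region" to be the nearest neighbours of a randomly chosen point, and rejects a local model that explains fewer points than it was fitted on. The function names, parameters, and the rejection rule are all hypothetical choices made for illustration.

```python
import random


def fit_line(points):
    """Least-squares fit of y = a*x + b to a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:  # all x equal: fall back to a horizontal line
        return 0.0, mean_y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fits(model, point, tol):
    """True if the point lies within tol of the model's prediction."""
    slope, intercept = model
    x, y = point
    return abs(y - (slope * x + intercept)) <= tol


def mmlr_sketch(points, tol=0.5, region_size=10, min_remaining=5,
                max_models=10, max_attempts=100, seed=0):
    """Return (models, leftover); each model is (slope, intercept, members)."""
    rng = random.Random(seed)

    # Preprocessing: try a single model on the whole dataset first.
    model = fit_line(points)
    if all(fits(model, p, tol) for p in points):
        return [(model[0], model[1], list(points))], []

    models, remaining, attempts = [], list(points), 0
    # Iteration: stop when few points remain or enough models were built
    # (max_attempts is an extra safeguard assumed here, not from the paper).
    while (len(remaining) >= min_remaining and len(models) < max_models
           and attempts < max_attempts):
        attempts += 1
        # Pre-modeling: pick a small region around a random point and fit it.
        centre = rng.choice(remaining)
        region = sorted(remaining,
                        key=lambda p: abs(p[0] - centre[0]))[:region_size]
        local = fit_line(region)
        # Testing: which still-unassigned points does the local model explain?
        members = [p for p in remaining if fits(local, p, tol)]
        if len(members) < len(region):
            continue  # assumed rule: region probably straddled two regimes
        models.append((local[0], local[1], members))
        remaining = [p for p in remaining if not fits(local, p, tol)]
    return models, remaining


# Two noise-free regimes: y = 2x on [0, 50) and y = 30 - x on [50, 100).
data = [(x, 2.0 * x) for x in range(50)] + [(x, 30.0 - x) for x in range(50, 100)]
models, leftover = mmlr_sketch(data)
for slope, intercept, members in models:
    print(f"y = {slope:.2f}x + {intercept:.2f} covers {len(members)} points")
```

Each pass of the loop scans the remaining points a constant number of times apart from the neighbour sort, which is why a per-model cost roughly proportional to $n$ is plausible; the sort is a shortcut of this sketch and would need replacing to match the linear bound quoted above.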

An Efficient Data Analysis Method for Big Data using Multiple-Model Linear Regression
