An Efficient Data Analysis Method for Big Data using Multiple-Model Linear Regression

Bohan Lyu, Jianzhong Li
2023-08-24
Abstract: This paper introduces a new data analysis method for big data using a newly defined regression model named multiple-model linear regression (MMLR), which separates input datasets into subsets and constructs a local linear regression model for each subset. The proposed data analysis method is shown to be more efficient and flexible than other regression-based methods. This paper also proposes an approximate algorithm to construct MMLR models based on an $(\epsilon,\delta)$-estimator, and gives mathematical proofs of the correctness and efficiency of the MMLR algorithm, whose time complexity is linear in the size of the input dataset. The method is also evaluated empirically on both synthetic and real-world datasets: the algorithm performs comparably to existing regression methods in many cases, while taking almost the shortest time to reach high prediction accuracy.
Machine Learning, Databases
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper aims to address a key issue faced by linear regression models in big data analysis: how to efficiently construct multiple linear regression models to handle large datasets with different predictor-response variable relationships (DPRVR). Specifically, the paper proposes a new method called Multiple-Model Linear Regression (MMLR), which divides the input dataset into multiple subsets and builds local linear regression models on each subset. Compared to existing single-model or piecewise regression methods, the MMLR method not only improves prediction accuracy but also significantly reduces computational time complexity.

### Background and Challenges

1. **Diversity of Big Data**: Large datasets typically contain multiple different subsets, each of which may be suitable for different regression models. This phenomenon is known as diverse predictor-response variable relationships (DPRVR).
2. **Limitations of Existing Methods**:
   - **High Time Complexity**: Existing multi-model regression algorithms (such as piecewise linear regression) have high time complexity, making them difficult to apply to large-scale datasets.
   - **Subset Shape Restrictions**: Existing methods require subsets to be hypercubes or generated by hyperplanes, which limits their applicability.
   - **Need for Prior Knowledge**: Some methods require prior knowledge that is difficult to obtain.

### Solution

1. **MMLR Algorithm**: The paper proposes a new multi-model linear regression algorithm (MMLR), which is implemented through the following steps:
   - **Preprocessing**: Perform initial linear regression modeling on the entire dataset. If the model is sufficiently accurate, return the result directly.
   - **Pre-modeling**: Select a small region, sample data points from it, and build a local linear regression model.
   - **Testing**: Calculate the model's fit boundary and check if all data points not yet assigned to existing models conform to this model.
   - **Iteration**: Repeat the above steps until the number of remaining data points is less than a certain threshold or the maximum number of models is reached.
2. **Time Complexity**: The time complexity of the MMLR algorithm is \(O(m(n + (k/\epsilon)^2 + k^3))\), where \(m\) is the number of models, \(n\) is the number of data points, \(k\) is the feature dimension, and \(\epsilon\) is the user-specified maximum error limit. This is significantly lower than the time complexity \(O(k^2 n^5)\) of existing methods.
3. **Mathematical Proof**: The paper provides mathematical proof of the correctness and efficiency of the MMLR algorithm, including error bounds and time complexity analysis.

### Experimental Validation

The paper conducts experiments on synthetic and real-world datasets, showing that the MMLR algorithm has comparable predictive performance to existing regression methods in many cases, but with significantly reduced computation time.

### Conclusion

The MMLR method offers high interpretability, high prediction accuracy, and high model construction efficiency when dealing with large datasets. Particularly in low-dimensional cases, its time complexity is lower than that of existing piecewise regression methods. Future work directions include exploring other parametric models, improving subset selection algorithms, and methods for handling high-dimensional datasets.
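
### Illustrative Sketch

The four steps listed under Solution (preprocessing, pre-modeling, testing, iteration) can be illustrated with a minimal Python sketch. This is not the authors' implementation: it swaps the paper's \((\epsilon,\delta)\)-estimator for plain ordinary least squares on a small neighbourhood, treats "sufficiently accurate" as a maximum absolute residual below `tol`, and takes the "small region" to be the nearest unassigned neighbours of a random point. The names `mmlr_sketch`, `fit_ols`, `predict` and all parameters are hypothetical.

```python
import numpy as np


def fit_ols(X, y):
    """Ordinary least squares with an intercept; returns [b0, b1, ..., bk]."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta


def predict(beta, X):
    """Evaluate the linear model beta on the rows of X."""
    return beta[0] + X @ beta[1:]


def mmlr_sketch(X, y, tol=0.1, max_models=5, min_remaining=10,
                sample_size=20, max_attempts=50, seed=0):
    """Return (models, labels); labels[i] == -1 means point i is unassigned."""
    rng = np.random.default_rng(seed)
    n = len(X)

    # Preprocessing: try one global model first (assumed accuracy criterion).
    beta = fit_ols(X, y)
    if np.max(np.abs(predict(beta, X) - y)) <= tol:
        return [beta], np.zeros(n, dtype=int)

    models = []
    labels = np.full(n, -1)
    for _ in range(max_attempts):
        free = np.flatnonzero(labels == -1)
        # Iteration stops on too few remaining points or too many models.
        if len(models) >= max_models or len(free) < min_remaining:
            break

        # Pre-modeling: fit a local model on the unassigned points closest
        # to a randomly chosen unassigned point (assumed notion of "region").
        center = X[rng.choice(free)]
        dist = np.linalg.norm(X[free] - center, axis=1)
        region = free[np.argsort(dist)[:sample_size]]
        beta = fit_ols(X[region], y[region])

        # Testing: claim every unassigned point the local model fits within tol.
        resid = np.abs(predict(beta, X[free]) - y[free])
        fits = free[resid <= tol]
        if fits.size == 0:
            continue  # poor local fit (e.g. region straddles two subsets)
        labels[fits] = len(models)
        models.append(beta)
    return models, labels
```

Calling `mmlr_sketch(X, y)` with `X` of shape `(n, k)` returns one coefficient vector per local model and a label per data point. The sketch sorts distances for every model, so it does not reproduce the linear-time bound or the error guarantees that the paper proves for the actual algorithm.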