Distributed Logistic Regression for Massive Data with Rare Events

Xuetong Li,Xuening Zhu,Hansheng Wang
DOI: https://doi.org/10.48550/arXiv.2304.02269
2023-04-05
Abstract: Large-scale rare events data are commonly encountered in practice. To tackle the massive rare events data, we propose a novel distributed estimation method for logistic regression in a distributed system. For a distributed framework, we face the following two challenges. The first challenge is how to distribute the data. In this regard, two different distribution strategies (i.e., the RANDOM strategy and the COPY strategy) are investigated. The second challenge is how to select an appropriate type of objective function so that the best asymptotic efficiency can be achieved. Then, the under-sampled (US) and inverse probability weighted (IPW) types of objective functions are considered. Our results suggest that the COPY strategy together with the IPW objective function is the best solution for distributed logistic regression with rare events. The finite sample performance of the distributed methods is demonstrated by simulation studies and a real-world Sweden Traffic Sign dataset.
Methodology
What problem does this paper attempt to address?
This paper addresses how to fit a logistic regression efficiently when events are rare in large-scale data. It focuses on two challenges that arise when such data are processed in a distributed system:

1. **Data distribution strategy**: How should the data be allocated across the nodes of a distributed system? The paper studies two strategies: random distribution (RANDOM) and copying the positive samples (COPY). Under RANDOM, each node may receive very few positive samples, which hurts the accuracy of statistical estimation. Under COPY, all positive samples are copied to every node, so each node sees a better-balanced dataset and estimation efficiency improves.
2. **Selection of objective function**: Which objective function achieves the best asymptotic efficiency? The paper considers two types: the under-sampled (US) objective function and the inverse probability weighted (IPW) objective function. The US objective function is used on local computers but is biased, which results in low statistical efficiency. The IPW objective function corrects this bias by weighting and can achieve the same asymptotic distribution as the global maximum likelihood estimator (GMLE), which improves statistical efficiency.

In summary, the paper proposes a new distributed logistic regression method that combines the COPY data distribution strategy with the IPW objective function to achieve efficient and accurate parameter estimation on large-scale rare-events data. The method addresses the efficiency problem that traditional methods face with large-scale data, and the paper supports its statistical performance through theoretical analysis and numerical experiments.
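
To make the COPY-plus-IPW idea concrete, here is a minimal single-machine sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: the function names (`fit_weighted_logistic`, `copy_ipw_estimate`), the simulated data, the Newton solver, and the simple averaging of local estimates are all hypothetical choices made for this example. It assumes that negatives are split evenly at random across `n_workers` nodes, so each negative reaches a given node with probability `1 / n_workers` and receives inverse probability weight `n_workers`, while every positive is copied to all nodes with weight 1.

```python
import numpy as np

def fit_weighted_logistic(X, y, w, n_iter=50, tol=1e-10):
    """Maximise sum_i w_i * [y_i * eta_i - log(1 + exp(eta_i))] by Newton's method."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))            # fitted probabilities
        grad = X.T @ (w * (y - p))                       # weighted score
        hess = X.T @ (X * (w * p * (1.0 - p))[:, None])  # weighted information
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def copy_ipw_estimate(X, y, n_workers, rng):
    """COPY: every worker gets all positives plus a random share of negatives.
    IPW: negatives are weighted by the inverse of their inclusion probability."""
    pos = np.flatnonzero(y == 1)
    neg = rng.permutation(np.flatnonzero(y == 0))
    local_estimates = []
    for neg_part in np.array_split(neg, n_workers):
        idx = np.concatenate([pos, neg_part])
        # Assumption: inclusion probability of a negative on one worker is 1 / n_workers.
        w = np.where(y[idx] == 1, 1.0, float(n_workers))
        local_estimates.append(fit_weighted_logistic(X[idx], y[idx], w))
    # Assumption: local estimates are combined by a plain average.
    return np.mean(local_estimates, axis=0)

# Simulated rare-events data (hypothetical parameters; the intercept of -6 makes positives rare).
rng = np.random.default_rng(0)
n, true_beta = 200_000, np.array([-6.0, 1.0, -0.5])
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ true_beta)))).astype(float)

beta_copy_ipw = copy_ipw_estimate(X, y, n_workers=5, rng=rng)
beta_global = fit_weighted_logistic(X, y, np.ones(n))  # full-data MLE for comparison

print("COPY + IPW :", beta_copy_ipw)
print("global MLE :", beta_global)
```

Setting all weights to 1 inside `copy_ipw_estimate` would give an unweighted fit on each worker's under-sampled data, whose intercept is shifted because negatives are under-represented locally. Comparing that variant with the weighted one is a simple way to see what the weighting corrects.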