Abstract: Large-scale rare events data are commonly encountered in practice. To handle massive rare events data, we propose a novel distributed estimation method for logistic regression in a distributed system. A distributed framework poses two challenges. The first challenge is how to distribute the data. In this regard, two different distribution strategies (i.e., the RANDOM strategy and the COPY strategy) are investigated. The second challenge is how to select an appropriate type of objective function so that the best asymptotic efficiency can be achieved. To this end, the under-sampled (US) and inverse probability weighted (IPW) types of objective functions are considered. Our results suggest that the COPY strategy together with the IPW objective function is the best solution for distributed logistic regression with rare events. The finite sample performance of the distributed methods is demonstrated by simulation studies and a real-world Sweden Traffic Sign dataset.

What problem does this paper attempt to address?

This paper addresses how to carry out logistic regression effectively on large-scale data in which the event of interest is rare. It focuses on two challenges that arise when such data are processed in a distributed system:

1. **Data distribution strategy**: How to distribute the data among the nodes of a distributed system. The paper explores two strategies: random distribution (RANDOM) and copying positive samples (COPY). Under RANDOM, each node may receive too few positive samples, which harms the accuracy of statistical estimation. Under COPY, all positive samples are copied to every node, which balances the data across nodes and improves estimation efficiency.
2. **Selection of objective function**: How to select an objective function that achieves the best asymptotic efficiency. The paper considers two types: the under-sampled (US) objective function and the inverse probability weighted (IPW) objective function. The US objective function is applied on the local computers, but it is biased, which leads to low statistical efficiency. The IPW objective function corrects this bias through weighting and can achieve the same asymptotic distribution as the global maximum likelihood estimator (GMLE), thereby improving statistical efficiency.

In summary, the paper proposes a new distributed logistic regression method that combines the COPY data distribution strategy with the IPW objective function to achieve efficient and accurate parameter estimation on large-scale rare-event data. The method addresses the efficiency problem that traditional methods face on large-scale data, and its statistical performance is supported by both theoretical analysis and numerical experiments.
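
To make the COPY strategy and the IPW objective more concrete, here is a minimal Python sketch. It is an illustration built on assumptions, not the paper's implementation: it assumes negatives are split evenly at random across workers, that each negative is weighted by the number of workers (the inverse of its probability of landing on a given worker), and that the local estimates are combined by simple averaging. The function names `fit_weighted_logistic`, `copy_ipw_estimate`, and `simulate_rare_events` are hypothetical.

```python
import numpy as np


def fit_weighted_logistic(X, y, w, n_iter=50, tol=1e-10):
    """Maximize a weighted logistic log-likelihood with Newton's method."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (w * (y - p))
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


def copy_ipw_estimate(X, y, n_workers, rng):
    """COPY-style split with IPW-style local fits (assumed scheme, see lead-in)."""
    pos = np.flatnonzero(y == 1)
    neg = rng.permutation(np.flatnonzero(y == 0))
    local_betas = []
    for neg_chunk in np.array_split(neg, n_workers):
        # COPY: every worker receives all positives plus its own share of negatives.
        idx = np.concatenate([pos, neg_chunk])
        # IPW: a negative lands on this worker with probability 1 / n_workers,
        # so it is weighted by n_workers; positives keep weight 1.
        w = np.where(y[idx] == 1, 1.0, float(n_workers))
        local_betas.append(fit_weighted_logistic(X[idx], y[idx], w))
    # Assumption: combine the local estimates by simple averaging.
    return np.mean(local_betas, axis=0)


def simulate_rare_events(n, beta, rng):
    """Draw (X, y) from a logistic model with an intercept and one covariate."""
    X = np.column_stack([np.ones(n), rng.standard_normal(n)])
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    y = (rng.random(n) < p).astype(float)
    return X, y


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_beta = np.array([-4.0, 1.0])  # a strongly negative intercept makes events rare
    X, y = simulate_rare_events(20_000, true_beta, rng)
    print("positive rate:", y.mean())
    print("COPY + IPW   :", copy_ipw_estimate(X, y, n_workers=4, rng=rng))
    print("global MLE   :", fit_weighted_logistic(X, y, np.ones(len(y))))
```

Passing unit weights for every observation in the local fits would give an unweighted, US-style fit on the same split, which is one way to compare the two objective functions side by side.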

Distributed Logistic Regression for Massive Data with Rare Events

Distributed Bootstrap Simultaneous Inference for High-Dimensional Quantile Regression

Optimal Subsample Selection for Massive Logistic Regression with Distributed Data

Logistic Regression in Rare Events Data

Distributed Logistic Regression for Separated Massive Data

Logistic Regression Bias Correction for Large Scale Data with Rare Events

Robust Regression for Non-Randomly Distributed Big Data with Application to Nonconvex Penalized Learning

Distributed Estimation for Large-Scale Cox Regression with Poisson Subsampling

Robust and efficient subsampling algorithms for massive data logistic regression

Real-time semiparametric regression for distributed data sets

Learning from Local to Global - an Efficient Distributed Algorithm for Modeling Time-to-event Data

Distributed Ordinal Regression Over Networks

Distributed Censored Regression over Networks

Deterministic Subsampling for Logistic Regression with Massive Data

Mastering Rare Event Analysis: Optimal Subsample Size in Logistic and Cox Regressions

Communication‐efficient distributed large‐scale sparse multinomial logistic regression

Distributed Sparse Recursive Least-Squares over Networks

Minimax Bounds for Distributed Logistic Regression

Distributed Statistical Inference for Massive Data

Distributed quantile regression for massive heterogeneous data

Distributed Subsampling for Multiplicative Regression