Abstract: Logistic regression is one of the most popular methods for binary classification. Its model parameters are estimated by solving the maximum likelihood (ML) optimization problem, and the ML estimator is defined as the optimal solution of this problem. It is well known that the ML estimator exists when the data is non-separable but fails to exist when the data is separable. First-order methods are the algorithms of choice for solving large-scale instances of the logistic regression problem. In this paper, we introduce a pair of condition numbers that measure the degree of non-separability or separability of a given dataset in the setting of binary classification, and we study how these condition numbers relate to and inform the properties and convergence guarantees of first-order methods. When the training data is non-separable, we show that the degree of non-separability naturally enters the analysis and informs the properties and convergence guarantees of two standard first-order methods: steepest descent (for any given norm) and stochastic gradient descent. Expanding on the work of Bach, we also show how the degree of non-separability enters into the analysis of linear convergence of steepest descent (without needing strong convexity), as well as the adaptive convergence of stochastic gradient descent. When the training data is separable, first-order methods rather curiously have good empirical success, which is not well understood in theory. In this case, we demonstrate how the degree of separability enters into the analysis of $\ell_2$ steepest descent and stochastic gradient descent for delivering approximate-maximum-margin solutions, together with associated computational guarantees. This suggests that first-order methods can lead to statistically meaningful solutions in the separable case, even though the ML estimator does not exist.
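The contrast between the separable and non-separable cases can be illustrated with a minimal sketch. It assumes a hypothetical four-point toy dataset, plain $\ell_2$ gradient descent with a fixed step size, and a linear model without an intercept. The function names (`logistic_grad`, `gradient_descent`, `normalized_margin`) are illustrative only, and the code does not compute the paper's condition numbers, which the abstract does not define. It only shows the qualitative behavior described above: on non-separable data the iterates settle at a finite minimizer, while on separable data their norm keeps growing and their direction separates the classes with a positive margin.

```python
import numpy as np

def logistic_grad(w, X, y):
    """Gradient of the average logistic loss mean(log(1 + exp(-y_i * x_i^T w)))."""
    margins = y * (X @ w)
    # sigmoid(-m) written via tanh for numerical stability
    sig_neg = 0.5 * (1.0 - np.tanh(0.5 * margins))
    return X.T @ (-y * sig_neg) / len(y)

def gradient_descent(X, y, step=0.5, iters=1000):
    """Fixed-step l2 gradient descent started from the origin (assumed setup)."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w -= step * logistic_grad(w, X, y)
    return w

def normalized_margin(w, X, y):
    """Smallest signed margin of the direction w / ||w||_2 over the dataset."""
    return np.min(y * (X @ w)) / np.linalg.norm(w)

# Hypothetical separable toy data: the direction (1, 1) classifies every point correctly.
X_sep = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y_sep = np.array([1.0, 1.0, -1.0, -1.0])

# One extra point with a conflicting label makes the data non-separable.
X_non = np.vstack([X_sep, [1.5, 1.5]])
y_non = np.append(y_sep, -1.0)

w_sep_short = gradient_descent(X_sep, y_sep, iters=400)
w_sep_long = gradient_descent(X_sep, y_sep, iters=4000)
w_non_short = gradient_descent(X_non, y_non, iters=400)
w_non_long = gradient_descent(X_non, y_non, iters=4000)

if __name__ == "__main__":
    print("separable:     ||w|| after 400 / 4000 iterations:",
          np.linalg.norm(w_sep_short), np.linalg.norm(w_sep_long))
    print("separable:     normalized margin:",
          normalized_margin(w_sep_long, X_sep, y_sep))
    print("non-separable: ||w|| after 400 / 4000 iterations:",
          np.linalg.norm(w_non_short), np.linalg.norm(w_non_long))
```

The step size of 0.5 is chosen to stay below the reciprocal of the smoothness constant of the average logistic loss on this particular toy data; other datasets would need their own choice.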

Condition Number Analysis of Logistic Regression, and its Implications for Standard First-Order Solution Methods
