Abstract: Convex regularizers are popular for sparse and low-rank learning, mainly because of their favorable statistical and optimization guarantees. However, they often lead to biased estimates, so the resulting sparsity and accuracy are not as good as desired. This motivates replacing convex regularizers with nonconvex ones. Many nonconvex regularizers have recently been proposed, and they can indeed achieve better performance than convex ones. In this survey, we summarize and compare algorithms that can be applied to learning with nonconvex regularizers.

1 General Non-Convex Optimization

Consider the general non-convex optimization problem¹

$$\min_x \; g(x) \tag{1}$$

Assumption 1 (Difference of Convex [1, 18, 14]). $g(x)$ can be decomposed as the difference of two convex (DoC) functions, i.e., $g(x) = \hat{g}(x) - \bar{g}(x)$.

We are mainly interested in non-smooth $g(x)$; otherwise, (1) can be solved simply with gradient descent. Under Assumption 1, (1) can be expressed as

$$\min_x \; \{\, g(x) \equiv \hat{g}(x) - \bar{g}(x) \,\},$$

where $\hat{g}(x)$ and $\bar{g}(x)$ are two convex functions.

Definition 1 (Critical point [1]). Under Assumption 1, a point $x^*$ is a critical point of $g(x)$ if it satisfies $0 \in \partial\hat{g}(x^*) - \partial\bar{g}(x^*)$.

Remark. For smooth functions, a critical point is a point where the gradient is zero.

Difference of Convex programming (DCP) [1] is a general technique for optimizing (1). It approximates $\bar{g}(x)$ by its first-order expansion at the current iterate and generates the sequence $\{x_t, y_t\}$ by

$$x_{t+1} = \arg\min_x \; \hat{g}(x) - \langle x - x_t, y_t \rangle, \qquad y_t \in \partial\bar{g}(x_t). \tag{2}$$

Although $g(x)$ is not convex, the sub-problem in (2) is convex, so algorithms for convex optimization can be applied. However, the sub-problem may not be easy to solve, which makes DCP slow in real-world applications. This method is used in [4, 19] for sparse coding with non-convex regularizers.

Theorem 1.1 (Convergence of DCP). Let Assumption 1 hold. Then the sequence $x_t$ generated by (2) converges to a critical point of (1).

¹ All functions mentioned in this paper are assumed to be bounded from below.
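To make the DCP iteration (2) concrete, the following is a minimal sketch on a toy problem that is not taken from the source. It assumes a least-squares fit with a capped-ℓ1 penalty, g(x) = 0.5‖x − b‖² + λ Σᵢ min(|xᵢ|, θ), split as ĝ(x) = 0.5‖x − b‖² + λ‖x‖₁ and ḡ(x) = λ Σᵢ max(|xᵢ| − θ, 0). With this choice the convex sub-problem happens to have a closed-form soft-thresholding solution, which is not the case in general. The function names (`dcp`, `subgrad_g_bar`, and so on) and the parameters `lam` and `theta` are hypothetical and chosen only for illustration.

```python
def soft_threshold(v, lam):
    """Proximal operator of lam * |.| applied to a scalar."""
    if v > lam:
        return v - lam
    if v < -lam:
        return v + lam
    return 0.0


def objective(x, b, lam, theta):
    """g(x) = 0.5 * ||x - b||^2 + lam * sum_i min(|x_i|, theta)."""
    fit = 0.5 * sum((xi - bi) ** 2 for xi, bi in zip(x, b))
    reg = lam * sum(min(abs(xi), theta) for xi in x)
    return fit + reg


def subgrad_g_bar(x, lam, theta):
    """One subgradient of g_bar(x) = lam * sum_i max(|x_i| - theta, 0)."""
    return [
        lam * (1.0 if xi > 0 else -1.0) if abs(xi) > theta else 0.0
        for xi in x
    ]


def dcp(b, lam, theta, x0, max_iter=100, tol=1e-10):
    """Run the DCP iteration (2) and record the objective at each iterate."""
    x = list(x0)
    history = [objective(x, b, lam, theta)]
    for _ in range(max_iter):
        # y_t is a subgradient of g_bar at the current iterate x_t.
        y = subgrad_g_bar(x, lam, theta)
        # Sub-problem: argmin_x 0.5*||x - b||^2 + lam*||x||_1 - <x - x_t, y_t>.
        # Completing the square shows it is solved coordinate-wise by
        # soft-thresholding b + y_t at level lam.
        x_new = [soft_threshold(bi + yi, lam) for bi, yi in zip(b, y)]
        history.append(objective(x_new, b, lam, theta))
        step = max(abs(u - v) for u, v in zip(x_new, x))
        x = x_new
        if step < tol:
            break
    return x, history


if __name__ == "__main__":
    x_hat, hist = dcp([3.0, 0.5, -2.0, 0.1], lam=1.0, theta=1.0,
                      x0=[0.0, 0.0, 0.0, 0.0])
    print("iterate:", x_hat)
    print("objective values:", hist)
```

Because ḡ is convex, its linearization at x_t is a lower bound on ḡ, so each step of (2) cannot increase g. The recorded objective values should therefore be non-increasing, which gives a simple sanity check for an implementation. The point returned is only guaranteed to be critical in the sense of Definition 1, not globally optimal.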

Learning with Non-Convex Truncated Losses by SGD

Noisy Truncated SGD: Optimization and Generalization

High Probability Guarantees for Nonconvex Stochastic Gradient Descent with Heavy Tails.

Improving Sparsity and Scalability in Regularized Nonconvex Truncated-Loss Learning Problems

An Alternative View: When Does SGD Escape Local Minima?

On Generalization Error Bounds of Noisy Gradient Methods for Non-Convex Learning

Large-scale Robust Regression with Truncated Loss Via Majorization-Minimization Algorithm

On the Unstable Convergence Regime of Gradient Descent

Stochastic Gradient Descent Introduces an Effective Landscape-Dependent Regularization Favoring Flat Solutions

Gradient Normalization Provably Benefits Nonconvex SGD under Heavy-Tailed Noise

Experimental Exploration on Loss Surface of Deep Neural Network

Beyond Convexity: Stochastic Quasi-Convex Optimization

A Comparison of Algorithms for Learning with Nonconvex Regularization

Demystifying the Myths and Legends of Nonconvex Convergence of SGD

Generalized Correntropy Induced Loss Function for Deep Learning.

Generalization Bounds of SGLD for Non-convex Learning: Two Theoretical Viewpoints.

Learning Surrogate Losses

Truncated Non-Uniform Quantization for Distributed SGD

Implicit Regularization or Implicit Conditioning? Exact Risk Trajectories of SGD in High Dimensions

An Majorize-Minimize Algorithm Framework for Large Scale Truncated Loss Classifiers

Risk Bounds of Accelerated SGD for Overparameterized Linear Regression