Abstract: Convex regularizers are popular for sparse and low-rank learning, mainly because of their good statistical and optimization guarantees. However, they often lead to biased estimates, so the resulting sparsity and accuracy are not as good as desired. This motivates replacing convex regularizers with nonconvex ones. Many nonconvex regularizers have recently been proposed, and they can indeed achieve better performance than convex ones. In this survey, we summarize and compare algorithms that can be applied to learning with nonconvex regularizers.

1 General Non-Convex Optimization

Consider the general non-convex optimization problem

$$\min_x \; g(x), \qquad (1)$$

where all functions mentioned in this paper are assumed to be bounded from below.

Assumption 1 (Difference of Convex [1, 18, 14]). $g(x)$ can be decomposed as the difference of two convex (DoC) functions, i.e., $g(x) = \hat{g}(x) - \bar{g}(x)$.

We are more interested in non-smooth $g(x)$; otherwise, (1) can be solved simply with gradient descent. With Assumption 1, (1) can be expressed as

$$\min_x \; \{ g(x) \equiv \hat{g}(x) - \bar{g}(x) \},$$

where $\hat{g}(x)$ and $\bar{g}(x)$ are two convex functions.

Definition 1 (Critical point [1]). Under Assumption 1, a point $x^*$ is a critical point of $g(x)$ if it satisfies $0 \in \partial \hat{g}(x^*) - \partial \bar{g}(x^*)$.

Remark. For smooth functions, a critical point is a point where the gradient is zero.

Difference of Convex programming (DCP) [1] is a general technique for optimizing (1). It approximates $\bar{g}(x)$ by its first-order expansion and generates $\{x_t, y_t\}$ by

$$x_{t+1} = \arg\min_x \; \hat{g}(x) - \langle x - x_t, y_t \rangle, \quad y_t \in \partial \bar{g}(x_t). \qquad (2)$$

Although $g(x)$ is not convex, the sub-problem in (2) is convex, so algorithms for convex optimization can be applied. However, the sub-problem may not be easy to solve, which makes DCP slow in real-world applications. This method is used in [4, 19] for sparse coding with non-convex regularizers.

Theorem 1.1 (Convergence of DCP). Let Assumption 1 hold. Then the sequence $x_t$ generated by (2) converges to a critical point of (1).
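To make the DCP iteration (2) concrete, the following is a minimal sketch on a toy problem that is not taken from the survey. It assumes $\hat{g}(x) = \frac{1}{2}\|x - b\|_2^2 + \lambda\|x\|_1$ and $\bar{g}(x) = \lambda\|x\|_2$, so that $g$ is a least-squares fit with an $\ell_{1-2}$ penalty. Under this assumption the convex sub-problem has a closed-form solution (soft-thresholding of $b + y_t$), which is not true in general. The names `dcp`, `subgrad_l2`, `soft_threshold`, and `objective` are hypothetical helpers introduced for illustration.

```python
import math

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1, applied elementwise."""
    return [math.copysign(max(abs(vi) - tau, 0.0), vi) for vi in v]

def subgrad_l2(x, lam):
    """Return one element y of the subdifferential of lam * ||x||_2.

    At x = 0 the zero vector is a valid subgradient.
    """
    norm = math.sqrt(sum(xi * xi for xi in x))
    if norm == 0.0:
        return [0.0] * len(x)
    return [lam * xi / norm for xi in x]

def objective(x, b, lam):
    """g(x) = 0.5 * ||x - b||^2 + lam * (||x||_1 - ||x||_2)."""
    fit = 0.5 * sum((xi - bi) ** 2 for xi, bi in zip(x, b))
    l1 = sum(abs(xi) for xi in x)
    l2 = math.sqrt(sum(xi * xi for xi in x))
    return fit + lam * (l1 - l2)

def dcp(b, lam, x0, max_iter=100, tol=1e-10):
    """Run the DCP iteration (2) for the toy problem described above."""
    x = list(x0)
    history = [objective(x, b, lam)]
    for _ in range(max_iter):
        # y_t in the subdifferential of g_bar at x_t
        y = subgrad_l2(x, lam)
        # Convex sub-problem:
        #   argmin_x 0.5*||x - b||^2 + lam*||x||_1 - <x - x_t, y_t>
        # which equals soft-thresholding of (b + y_t) with threshold lam.
        x_new = soft_threshold([bi + yi for bi, yi in zip(b, y)], lam)
        history.append(objective(x_new, b, lam))
        step = max(abs(u - v) for u, v in zip(x_new, x))
        x = x_new
        if step < tol:
            break
    return x, history

# Example call with illustrative data (assumed, not from the survey).
x_star, hist = dcp(b=[3.0, 0.5, -2.0], lam=1.0, x0=[3.0, 0.5, -2.0])
```

In this sketch `hist` records $g(x_t)$ at every iterate, which should be non-increasing, and the returned point should be a fixed point of the update, which corresponds to the critical-point condition in Definition 1 for this particular choice of $\hat{g}$ and $\bar{g}$.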

Fast Learning with Nonconvex L1-2 Regularization.

Fast Learning of Nonconvex $\ell_{1-2}$-Regularizer using the Proximal Gradient Algorithm

Fast Low-Rank Matrix Learning with Nonconvex Regularization

Efficient Learning with a Family of Nonconvex Regularizers by Redistributing Nonconvexity.

A Comparison of Algorithms for Learning with Nonconvex Regularization

Revisiting $L_q(0\leq q<1)$ Norm Regularized Optimization

GIST: General Iterative Shrinkage and Thresholding for Non-convex Sparse Learning

Linear Convergence of Inexact Descent Method and Inexact Proximal Gradient Algorithms for Lower-Order Regularization Problems

Nonconvex Sparse Logistic Regression Via Proximal Gradient Descent.

Feature Selection with $\ell_{2,1-2}$ Regularization

Proximal Quasi-Newton for Computationally Intensive L1-regularized M-estimators

Reduced-Space Iteratively Reweighted Second-Order Methods for Nonconvex Sparse Regularization

Nonconvex Sparse Representation With Slowly Vanishing Gradient Regularizers

Fast Sparse Recovery Via Non-Convex Optimization

Training Compact DNNs with $\ell_{1/2}$ Regularization

On Convergence Rates of Linearized Proximal Algorithms for Convex Composite Optimization with Applications.

Low-rank Tensor Learning with Nonconvex Overlapped Nuclear Norm Regularization

Nonconvex Sparse Logistic Regression with Weakly Convex Regularization

High-dimensional Inference Via Lipschitz Sparsity-Yielding Regularizers.

Minimizing $L_1$ over $L_2$ norms on the gradient
