Abstract: The doubly stochastic functional gradient descent algorithm (DSG) is memory friendly and computationally efficient, and can effectively scale up kernel methods. However, on highly ill-conditioned large-scale nonlinear machine learning problems, DSG converges slowly, because the Hessian matrix of such problems has a large condition number, which slows down stochastic gradient methods. Gradient preconditioning is a well-established optimization technique for reducing the condition number. We therefore propose a preconditioned doubly stochastic functional gradient descent algorithm (P-DSG) that combines DSG with gradient preconditioning. In each iteration, P-DSG uses the preconditioner to adaptively scale the individual components of the functional gradient estimated by DSG, and then uses the preconditioned functional gradient as the descent direction. In theory, an appropriate preconditioner is the inverse of the Hessian matrix at the optimum, which is expensive to compute. We therefore approximate the Hessian matrix by an empirical covariance matrix of random Fourier features, and then take a low-rank approximation of that empirical covariance matrix. 
P-DSG has a fast convergence rate of $\mathcal{O}(1/t)$ and yields a smaller constant factor in the bound than DSG, while remaining memory friendly at $\mathcal{O}(t)$ and computationally efficient at $\mathcal{O}(td)$. Finally, we test P-DSG on kernel ridge regression, kernel support vector machines, and kernel logistic regression. The experimental results show that P-DSG speeds up convergence and achieves better performance.
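
The preconditioner construction described above (empirical covariance of random Fourier features, followed by a low-rank approximation) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the Gaussian frequency sampling, the ridge term `reg`, and in particular the rule that shrinks the top eigen-directions down to the level of the next eigenvalue are all assumptions made for illustration; the abstract does not specify how the low-rank factors are turned into a preconditioner.

```python
import numpy as np

def random_fourier_features(X, W, b):
    # Standard cosine feature map z(x) = sqrt(2/D) * cos(W^T x + b)
    # for a shift-invariant kernel (hypothetical helper name).
    D = W.shape[1]
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

def low_rank_preconditioner(Z, rank, reg):
    # Empirical covariance of the features, used as a stand-in for the Hessian.
    n, D = Z.shape
    C = Z.T @ Z / n
    eigvals, eigvecs = np.linalg.eigh(C)      # eigenvalues in ascending order
    top_vals = eigvals[-rank:] + reg          # top-`rank` eigenvalues of C + reg*I
    floor = eigvals[-rank - 1] + reg          # the next eigenvalue below them
    U = eigvecs[:, -rank:]                    # matching eigenvectors (D x rank)
    # Assumed rescaling: shrink each top direction so that its curvature
    # matches `floor`, and leave the remaining directions untouched.
    shrink = 1.0 - floor / top_vals
    P = np.eye(D) - (U * shrink) @ U.T
    return P, C

# Synthetic data, purely for illustration.
rng = np.random.default_rng(0)
n, d, D, rank, reg = 500, 5, 50, 10, 1e-3
X = rng.normal(size=(n, d))
W = rng.normal(size=(d, D))                   # Gaussian frequencies (RBF-type kernel)
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

Z = random_fourier_features(X, W, b)
P, C = low_rank_preconditioner(Z, rank, reg)

H = C + reg * np.eye(D)                       # assumed ridge-regression-style Hessian
cond_before = np.linalg.cond(H)
cond_after = np.linalg.cond(P @ H)
print(f"condition number before: {cond_before:.1f}, after: {cond_after:.1f}")
```

Under these assumptions, `P @ H` keeps the small eigenvalues of `H` and flattens the `rank` largest ones, so its condition number cannot exceed that of `H`. In a preconditioned update, a gradient estimate `g` would be replaced by `P @ g` as the descent direction.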

Scalable Kernel Ordinal Regression Via Doubly Stochastic Gradients

Kernel Discriminant Learning for Ordinal Regression

Fast Bounded Online Gradient Descent Algorithms for Scalable Kernel-Based Online Learning

Scaling Up Generalized Kernel Methods

Incremental Sparse Bayesian Ordinal Regression

Faster doubly stochastic functional gradient by gradient preconditioning for scalable kernel methods

Triply Stochastic Gradients on Multiple Kernel Learning

Incremental learning algorithm for large-scale semi-supervised ordinal regression

Large Scale Online Kernel Classification

Asynchronous Doubly Stochastic Sparse Kernel Learning

Incremental Support Vector Learning for Ordinal Regression

Domination-Based Ordinal Regression for Expensive Multi-Objective Optimization

Scaling up stochastic gradient descent for non-convex optimisation

Solving Large-Scale Support Vector Ordinal Regression with Asynchronous Parallel Coordinate Descent Algorithms

Parallel Algorithm for Optimal Threshold Labeling of Ordinal Regression Methods

Demystifying SGD with Doubly Stochastic Gradients

Scalable Dual Coordinate Descent for Kernel Methods

Scalable Semi-Supervised SVM Via Triply Stochastic Gradients

Limited Memory Online Gradient Descent for Kernelized Pairwise Learning with Dynamic Averaging

Fast Second-Order Online Kernel Learning through Incremental Matrix Sketching and Decomposition