Abstract: The doubly stochastic functional gradient descent algorithm (DSG) is memory friendly and computationally efficient, and can effectively scale up kernel methods. However, on highly ill-conditioned large-scale nonlinear machine learning problems, DSG converges slowly. The reason is that the Hessian matrix of such a problem has a large condition number, which makes stochastic gradient methods converge very slowly. Gradient preconditioning is a well-established optimization technique for reducing the condition number. We therefore propose a preconditioned doubly stochastic functional gradient descent algorithm (P-DSG) that combines DSG with gradient preconditioning. In each iteration, P-DSG first uses gradient preconditioning to adaptively scale the individual components of the estimated functional gradient obtained by DSG, and then uses the preconditioned functional gradient as the descent direction. In theory, an appropriate preconditioner is the inverse of the Hessian matrix at the optimum, which is hard to obtain because of its high computational cost. We therefore first approximate the Hessian matrix by an empirical covariance matrix of random Fourier features, and then perform a low-rank approximation of that empirical covariance matrix.
P-DSG has a fast convergence rate of $\mathcal{O}(1/t)$ and yields a smaller constant factor in the bound than DSG, while remaining memory friendly at $\mathcal{O}(t)$ and computationally efficient at $\mathcal{O}(td)$. Finally, we evaluate P-DSG on kernel ridge regression, kernel support vector machines, and kernel logistic regression. The experimental results show that P-DSG speeds up convergence and achieves better performance.
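
The preconditioning idea described above can be illustrated with a small sketch. It is not the authors' P-DSG implementation: it uses full (non-stochastic) gradients on a ridge regression problem in random Fourier feature space, and it assumes one common spectral construction for the preconditioner, which flattens the top-k eigendirections of the empirical covariance matrix down to the curvature of the (k+1)-th direction. The function names, the Gaussian kernel with unit bandwidth, the synthetic data, and all parameter values are assumptions chosen for illustration.

```python
import numpy as np


def random_fourier_features(X, omegas, biases):
    """Map inputs X (n, d) to random Fourier features of shape (n, D)."""
    D = omegas.shape[1]
    return np.sqrt(2.0 / D) * np.cos(X @ omegas + biases)


def low_rank_preconditioner(Z, k, reg):
    """Return (P, H): a rank-k spectral preconditioner and the ridge Hessian.

    H = C + reg * I, where C is the empirical covariance of the features Z.
    P rescales the top-k eigendirections of C so that their curvature matches
    that of the (k+1)-th direction; all other directions are left untouched.
    (Assumed construction, not necessarily the one used in the paper.)
    """
    n, D = Z.shape
    C = Z.T @ Z / n
    eigvals, eigvecs = np.linalg.eigh(C)  # eigenvalues in ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # make them descending
    U = eigvecs[:, :k]
    scale = 1.0 - (eigvals[k] + reg) / (eigvals[:k] + reg)
    P = np.eye(D) - (U * scale) @ U.T
    H = C + reg * np.eye(D)
    return P, H


def gradient_descent(H, b, P, steps):
    """Minimize 0.5 * w^T H w - b^T w, stepping along P @ gradient."""
    step = 1.0 / np.linalg.norm(P @ H, 2)  # 1 / largest singular value of P H
    w = np.zeros_like(b)
    for _ in range(steps):
        w = w - step * (P @ (H @ w - b))
    return w


def objective(H, b, w):
    """Ridge objective up to an additive constant."""
    return 0.5 * w @ H @ w - b @ w


# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
n, d, D, k, reg = 500, 3, 64, 10, 1e-3
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

# Frequencies for a Gaussian kernel with unit bandwidth (assumption).
omegas = rng.normal(size=(d, D))
biases = rng.uniform(0.0, 2.0 * np.pi, size=D)
Z = random_fourier_features(X, omegas, biases)

P, H = low_rank_preconditioner(Z, k, reg)
b = Z.T @ y / n

w_plain = gradient_descent(H, b, np.eye(D), steps=200)
w_pre = gradient_descent(H, b, P, steps=200)

print("cond(H)     =", np.linalg.cond(H))
print("cond(P @ H) =", np.linalg.cond(P @ H))
print("objective, plain GD:         ", objective(H, b, w_plain))
print("objective, preconditioned GD:", objective(H, b, w_pre))
```

Because P and H share eigenvectors in this construction, the preconditioned operator P @ H has a condition number of (λ_{k+1} + reg) / (λ_D + reg) instead of (λ_1 + reg) / (λ_D + reg), which is why the preconditioned iteration can take larger effective steps along the poorly conditioned directions.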

Iterative Kernel Regression with Preconditioning

Preconditioned Krylov solvers for kernel regression

Generalized Convexity-Based Inexact Projection Method for Multiple Kernel Learning

On the Nystrom Approximation for Preconditioning in Kernel Machines

FALKON: An Optimal Large Scale Kernel Method

Have ASkotch: Fast Methods for Large-scale, Memory-constrained Kernel Ridge Regression

The Kernel Conjugate Gradient Algorithms.

Optimal learning rates for Kernel Conjugate Gradient regression

Robust, randomized preconditioning for kernel ridge regression

Faster doubly stochastic functional gradient by gradient preconditioning for scalable kernel methods

Efficient kernel surrogates for neural network-based regression

Isolation Kernel: The X Factor in Efficient and Effective Large Scale Online Kernel Learning

Online Kernel Learning with a Near Optimal Sparsity Bound

Efficient Online Learning for Large-Scale Sparse Kernel Logistic Regression

Conjugate Gradients for Kernel Machines

Supervised Kernel Thinning

Convergence Analysis of Kernel Conjugate Gradient for Functional Linear Regression

Regularized Regression Problem in hyper-RKHS for Learning Kernels

How to Scale Up Kernel Methods to Be As Good As Deep Neural Nets

Large Scale Constrained Linear Regression Revisited: Faster Algorithms via Preconditioning

Learning Analysis of Kernel Ridgeless Regression with Asymmetric Kernel Learning