Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes

Ohad Shamir, Tong Zhang
2013-02-13
Abstract: Stochastic Gradient Descent (SGD) is one of the simplest and most popular stochastic optimization methods. While it has already been theoretically studied for decades, the classical analysis usually required non-trivial smoothness assumptions, which do not apply to many modern applications of SGD with non-smooth objective functions such as support vector machines. In this paper, we investigate the performance of SGD \emph{without} such smoothness assumptions, as well as a running average scheme to convert the SGD iterates to a solution with optimal optimization accuracy. In this framework, we prove that after $T$ rounds, the suboptimality of the \emph{last} SGD iterate scales as $O(\log(T)/\sqrt{T})$ for non-smooth convex objective functions, and $O(\log(T)/T)$ in the non-smooth strongly convex case. To the best of our knowledge, these are the first bounds of this kind, and almost match the minimax-optimal rates obtainable by appropriate averaging schemes. We also propose a new and simple averaging scheme, which not only attains optimal rates, but can also be easily computed on-the-fly (in contrast, the suffix averaging scheme proposed in \citet{RakhShaSri12arxiv} is not as simple to implement). Finally, we provide some experimental illustrations.
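To make the idea of an averaging scheme that is "computed on-the-fly" concrete, here is a minimal sketch in Python. The abstract does not state the averaging formula, so the polynomial-decay weighting used below, the parameter name `eta`, the function names, and the toy objective are all assumptions made for illustration. The point is only that such an average needs a single extra variable updated once per step, whereas a suffix average needs the start of the suffix fixed in advance or earlier iterates stored.

```python
import random


def sgd_with_running_average(subgradient, w0, steps, step_size, eta=3.0):
    """Run (sub)gradient steps and keep a running weighted average of the iterates.

    Assumed update (not taken from the abstract):
        rho_t   = (eta + 1) / (t + eta)
        w_avg_t = (1 - rho_t) * w_avg_{t-1} + rho_t * w_t
    With eta = 0 this is the plain arithmetic mean of w_1..w_t;
    larger eta puts more weight on recent iterates.
    """
    w = w0
    w_avg = w0  # overwritten at t = 1, because rho_1 == 1
    for t in range(1, steps + 1):
        g = subgradient(w)            # possibly noisy subgradient at w
        w = w - step_size(t) * g      # SGD step
        rho = (eta + 1.0) / (t + eta)
        w_avg = (1.0 - rho) * w_avg + rho * w
    return w, w_avg


if __name__ == "__main__":
    # Hypothetical toy problem: F(w) = |w - 1| + (lam / 2) * w**2,
    # which is non-smooth and lam-strongly convex, with minimizer w = 1.
    lam = 0.1
    rng = random.Random(0)

    def noisy_subgradient(w):
        sign = 1.0 if w > 1.0 else -1.0
        return sign + lam * w + rng.gauss(0.0, 1.0)

    last, avg = sgd_with_running_average(
        noisy_subgradient, w0=0.0, steps=10_000,
        step_size=lambda t: 1.0 / (lam * t),
    )
    print("last iterate:", last, "running average:", avg)
```

The step size `1 / (lam * t)` is the usual choice for a `lam`-strongly convex objective; for a merely convex objective a step size proportional to `1 / sqrt(t)` would be the usual substitute.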