Abstract:<p class="a-plus-plus">Our main goal in this paper is to show that one can skip gradient computations for gradient-descent-type methods applied to certain structured convex programming (CP) problems. To this end, we first present an accelerated gradient sliding (AGS) method for minimizing the sum of two smooth convex functions with different Lipschitz constants. We show that the AGS method can skip the gradient computation for one of these smooth components without slowing down the overall optimal rate of convergence. This result is much sharper than the classic black-box CP complexity results, especially when the difference between the two Lipschitz constants associated with these components is large. We then consider an important class of bilinear saddle point problems whose objective function is the sum of a smooth component and a nonsmooth one with a bilinear saddle point structure. Using the aforementioned AGS method for smooth composite optimization and Nesterov's smoothing technique, we show that one needs only <span class="a-plus-plus inline-equation id-i-eq1"><span class="a-plus-plus equation-source format-t-e-x">$\mathcal{O}(1/\sqrt{\varepsilon})$</span></span> gradient computations for the smooth component while still preserving the optimal <span class="a-plus-plus inline-equation id-i-eq2"><span class="a-plus-plus equation-source format-t-e-x">$\mathcal{O}(1/\varepsilon)$</span></span> overall iteration complexity for solving these saddle point problems. We demonstrate that even more significant savings on gradient computations can be obtained for strongly convex smooth and bilinear saddle point problems.</p>
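To make the idea of skipping gradient computations concrete, here is a minimal sketch in Python. It is not the AGS method of the paper: it is a simplified, non-accelerated sliding loop, and the function name `sliding_sketch`, the parameters `beta` and `eta`, and the diagonal quadratic test problem are all assumptions chosen for illustration. Each outer iteration evaluates the gradient of the component with the small Lipschitz constant once, then reuses it across several inner gradient steps on the component with the large Lipschitz constant.

```python
import numpy as np


def sliding_sketch(grad_f, grad_h, x0, L, M, outer_iters, inner_iters):
    """Illustrative (non-accelerated) gradient sliding loop.

    Assumption: f is L-smooth, h is M-smooth, with M >= L.
    This is NOT the AGS parameter scheme from the paper.
    """
    x = x0.copy()
    n_f = 0  # number of grad_f evaluations
    n_h = 0  # number of grad_h evaluations
    beta = L                 # proximal weight of the inner subproblem (assumed choice)
    eta = 1.0 / (M + beta)   # inner stepsize: the subproblem is (M + beta)-smooth
    for _ in range(outer_iters):
        g = grad_f(x)        # one gradient of f per outer iteration
        n_f += 1
        u = x.copy()
        # Inner loop: gradient steps on  <g, u> + h(u) + (beta / 2) * ||u - x||^2,
        # reusing the same g, so grad_f is skipped here.
        for _ in range(inner_iters):
            u = u - eta * (g + grad_h(u) + beta * (u - x))
            n_h += 1
        x = u
    return x, n_f, n_h


# Hypothetical test problem: two diagonal convex quadratics with L = 1 and M = 100.
n = 50
rng = np.random.default_rng(0)
A = np.diag(np.linspace(0.1, 1.0, n))    # Hessian of f, largest eigenvalue L = 1
B = np.diag(np.linspace(1.0, 100.0, n))  # Hessian of h, largest eigenvalue M = 100
b = rng.standard_normal(n)
c = rng.standard_normal(n)

grad_f = lambda x: A @ x - b  # f(x) = 0.5 * x'Ax - b'x
grad_h = lambda x: B @ x - c  # h(x) = 0.5 * x'Bx - c'x

x, n_f, n_h = sliding_sketch(grad_f, grad_h, np.zeros(n), L=1.0, M=100.0,
                             outer_iters=200, inner_iters=20)
x_star = np.linalg.solve(A + B, b + c)  # exact minimizer of f + h

print("grad_f calls:", n_f)
print("grad_h calls:", n_h)
print("distance to minimizer:", np.linalg.norm(x - x_star))
```

In this toy run the gradient of `f` is evaluated once for every 20 evaluations of the gradient of `h`. The accelerated method in the paper uses a different, carefully chosen set of stepsizes and averaging weights to reach the optimal rates quoted in the abstract.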

Composing Optimized Stepsize Schedules for Gradient Descent

Accelerated Gradient Descent by Concatenation of Stepsize Schedules

Accelerated Gradient Descent via Long Steps

Accelerated Objective Gap and Gradient Norm Convergence for Gradient Descent via Long Steps

Accelerating Proximal Gradient Descent via Silver Stepsizes

Acceleration by Stepsize Hedging I: Multi-Step Descent and the Silver Stepsize Schedule

Provably Faster Gradient Descent via Long Steps

Locally Optimal Descent for Dynamic Stepsize Scheduling

Asynchronous Proximal Stochastic Gradient Algorithm for Composition Optimization Problems

Adaptive Accelerated Composite Minimization

Stochastically Controlled Compositional Gradient for Composition Problems

Anytime Acceleration of Gradient Descent

Acceleration by Random Stepsizes: Hedging, Equalization, and the Arcsine Stepsize Schedule

A Strengthened Conjecture on the Minimax Optimal Constant Stepsize for Gradient Descent

Two efficient gradient methods with approximately optimal stepsizes based on regularization models for unconstrained optimization

A Smoothing Stochastic Gradient Method for Composite Optimization

Universality of AdaGrad Stepsizes for Stochastic Optimization: Inexact Oracle, Acceleration and Variance Reduction

Accelerated gradient sliding for structured convex optimization

An Improved Gradient Method with Approximately Optimal Stepsize Based on Conic model for Unconstrained Optimization

Gradient Methods with Adaptive Step-Sizes