Abstract: The momentum technique has recently emerged as an effective strategy for accelerating the convergence of gradient descent (GD) methods, and it exhibits improved performance in deep learning as well as regularized learning. Typical momentum examples include Nesterov's accelerated gradient (NAG) and heavy-ball (HB) methods. However, so far, almost all acceleration analyses are limited to NAG, and few investigations of the acceleration of HB have been reported. In this article, we address the convergence of the last iterate of HB in nonsmooth optimization with constraints, which we name individual convergence. This question is significant in machine learning, where constraints must be imposed on the learning structure and the individual output is needed to effectively guarantee this structure while keeping an optimal rate of convergence. Specifically, we prove that HB achieves an individual convergence rate of $O(1/\sqrt{t})$, where $t$ is the number of iterations. This indicates that both momentum methods can accelerate the individual convergence of basic GD to the optimal rate. Even for the convergence of averaged iterates, our result avoids the disadvantages of previous work, which restricts the optimization problem to be unconstrained and requires the number of iterations to be fixed in advance. The novel convergence analysis presented in this article provides a clear understanding of how HB momentum can accelerate the individual convergence and reveals more insights into the similarities and differences in deriving the averaging and individual convergence rates. The derived optimal individual convergence is extended to regularized and stochastic settings, in which an individual solution can be produced by the projection-based operation. In contrast to the averaged output, the sparsity can be reduced remarkably without sacrificing the theoretical optimal rates. Several real experiments demonstrate the performance of the HB momentum strategy.
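To make the setting in the abstract concrete, the following is a minimal sketch of a projected heavy-ball subgradient iteration that returns its last iterate (the "individual" output) rather than an average. Everything specific here is an assumption for illustration and is not taken from the article: the toy objective (an L1 distance to a hypothetical `target`), the constraint set (an L2 ball), and the step-size and momentum schedules `alpha / ((t + 2) * sqrt(t))` and `t / (t + 2)`.

```python
import math


def project_l2_ball(w, radius=1.0):
    """Euclidean projection of w onto the ball {x : ||x||_2 <= radius}."""
    norm = math.sqrt(sum(x * x for x in w))
    if norm <= radius:
        return list(w)
    scale = radius / norm
    return [x * scale for x in w]


def objective(w, target):
    """Nonsmooth convex toy objective: f(w) = sum_i |w_i - target_i|."""
    return sum(abs(x - t) for x, t in zip(w, target))


def subgradient(w, target):
    """One valid subgradient of the toy objective at w."""
    return [1.0 if x > t else -1.0 if x < t else 0.0 for x, t in zip(w, target)]


def projected_heavy_ball(target, steps=2000, radius=1.0, alpha=1.0):
    """Projected heavy-ball subgradient method; returns the last iterate.

    The schedules below are illustrative assumptions, not the article's.
    """
    dim = len(target)
    w_prev = [0.0] * dim
    w = [0.0] * dim
    for t in range(1, steps + 1):
        g = subgradient(w, target)
        step = alpha / ((t + 2) * math.sqrt(t))  # assumed decaying step size
        beta = t / (t + 2)                       # assumed momentum weight
        # Heavy-ball update: subgradient step plus momentum term, then projection.
        candidate = [
            wi - step * gi + beta * (wi - wpi)
            for wi, gi, wpi in zip(w, g, w_prev)
        ]
        w_prev, w = w, project_l2_ball(candidate, radius)
    return w


if __name__ == "__main__":
    target = [2.0, -1.0, 0.5]  # lies outside the unit ball, so the constraint is active
    w_last = projected_heavy_ball(target)
    print("last iterate:", w_last)
    print("objective at last iterate:", objective(w_last, target))
```

Because the output is a single projected iterate, it always satisfies the constraint exactly, which is the structural property the abstract attributes to the individual output. Whether these particular schedules attain the stated $O(1/\sqrt{t})$ rate is not claimed here.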

Accelerated analysis on the triple momentum method for a two-layer ReLU neural network

Provable Acceleration of Nesterov's Accelerated Gradient Method over Heavy Ball Method in Training Over-Parameterized Neural Networks

Unified Convergence Analysis of Stochastic Momentum Methods for Convex and Non-convex Optimization

Stochastic Momentum Method with Double Acceleration for Regularized Empirical Risk Minimization

Provable convergence of Nesterov’s accelerated gradient method for over-parameterized neural networks

A Unified Analysis of Stochastic Momentum Methods for Deep Learning

Last-iterate convergence analysis of stochastic momentum methods for neural networks

A convergence analysis of Nesterov’s accelerated gradient method in training deep linear neural networks

A Momentum Accelerated Algorithm for ReLU-Based Nonlinear Matrix Decomposition

Convergence Analysis of Two-layer Neural Networks with ReLU Activation

Multi-stage stochastic gradient method with momentum acceleration

Convergence Rates of Training Deep Neural Networks Via Alternating Minimization Methods

Accelerated Gradient-free Neural Network Training by Multi-convex Alternating Optimization

Momentum Acceleration in the Individual Convergence of Nonsmooth Convex Optimization With Constraints

Understanding Multi-phase Optimization Dynamics and Rich Nonlinear Behaviors of ReLU Networks

Continuous Time Analysis of Momentum Methods

A Convergence Theory Towards Practical Over-parameterized Deep Neural Networks

A Convergent ADMM Framework for Efficient Neural Network Training

Analysis of Boundedness and Convergence of Online Gradient Method for Two-Layer Feedforward Neural Networks