Abstract: The Stochastic Gradient Descent method (SGD) and its stochastic variants have become methods of choice for solving finite-sum optimization problems arising from machine learning and data science thanks to their ability to handle large-scale applications and big datasets. In the last decades, researchers have made substantial efforts to study the theoretical performance of SGD and its shuffling variants. However, only limited work has investigated its shuffling momentum variants, including shuffling heavy-ball momentum schemes for non-convex problems and Nesterov's momentum for convex settings. In this work, we extend the analysis of the shuffling momentum gradient method developed in [Tran et al. (2021)] to both finite-sum convex and strongly convex optimization problems. We provide the first analysis of shuffling momentum-based methods for the strongly convex setting, attaining a convergence rate of $O(1/(nT^2))$, where $n$ is the number of samples and $T$ is the number of training epochs. Our analysis is state-of-the-art, matching the best rates of existing shuffling stochastic gradient algorithms in the literature.

What problem does this paper attempt to address?

### The problems the paper attempts to solve

The paper addresses finite-sum convex optimization problems that are common in machine learning and data science. Specifically, it studies the convergence of shuffling momentum variants of the Stochastic Gradient Descent method (SGD) in convex and strongly convex settings, where such methods are used for large-scale applications and large datasets.

### Background and motivation

1. **Problem background**:
   - SGD and its stochastic variants have become the preferred methods for solving finite-sum optimization problems in machine learning and data science because they can handle large-scale applications and large datasets.
   - The theoretical performance of SGD and its shuffling variants has been studied extensively, but relatively little work covers shuffling variants with momentum. Existing results concern heavy-ball momentum schemes for non-convex problems and Nesterov's momentum in convex settings.
2. **Research motivation**:
   - Extend the analysis of the shuffling momentum gradient method to convex and strongly convex optimization problems.
   - Provide the first analysis of shuffling momentum methods in the strongly convex setting, achieving a convergence rate of $O\left(\frac{1}{nT^2}\right)$, where $n$ is the number of samples and $T$ is the number of training epochs.
   - Improve the understanding of shuffling momentum methods by analyzing them in different settings.

### Main contributions

1. **Algorithm extension**:
   - Re-examine the Shuffling Momentum Gradient (SMG) algorithm developed in previous work [Tran et al. (2021)] and apply it to convex and strongly convex optimization problems.
   - Fill the gap between the non-convex and convex settings and provide a broader understanding of shuffling gradient methods and their momentum variants.
2. **Convergence rate analysis**:
   - Conduct the first analysis of the SMG algorithm in the strongly convex setting, achieving a convergence rate of $O\left(\frac{1}{nT^2}\right)$.
   - This rate matches the best rates of existing shuffling stochastic gradient algorithms in the literature.
3. **Technical assumptions and key derivations**:
   - State the technical assumptions of the analysis, including smoothness and convexity assumptions.
   - Provide the key lemmas and theorems that prove convergence of the algorithm in each setting.

### Related work

1. **Shuffling gradient methods**:
   - In the era of big data, random permutation methods are favored for their strong practical performance and simple implementation.
   - Recent research shows that shuffling methods have better theoretical convergence rates than SGD, especially in strongly convex settings.
2. **Momentum methods**:
   - Although significant progress has been made on shuffling variants of SGD, relatively little work covers shuffling adaptations of well-known momentum methods, such as the heavy-ball method and Adam-type algorithms with adaptive step sizes.
   - The paper addresses this gap with a detailed analysis of shuffling momentum methods.

### Experimental verification

1. **Experimental setup**:
   - Numerical experiments use a logistic regression model on binary classification problems. The datasets are from LIBSVM: w8a (49,749 samples) and ijcnn1 (91,701 samples).
   - The SMG algorithm is compared with standard SGD, SGD with momentum (SGD-M), and Adam.
2. **Experimental results**:
   - The experimental results are consistent with the theoretical analysis and show that the SMG algorithm performs well against the compared methods in the tested settings.

### Conclusion

The paper extends the shuffling momentum gradient method, with a detailed analysis, to convex and strongly convex finite-sum optimization problems. In particular, it gives the first analysis of a shuffling momentum method in the strongly convex setting and attains a rate of $O\left(\frac{1}{nT^2}\right)$, matching the best known rates for shuffling stochastic gradient algorithms. These results improve the understanding of such methods and support their use in practice.
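The summary above does not spell out the update rule, so the following is only a minimal sketch of how a shuffling gradient method with epoch-level momentum could look on an L2-regularized logistic regression problem. It follows one reading of the SMG scheme of Tran et al. (2021), in which the momentum vector is held fixed during an epoch and refreshed with the average of that epoch's gradients. That update rule, the constant per-epoch step size `lr / n`, the synthetic data (used instead of the LIBSVM datasets), and all function names (`make_data`, `logistic_grad`, `full_loss`, `shuffling_momentum_gradient`) are assumptions made for illustration, not details taken from the paper.

```python
import math
import random


def make_data(n, dim, seed=0):
    """Hypothetical synthetic binary classification data with labels in {-1, +1}."""
    rng = random.Random(seed)
    w_true = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    data = []
    for _ in range(n):
        x = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        score = sum(a * b for a, b in zip(w_true, x))
        data.append((x, 1.0 if score >= 0 else -1.0))
    return data


def logistic_grad(w, x, y, l2):
    """Gradient of log(1 + exp(-y * <w, x>)) + (l2 / 2) * ||w||^2 with respect to w."""
    margin = y * sum(wj * xj for wj, xj in zip(w, x))
    # Numerically stable evaluation of -y * sigmoid(-margin).
    if margin >= 0:
        e = math.exp(-margin)
        coeff = -y * e / (1.0 + e)
    else:
        coeff = -y / (1.0 + math.exp(margin))
    return [coeff * xj + l2 * wj for xj, wj in zip(x, w)]


def full_loss(w, data, l2):
    """Average regularized logistic loss over the whole dataset."""
    total = 0.0
    for x, y in data:
        margin = y * sum(wj * xj for wj, xj in zip(w, x))
        # Stable form of log(1 + exp(-margin)).
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total / len(data) + 0.5 * l2 * sum(wj * wj for wj in w)


def shuffling_momentum_gradient(data, dim, epochs, lr, beta, l2, seed=0):
    """Sketch of a shuffling gradient method with epoch-level momentum."""
    rng = random.Random(seed)
    n = len(data)
    w = [0.0] * dim
    m_anchor = [0.0] * dim  # momentum term, fixed within an epoch (assumed rule)
    order = list(range(n))
    step = lr / n  # assumed step size: constant, scaled by 1/n per sample
    for _ in range(epochs):
        rng.shuffle(order)  # draw a fresh random permutation each epoch
        v = [0.0] * dim     # running average of this epoch's gradients
        for i in order:     # one full pass, every sample visited exactly once
            x, y = data[i]
            g = logistic_grad(w, x, y, l2)
            for j in range(dim):
                v[j] += g[j] / n
                w[j] -= step * (beta * m_anchor[j] + (1.0 - beta) * g[j])
        m_anchor = v        # refresh the momentum once, at the end of the epoch
    return w


if __name__ == "__main__":
    data = make_data(200, 5, seed=1)
    w = shuffling_momentum_gradient(data, dim=5, epochs=20, lr=0.5, beta=0.5, l2=1e-3)
    print("initial loss:", full_loss([0.0] * 5, data, 1e-3))
    print("final loss:  ", full_loss(w, data, 1e-3))
```

Sampling without replacement through `rng.shuffle` is what distinguishes this from standard SGD, which draws an index independently at each step. Setting `beta=0` reduces the sketch to plain shuffling SGD with step size `lr / n`, which makes it easy to compare the two on the same data.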

Shuffling Momentum Gradient Algorithm for Convex Optimization

Variance-Reduced Shuffling Gradient Descent with Momentum for Finite-Sum Minimization

Shuffling Gradient-Based Methods for Nonconvex-Concave Minimax Optimization

Random Scaling and Momentum for Non-smooth Non-convex Optimization

Learning-rate-free Momentum SGD with Reshuffling Converges in Nonsmooth Nonconvex Optimization

Convergence Analysis of Distributed Stochastic Gradient Descent with Shuffling

Shuffling Gradient Descent-Ascent with Variance Reduction for Nonconvex-Strongly Concave Smooth Minimax Problems

Convergence rates of stochastic gradient method with independent sequences of step-size and momentum weight

Random Reshuffling with Momentum for Nonconvex Problems: Iteration Complexity and Last Iterate Convergence

Unified Convergence Analysis of Stochastic Momentum Methods for Convex and Non-convex Optimization

Empirical Risk Minimization with Shuffled SGD: A Primal-Dual Perspective and Improved Bounds

Shuffling-type gradient method with bandwidth-based step sizes for finite-sum optimization

Stochastic Gradient Descent with Nonlinear Conjugate Gradient-Style Adaptive Momentum

The Marginal Value of Momentum for Small Learning Rate SGD

Adaptive Random Walk Gradient Descent for Decentralized Optimization

Non-Convex Stochastic Composite Optimization with Polyak Momentum

Parallel Momentum Methods Under Biased Gradient Estimations

On Convergence of Incremental Gradient for Non-Convex Smooth Functions

Continuous Time Analysis of Momentum Methods

On the Last-Iterate Convergence of Shuffling Gradient Methods

A Unified Analysis of Stochastic Momentum Methods for Deep Learning