AdAdaGrad: Adaptive Batch Size Schemes for Adaptive Gradient Methods

Tim Tsz-Kit Lau, Han Liu, Mladen Kolar
2024-05-28
Abstract: The choice of batch sizes in minibatch stochastic gradient optimizers is critical in large-scale model training for both optimization and generalization performance. Although large-batch training is arguably the dominant training paradigm for large-scale deep learning due to hardware advances, the generalization performance of the model deteriorates compared to small-batch training, leading to the so-called "generalization gap" phenomenon. To mitigate this, we investigate adaptive batch size strategies derived from adaptive sampling methods, originally developed only for stochastic gradient descent. Given the significant interplay between learning rates and batch sizes, and considering the prevalence of adaptive gradient methods in deep learning, we emphasize the need for adaptive batch size strategies in these contexts. We introduce AdAdaGrad and its scalar variant AdAdaGradNorm, which progressively increase batch sizes during training, while model updates are performed using AdaGrad and AdaGradNorm. We prove that AdAdaGradNorm converges with high probability at a rate of $\mathscr{O}(1/K)$ to find a first-order stationary point of smooth nonconvex functions within $K$ iterations. AdAdaGrad also demonstrates similar convergence properties when integrated with a novel coordinate-wise variant of our adaptive batch size strategies. We corroborate our theoretical claims by performing image classification experiments, highlighting the merits of the proposed schemes in terms of both training efficiency and model generalization. Our work unveils the potential of adaptive batch size strategies for adaptive gradient optimizers in large-scale model training.
Machine Learning, Optimization and Control
What problem does this paper attempt to address?
This paper addresses how to choose batch sizes in large-scale model training, particularly when training with adaptive gradient methods such as AdaGrad and AdaGradNorm. Specifically, it studies how adaptive batch size strategies can reduce the generalization gap of large-batch training while retaining the efficiency of large batches later in training. The paper proposes two adaptive batch size schemes, AdAdaGrad and AdAdaGradNorm, and proves their convergence on smooth nonconvex functions. It also validates the proposed methods through image classification experiments, demonstrating their advantages in training efficiency and model generalization.

### Key Point Summary:

1. **Problem Background**:
   - In large-scale model training, the choice of batch size is crucial for both optimization and generalization performance.
   - Although hardware advances have made large-batch training common, it degrades the model's generalization performance compared to small-batch training, the so-called "generalization gap" phenomenon.
   - Adaptive gradient methods (such as AdaGrad and AdaGradNorm) are widely used in deep learning, but existing adaptive batch size strategies were developed mainly for stochastic gradient descent (SGD).
2. **Research Objectives**:
   - Explore adaptive batch size strategies suitable for adaptive gradient methods.
   - Propose new adaptive batch size schemes (AdAdaGrad and AdAdaGradNorm) and prove their convergence.
   - Verify experimentally the advantages of the new methods in training efficiency and model generalization.
3. **Technical Contributions**:
   - Establish a sublinear convergence rate of $\mathscr{O}(1/K)$, holding with high probability, for AdAdaGradNorm on smooth nonconvex functions, with similar convergence properties for AdAdaGrad.
   - Relax the Lipschitz smoothness condition on the objective function and adopt a more general notion of smoothness.
   - Demonstrate the effectiveness of the new methods through numerical experiments on image classification tasks.
4. **Experimental Results**:
   - Experiments were carried out on the MNIST and CIFAR-10 datasets, using logistic regression and three-layer CNN models.
   - The results show that AdAdaGrad and AdAdaGradNorm outperform standard SGD and AdaGrad in training efficiency and model generalization.

### Formula Summary:

- **Update Rules** (AdAdaGrad and AdAdaGradNorm perform their model updates with AdaGrad and AdaGradNorm, respectively):
  - AdaGrad (the square and inverse square root are applied coordinate-wise): \[ v_k = v_{k-1} + g_k^2, \quad x_{k+1} = x_k - \alpha g_k \odot v_k^{-1/2} \]
  - AdaGradNorm: \[ v_k = v_{k-1} + \|g_k\|^2, \quad x_{k+1} = x_k - \frac{\alpha g_k}{\sqrt{v_k}} \]
- **Norm Test**: \[ \delta_B(x) = \|\nabla F_B(x) - \nabla F(x)\| \leq \eta \|\nabla F(x)\| \]
  - Approximate form: \[ \frac{1}{b}\text{Var}_{i\in B}(\nabla f(x;\xi_i)) \leq \eta^2 \|\nabla F_B(x)\|^2 \]
- **Inner Product Test**: \[ \frac{1}{b}\mathbb{E}_k\left[\left(\langle\nabla f(x_k;\xi_i),\nabla F(x_k)\rangle-\|\nabla F(