Abstract: Local SGD is a popular optimization method in distributed learning, often outperforming other algorithms in practice, including mini-batch SGD. Despite this success, theoretically proving the dominance of local SGD in settings with reasonable data heterogeneity has been difficult, creating a significant gap between theory and practice. In this paper, we provide new lower bounds for local SGD under existing first-order data heterogeneity assumptions, showing that these assumptions are insufficient to prove the effectiveness of local update steps. Furthermore, under these same assumptions, we demonstrate the min-max optimality of accelerated mini-batch SGD, which fully resolves our understanding of distributed optimization for several problem classes. Our results emphasize the need for better models of data heterogeneity to understand the effectiveness of local SGD in practice. Towards this end, we consider higher-order smoothness and heterogeneity assumptions, providing new upper bounds that imply the dominance of local SGD over mini-batch SGD when data heterogeneity is low.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper explores the limitations and potential of Local Stochastic Gradient Descent (Local SGD) for distributed heterogeneous learning with intermittent communication. Specifically, the authors address the following key issues:

1. **Limitations of Existing Assumptions**:
   - Under existing first-order data heterogeneity assumptions, it has been difficult to prove that Local SGD outperforms other algorithms, especially Mini-batch SGD, at a reasonable degree of data heterogeneity. This leaves a gap between theory and practice.
   - By providing new lower bounds, the authors show that these assumptions are insufficient to prove the effectiveness of local update steps.
2. **Optimality of Accelerated Mini-batch SGD**:
   - Under the same data heterogeneity assumptions, the authors prove that Accelerated Mini-batch SGD is min-max optimal, which resolves the question of the optimal distributed algorithm for several problem classes.
   - These results highlight the need for better models of data heterogeneity to explain the practical effectiveness of Local SGD.
3. **Higher-Order Smoothness and Heterogeneity Assumptions**:
   - To better understand the advantages of Local SGD under low data heterogeneity, the authors consider higher-order smoothness and heterogeneity assumptions and provide new upper bounds.
   - These upper bounds imply that Local SGD can outperform Mini-batch SGD when data heterogeneity is low.

### Main Contributions

1. **Insufficiency of Existing Assumptions**:
   - A new lower bound is provided, showing that in the general convex setting, existing data heterogeneity assumptions are insufficient to prove the effectiveness of Local SGD.
   - The core finding is that there exists a smooth, convex, quadratic problem instance on which Local SGD cannot approach the clients' shared optimal solution within a limited number of communication rounds.
2. **Optimality of Accelerated Mini-batch SGD**:
   - A new algorithm-independent lower bound is provided, proving that Accelerated Mini-batch SGD is optimal when machines have a shared optimal solution.
   - This result widens the gap between theory and practice, but it also closes the line of research on finding the optimal algorithm under the relevant data heterogeneity assumptions.
3. **Advantages of Higher-Order Assumptions**:
   - A new upper bound is provided, improving the analysis of Local SGD by capturing the effects of second-order heterogeneity and third-order smoothness.
   - Under these assumptions, Local SGD can indeed outperform Mini-batch SGD when data heterogeneity is low.

### Conclusion

Through these theoretical contributions, the authors emphasize the need for better models of data heterogeneity to understand and explain the practical effectiveness of Local SGD. In particular, higher-order smoothness and heterogeneity assumptions provide a new perspective on the advantages of Local SGD under low data heterogeneity.
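For readers unfamiliar with the two methods being compared, the following is a minimal sketch of their update rules in the intermittent communication setting. It is an illustration under stated assumptions, not the paper's construction or analysis: each client holds a hypothetical scalar quadratic `f_m(x) = 0.5 * a_m * (x - b_m)**2`, gradients carry Gaussian noise of scale `sigma`, and the function names (`local_sgd`, `minibatch_sgd`, `stochastic_grad`) and parameters are invented for this example. Both methods use the same budget per round: `local_steps` stochastic gradients per client, followed by one communication.

```python
import random


def stochastic_grad(x, a, b, sigma, rng):
    """Noisy gradient of the toy client objective 0.5 * a * (x - b)**2."""
    return a * (x - b) + sigma * rng.gauss(0.0, 1.0)


def local_sgd(curvatures, optima, x0, lr, local_steps, rounds, sigma=0.0, seed=0):
    """Each client runs `local_steps` SGD steps from the shared iterate,
    then the client iterates are averaged (one communication per round)."""
    rng = random.Random(seed)
    x = x0
    for _ in range(rounds):
        client_iterates = []
        for a, b in zip(curvatures, optima):
            x_m = x  # every client starts the round at the shared iterate
            for _ in range(local_steps):
                x_m -= lr * stochastic_grad(x_m, a, b, sigma, rng)
            client_iterates.append(x_m)
        x = sum(client_iterates) / len(client_iterates)
    return x


def minibatch_sgd(curvatures, optima, x0, lr, local_steps, rounds, sigma=0.0, seed=0):
    """Each client computes `local_steps` stochastic gradients at the shared
    iterate; all of them are averaged into a single step per round."""
    rng = random.Random(seed)
    x = x0
    for _ in range(rounds):
        grads = [
            stochastic_grad(x, a, b, sigma, rng)
            for a, b in zip(curvatures, optima)
            for _ in range(local_steps)
        ]
        x -= lr * sum(grads) / len(grads)
    return x


if __name__ == "__main__":
    # Assumed toy instance: four clients with different curvatures
    # but a shared optimum at x = 1.0.
    curvatures = [0.5, 1.0, 1.5, 2.0]
    optima = [1.0, 1.0, 1.0, 1.0]
    print(local_sgd(curvatures, optima, 0.0, 0.1, 5, 20, sigma=0.1))
    print(minibatch_sgd(curvatures, optima, 0.0, 0.1, 5, 20, sigma=0.1))
```

The only structural difference is where the gradients are evaluated: Local SGD evaluates them along each client's own trajectory between communications, while Mini-batch SGD evaluates all of them at the shared iterate. A toy run like this says nothing about the worst-case guarantees discussed above, and a fair comparison would also tune the step size separately for each method.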

The Limits and Potentials of Local SGD for Distributed Heterogeneous Learning with Intermittent Communication

Convergence of Distributed Adaptive Optimization with Local Updates

Communication-Efficient Adaptive Batch Size Strategies for Distributed Local Gradient Methods

Multi-Level Local SGD for Heterogeneous Hierarchical Networks

The Effectiveness of Local Updates for Decentralized Learning under Data Heterogeneity

SLowcal-SGD: Slow Query Points Improve Local-SGD for Stochastic Convex Optimization

Communication-Efficient Local Decentralized SGD Methods

Distributed Stochastic Optimization with Random Communication and Computational Delays: Optimal Policies and Performance Analysis

RCD-SGD: Resource-Constrained Distributed SGD in Heterogeneous Environment via Submodular Partitioning

Tackling Data Heterogeneity: A New Unified Framework for Decentralized SGD with Sample-induced Topology

Accelerating Local SGD for Non-Iid Data Using Variance Reduction

A Unified Theory of Decentralized SGD with Changing Topology and Local Updates

Data Dependent Convergence for Distributed Stochastic Optimization

Local SGD for Near-Quadratic Problems: Improving Convergence under Unconstrained Noise Conditions

Distributed Gradient Descent with Many Local Steps in Overparameterized Models

Statistical Estimation and Online Inference via Local SGD

Convergence Analysis of Distributed Stochastic Gradient Descent with Shuffling

Local Methods with Adaptivity via Scaling

Adaptive Stochastic Gradient Descent for Fast and Communication-Efficient Distributed Learning

On the Convergence of Local Descent Methods in Federated Learning

STL-SGD: Speeding Up Local SGD with Stagewise Communication Period