Abstract: Recently, sharpness-aware minimization (SAM) has emerged as a promising method to improve generalization by minimizing sharpness, which is known to correlate well with generalization ability. Since SAM was first proposed, many variants have been introduced to improve its accuracy and efficiency, but comparisons have mainly been restricted to the i.i.d. setting. In this paper we study SAM for out-of-distribution (OOD) generalization. First, we perform a comprehensive comparison of eight SAM variants on zero-shot OOD generalization, finding that the original SAM outperforms the Adam baseline by $4.76\%$ and the strongest SAM variants outperform the Adam baseline by $8.01\%$ on average. We then provide an OOD generalization bound in terms of sharpness for this setting. Next, we extend our study of SAM to the related setting of gradual domain adaptation (GDA), another form of OOD generalization in which intermediate domains are constructed between the source and target domains and iterative self-training is performed on these intermediate domains to reduce the overall target domain error. In this setting, our experimental results demonstrate that the original SAM outperforms the Adam baseline on each of the experimental datasets, by $0.82\%$ on average, and the strongest SAM variants outperform Adam by $1.52\%$ on average. We then provide a generalization bound for SAM in the GDA setting. Asymptotically, this generalization bound is no better than the one for self-training in the GDA literature. This highlights a further disconnect between the theoretical justification for SAM and its empirical performance, consistent with recent work finding that low sharpness alone does not account for all of SAM's generalization benefits. For future work, we provide several potential avenues for obtaining a tighter analysis of SAM in the OOD setting.
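To make the notion of "minimizing sharpness" concrete, the following is a minimal sketch of one step of the original SAM update as it is commonly described: perturb the weights by a step of norm rho along the normalized gradient, then descend using the gradient taken at the perturbed point. The names `sam_step`, `grad_fn`, and `loss`, the toy quadratic loss, and the values of `lr` and `rho` are illustrative assumptions, not taken from the paper. The plain gradient step used here also stands in for the Adam-based optimizers the abstract refers to.

```python
import numpy as np


def sam_step(w, grad_fn, lr=0.05, rho=0.05):
    """One SAM-style update (first-order approximation); a sketch, not the paper's code."""
    g = grad_fn(w)
    # Approximate worst-case perturbation inside an L2 ball of radius rho.
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    # Descend using the gradient evaluated at the perturbed weights.
    g_sam = grad_fn(w + eps)
    return w - lr * g_sam


# Hypothetical toy problem: quadratic loss L(w) = 0.5 * w^T A w.
A = np.diag([10.0, 1.0])


def loss(w):
    return 0.5 * w @ A @ w


def grad_fn(w):
    return A @ w


w = np.array([1.0, 1.0])
initial_loss = loss(w)
for _ in range(100):
    w = sam_step(w, grad_fn)

print(f"initial loss: {initial_loss:.4f}, final loss: {loss(w):.6f}")
```

On this toy problem the iterates settle close to the minimizer rather than exactly on it, because the perturbation keeps a fixed norm rho even when the gradient is small.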

Critical Influence of Overparameterization on Sharpness-aware Minimization

How Sharpness-Aware Minimization Minimizes Sharpness?

Towards Understanding Sharpness-Aware Minimization

Sharpness-Aware Minimization Efficiently Selects Flatter Minima Late in Training

Why Does Sharpness-Aware Minimization Generalize Better Than SGD?

A Universal Class of Sharpness-Aware Minimization Algorithms

Sharpness-Aware Minimization Enhances Feature Quality via Balanced Learning

Make Sharpness-Aware Minimization Stronger: A Sparsified Perturbation Approach

Friendly Sharpness-Aware Minimization

Sharpness-Aware Minimization Revisited: Weighted Sharpness as a Regularization Term

Bilateral Sharpness-Aware Minimization for Flatter Minima

Sharpness-Aware Training for Free

Fundamental Convergence Analysis of Sharpness-Aware Minimization

The Crucial Role of Normalization in Sharpness-Aware Minimization

Implicit Regularization of Sharpness-Aware Minimization for Scale-Invariant Problems

Convergence of Sharpness-Aware Minimization Algorithms using Increasing Batch Size and Decaying Learning Rate

Practical Sharpness-Aware Minimization Cannot Converge All the Way to Optima

Sharpness Minimization Algorithms Do Not Only Minimize Sharpness To Achieve Better Generalization

Towards Efficient and Scalable Sharpness-Aware Minimization

Towards Understanding the Role of Sharpness-Aware Minimization Algorithms for Out-of-Distribution Generalization

On Memorization and Privacy Risks of Sharpness Aware Minimization