Abstract: Batch Normalization (BN) is widely used in centralized deep learning to improve convergence and generalization. However, in federated learning (FL) with decentralized data, prior work has observed that training with BN could hinder performance and suggested replacing it with Group Normalization (GN). In this paper, we revisit this substitution by expanding the empirical study conducted in prior work. Surprisingly, we find that BN outperforms GN in many FL settings. The exceptions are high-frequency communication and extreme non-IID regimes. We reinvestigate factors that are believed to cause this problem, including the mismatch of BN statistics across clients and the deviation of gradients during local training. We empirically identify a simple practice that could reduce the impacts of these factors while maintaining the strength of BN. Our approach, which we named FIXBN, is fairly easy to implement, without any additional training or communication costs, and performs favorably across a wide range of FL settings. We hope that our study could serve as a valuable reference for future practical usage and theoretical analysis in FL.

What problem does this paper attempt to address?

This paper examines the choice between Batch Normalization (BN) and Group Normalization (GN) in Federated Learning (FL). In centralized deep learning, BN typically improves convergence speed and generalization. Prior work found that BN can reduce performance in FL and suggested using GN instead. The authors re-examine this recommendation and find that BN actually outperforms GN in many FL settings, especially when communication is infrequent or the data are only mildly non-IID.

BN does perform poorly under high communication frequency and extreme non-IID distributions. The paper relates this to two factors: the mismatch of BN statistics across clients and the deviation of gradients during local training. The authors investigate these factors and propose a method called FIXBN, which preserves the advantages of BN while reducing the negative effects of both. FIXBN requires no additional training or communication cost and performs well across a wide range of FL settings.

FIXBN works in two stages. In the initial stage, training uses standard BN. In later communication rounds, the BN layers are frozen and features are normalized with the globally accumulated statistics. In high communication frequency settings this recovers gradients similar to those of centralized learning, and it removes the mismatch between the normalization statistics used in training and in testing.

Through an extensive empirical study, the authors characterize how BN behaves in various FL settings and recommend taking the specific FL setting into account when selecting a normalization method. FIXBN significantly improves the performance of BN in high communication frequency scenarios and outperforms both GN and standard BN in a variety of FL settings. The authors hope that this research can serve as a reference for future FL practice and theoretical analysis.
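The two-stage behavior described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the class `SketchBN`, the helper `set_phase`, the `freeze_round` threshold, and the momentum value are all hypothetical names and choices introduced here for illustration, and learnable scale and shift parameters as well as server-side aggregation are omitted.

```python
import numpy as np


class SketchBN:
    """Toy batch normalization over the feature axis (no scale/shift)."""

    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        # Accumulated statistics; in FL these would come from the
        # aggregated global model (assumption for this sketch).
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum = momentum
        self.eps = eps
        self.frozen = False  # False: standard BN stage, True: fixed stage

    def forward(self, x):
        if self.frozen:
            # Later stage: normalize with the fixed accumulated statistics
            # and leave them untouched.
            mean, var = self.running_mean, self.running_var
        else:
            # Initial stage: normalize with mini-batch statistics and
            # update the running averages, as in standard BN training.
            mean, var = x.mean(axis=0), x.var(axis=0)
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        return (x - mean) / np.sqrt(var + self.eps)


def set_phase(bn, round_idx, freeze_round):
    """Freeze the layer once the (hypothetical) threshold round is reached."""
    bn.frozen = round_idx >= freeze_round
```

A client would call `set_phase(bn, round_idx, freeze_round)` at the start of each communication round. Once the layer is frozen, the same statistics are used during local training and at test time, which is the train/test consistency the summary refers to.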

Making Batch Normalization Great in Federated Deep Learning

Why Batch Normalization Damage Federated Learning on Non-IID Data?

Overcoming the Challenges of Batch Normalization in Federated Learning

Understanding the Training Dynamics in Federated Deep Learning via Aggregation Weight Optimization

FedBN: Federated Learning on Non-IID Features via Local Batch Normalization

BN-SCAFFOLD: controlling the drift of Batch Normalization statistics in Federated Learning

Generalized Batch Normalization: Towards Accelerating Deep Neural Networks

FedWon: Triumphing Multi-domain Federated Learning Without Normalization

Research Progress on Batch Normalization of Deep Learning and Its Related Algorithms

Experimenting With Normalization Layers in Federated Learning on Non-IID Scenarios

Decorrelated Batch Normalization

Unified Batch Normalization: Identifying and Alleviating the Feature Condensation in Batch Normalization and a Unified Framework

Effective and Efficient Batch Normalization Using a Few Uncorrelated Data for Statistics Estimation

Byzantine-resilient Federated Learning Employing Normalized Gradients on Non-IID Datasets

FedFN: Feature Normalization for Alleviating Data Heterogeneity Problem in Federated Learning

Test-time Batch Normalization

Towards Stabilizing Batch Statistics in Backward Propagation of Batch Normalization

Revisiting Batch Normalization for Practical Domain Adaptation

Stochastic Whitening Batch Normalization

FedStein: Enhancing Multi-Domain Federated Learning Through James-Stein Estimator