Making Batch Normalization Great in Federated Deep Learning

Jike Zhong, Hong-You Chen, Wei-Lun Chao
2024-03-29
Abstract: Batch Normalization (BN) is widely used in centralized deep learning to improve convergence and generalization. However, in federated learning (FL) with decentralized data, prior work has observed that training with BN could hinder performance and suggested replacing it with Group Normalization (GN). In this paper, we revisit this substitution by expanding the empirical study conducted in prior work. Surprisingly, we find that BN outperforms GN in many FL settings. The exceptions are high-frequency communication and extreme non-IID regimes. We reinvestigate factors that are believed to cause this problem, including the mismatch of BN statistics across clients and the deviation of gradients during local training. We empirically identify a simple practice that could reduce the impacts of these factors while maintaining the strength of BN. Our approach, which we named FIXBN, is fairly easy to implement, without any additional training or communication costs, and performs favorably across a wide range of FL settings. We hope that our study could serve as a valuable reference for future practical usage and theoretical analysis in FL.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper examines the choice between Batch Normalization (BN) and Group Normalization (GN) in Federated Learning (FL). In centralized deep learning, BN typically improves convergence and generalization. Previous research found that BN can hurt performance in FL and suggested using GN instead. The authors re-examine this recommendation with a broader empirical study and find that BN actually outperforms GN in many FL settings, especially when communication is infrequent or the data are only moderately non-IID. BN falls behind in two regimes, high-frequency communication and extreme non-IID data, and the paper relates this to the mismatch of BN statistics across clients and the deviation of gradients during local training.

After investigating these factors, the authors propose FIXBN, a method that preserves the advantages of BN while reducing the negative effects of the statistics mismatch and the gradient deviation. FIXBN adds no training or communication cost. It works in two phases: training starts with standard BN, and after a number of communication rounds the BN layers are frozen and features are normalized with the globally accumulated statistics instead of the local batch statistics. In high-frequency communication settings this recovers gradients similar to those of centralized learning, and it removes the mismatch between the normalization statistics used in training and in testing. The empirical results show that FIXBN significantly improves on BN in high-frequency communication scenarios and performs better than both GN and BN across a variety of FL settings. The authors recommend taking the specific FL setting into account when selecting a normalization method, and they hope the study can serve as a reference for future FL practice and theoretical analysis.
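The two-phase behavior described above for FIXBN can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the class `ToyBatchNorm`, the helper `bn_is_frozen`, the `freeze_round` parameter, and the momentum value are all hypothetical names and choices made for illustration, and the layer handles a single scalar feature without learnable scale or shift. The summary does not say when the switch happens, so the freeze round is left as a free parameter.

```python
from dataclasses import dataclass
from statistics import fmean, pvariance


@dataclass
class ToyBatchNorm:
    """Single-feature batch norm with a switch between the two phases (illustrative only)."""
    running_mean: float = 0.0
    running_var: float = 1.0
    momentum: float = 0.1   # assumed value, not taken from the paper
    eps: float = 1e-5
    frozen: bool = False    # False: standard BN; True: fixed accumulated statistics

    def forward(self, batch):
        if self.frozen:
            # Phase 2: normalize with the accumulated statistics and leave them unchanged.
            mean, var = self.running_mean, self.running_var
        else:
            # Phase 1: normalize with the current batch statistics and
            # update the running estimates with an exponential moving average.
            mean, var = fmean(batch), pvariance(batch)
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        return [(x - mean) / (var + self.eps) ** 0.5 for x in batch]


def bn_is_frozen(round_idx, freeze_round):
    """Hypothetical schedule: use standard BN before `freeze_round`, fixed statistics from then on."""
    return round_idx >= freeze_round


if __name__ == "__main__":
    bn = ToyBatchNorm()
    for round_idx in range(4):
        bn.frozen = bn_is_frozen(round_idx, freeze_round=2)
        bn.forward([1.0, 3.0])
        print(round_idx, bn.frozen, bn.running_mean, bn.running_var)
```

In a real federated setup the accumulated statistics would come from the server-side aggregation across clients rather than from a single layer's running average; that aggregation step is omitted here to keep the sketch focused on the switch between the two normalization modes.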