Empowering Federated Learning with Implicit Gossiping: Mitigating Connection Unreliability Amidst Unknown and Arbitrary Dynamics

Ming Xiang, Stratis Ioannidis, Edmund Yeh, Carlee Joe-Wong, Lili Su
2024-04-16
Abstract: Federated learning is a popular distributed learning approach for training a machine learning model without disclosing raw data. It consists of a parameter server and a possibly large collection of clients (e.g., in cross-device federated learning) that may operate in congested and changing environments. In this paper, we study federated learning in the presence of stochastic and dynamic communication failures wherein the uplink between the parameter server and client $i$ is on with unknown probability $p_i^t$ in round $t$. Furthermore, we allow the dynamics of $p_i^t$ to be arbitrary.
Distributed, Parallel, and Cluster Computing; Machine Learning
What problem does this paper attempt to address?
The paper addresses how to run Federated Learning effectively under random, time-varying communication failures. Specifically, it asks how existing Federated Learning algorithms can be modified so that they still converge to a stationary point of the global objective function when the uplink between each client and the parameter server is unreliable.

### Background and Problem Description

1. **Basic Concept of Federated Learning**:
   - Federated Learning is a distributed machine learning method in which a parameter server and multiple clients (such as mobile devices) collaboratively train a model without sharing raw data.
   - In each training round, clients process their local data and report updates to the parameter server, which aggregates these updates into a new model.
2. **Communication Unreliability**:
   - In practice, Federated Learning systems are often deployed in congested and uncontrollable environments, for example on mobile devices such as smartphones and IoT devices.
   - Client mobility and environmental complexity make communication unreliable, and the degree of unreliability can vary significantly over time and across devices.
   - For example, a smartphone on a train may lose its connection to the base station while passing through a tunnel.
3. **Limitations of Existing Research**:
   - Previous research assumes that communication failures are symmetric and have fixed statistical properties.
   - Some studies consider time-varying communication constraints but assume that the set of available clients evolves as a homogeneous Markov chain with a steady-state distribution.
   - These assumptions rarely hold in practical Federated Learning systems, where the connection probability $p_i^t$ of each client changes over time with unknown and arbitrary dynamics.

### Main Contributions of the Paper

1. **Problem Identification**:
   - The authors show, theoretically and numerically, that when the connection probabilities $p_i^t$ are not uniform across clients, the most widely adopted Federated Learning algorithm, Federated Averaging (FedAvg), fails to minimize the global objective function, even for simple convex loss functions.
2. **Proposed New Algorithm**:
   - The authors propose Federated Postponed Broadcast (FedPBC), a simple variant of FedAvg. In FedPBC, the parameter server postpones broadcasting the global model until the end of each round.
   - With the postponed broadcast, FedPBC converges to a stationary point of a non-convex global objective function even in the presence of uplink failures.
   - The postponed broadcast induces an implicit gossiping mechanism: in each round, information is mixed among the clients whose links are active, which mitigates the bias caused by non-uniform and time-varying connection probabilities.
3. **Experimental Validation**:
   - The authors run extensive experiments on three real-world datasets to validate the algorithm.
   - The results show that FedPBC performs well under a range of unreliable uplink patterns, including time-varying and time-invariant Bernoulli, Markov, and periodic patterns.

### Conclusion

The paper tackles the convergence of Federated Learning under random and dynamic communication failures by proposing the FedPBC algorithm. The experimental results support its effectiveness, offering a way to make Federated Learning more reliable and robust in practical deployments.
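To make the bias described under the contributions concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes scalar quadratic losses $f_i(x) = \frac{1}{2}(x - c_i)^2$, a single local gradient step per round, a reliable downlink, and time-invariant Bernoulli uplinks; the function name `simulate`, the targets, the probabilities, and the step size are all hypothetical choices made for illustration. The postponed-broadcast branch follows one reading of the summary: clients keep their own models, and only the clients whose uplink is active in a round receive the aggregate at the end of that round.

```python
import random


def simulate(postpone_broadcast, targets, probs, eta=0.005, rounds=20000, seed=0):
    """Toy run with scalar losses f_i(x) = 0.5 * (x - targets[i])**2.

    Returns (time-averaged server model over the second half of the run,
             mean of the clients' local models after the last round).
    """
    rng = random.Random(seed)
    n = len(targets)
    server = 0.0
    local = [0.0] * n
    tail = []
    for t in range(rounds):
        if not postpone_broadcast:
            # FedAvg-style: every client restarts from the current global model.
            local = [server] * n
        # One local gradient step per client (an assumption of this sketch).
        local = [x - eta * (x - c) for x, c in zip(local, targets)]
        # Uplink of client i is on with probability probs[i] in this round.
        active = [i for i in range(n) if rng.random() < probs[i]]
        if active:
            # The server averages only the models it actually received.
            server = sum(local[i] for i in active) / len(active)
            if postpone_broadcast:
                # Postponed broadcast: only active clients adopt the aggregate,
                # so they are implicitly averaged with one another.
                for i in active:
                    local[i] = server
        if t >= rounds // 2:
            tail.append(server)
    return sum(tail) / len(tail), sum(local) / n


# Hypothetical setup: two well-connected clients and two poorly connected ones.
targets = [0.0, 0.0, 10.0, 10.0]   # global minimizer is the mean, 5.0
probs = [0.9, 0.9, 0.1, 0.1]

fedavg_server, _ = simulate(False, targets, probs)
pbc_server, pbc_client_mean = simulate(True, targets, probs)
print("FedAvg-style server model:      ", round(fedavg_server, 3))
print("Postponed-broadcast server model:", round(pbc_server, 3))
print("Postponed-broadcast client mean: ", round(pbc_client_mean, 3))
```

In this toy setting the FedAvg-style server model is pulled toward the targets of the well-connected clients, while the postponed-broadcast variant stays much closer to the true minimizer. The reason is that averaging among active clients preserves the sum of all local models, so the connection pattern cannot skew their mean. The sketch says nothing about the non-convex guarantees or the other uplink patterns studied in the paper.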