Abstract: Distributed machine learning (DML) can be an important capability for modern militaries, allowing them to take advantage of data and devices distributed at multiple vantage points to adapt and learn. Existing DML frameworks, however, cannot realize the full benefits of DML because they are all based on simple linear aggregation, which cannot handle the $\textit{divergence challenges}$ arising in military settings: the learning data at different devices can be heterogeneous ($\textit{i.e.}$, Non-IID data), leading to model divergence, while the ability of devices to communicate is substantially limited ($\textit{i.e.}$, weak connectivity due to sparse and dynamic communications), reducing their ability to reconcile model divergence. In this paper, we introduce a novel DML framework called aggregation in the mirror space (AIMS), which allows a DML system to introduce a general mirror function that maps a model into a mirror space, where aggregation and gradient descent are conducted. By adapting the convexity of the mirror function according to the divergence force, AIMS allows automatic optimization of DML. We conduct both rigorous analysis and extensive experimental evaluations to demonstrate the benefits of AIMS. For example, we prove that AIMS achieves a loss of $O\left((\frac{m^{r+1}}{T})^{\frac1r}\right)$ after $T$ network-wide updates, where $m$ is the number of devices and $r$ is the convexity of the mirror function; existing linear aggregation frameworks are the special case $r=2$. Our experimental evaluations using EMANE (Extendable Mobile Ad-hoc Network Emulator) for military communications settings show similar results: AIMS can improve the DML convergence rate by up to 57\% and scales well to more devices with weak connectivity, all with little additional computation overhead compared to traditional linear aggregation.
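
To make the idea of aggregating in a mirror space concrete, the following is a minimal sketch and not the paper's implementation. The abstract does not specify the mirror function, so the sketch assumes a power-type map $\phi_r(w) = \mathrm{sign}(w)\,|w|^{r-1}$ applied elementwise; the function names `mirror`, `mirror_inverse`, and `aggregate` are hypothetical. Under this assumption, $r=2$ makes the map the identity, so the procedure reduces to ordinary linear averaging, matching the special case named in the abstract (for which the stated bound becomes $O\left((\frac{m^{3}}{T})^{\frac12}\right)$).

```python
import math
from typing import List


def mirror(w: List[float], r: float) -> List[float]:
    """Assumed mirror map: sign(w) * |w|^(r-1), applied elementwise."""
    if r <= 1:
        raise ValueError("r must be greater than 1")
    return [math.copysign(abs(x) ** (r - 1), x) for x in w]


def mirror_inverse(z: List[float], r: float) -> List[float]:
    """Inverse of the assumed mirror map: sign(z) * |z|^(1/(r-1))."""
    if r <= 1:
        raise ValueError("r must be greater than 1")
    return [math.copysign(abs(x) ** (1.0 / (r - 1)), x) for x in z]


def aggregate(models: List[List[float]], r: float) -> List[float]:
    """Average the device models in the mirror space, then map back."""
    mapped = [mirror(w, r) for w in models]
    m = len(models)
    averaged = [sum(column) / m for column in zip(*mapped)]
    return mirror_inverse(averaged, r)


# Two hypothetical device models with two parameters each.
device_models = [[1.0, -2.0], [3.0, 2.0]]

linear = aggregate(device_models, r=2)   # identity map: plain averaging
curved = aggregate(device_models, r=3)   # averages signed squares instead
```

With larger $r$, the assumed map weights large-magnitude parameters more heavily during averaging, which is one way a single scalar could tune how aggressively divergent models are pulled together. How AIMS actually adapts $r$ to the divergence force is not described in the abstract and is not modeled here.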

BML: A High-performance, Low-cost Gradient Synchronization Algorithm for DML Training

A Scalable, High-Performance, and Fault-Tolerant Network Architecture for Distributed Machine Learning

WBSP: Addressing Stragglers in Distributed Machine Learning with Worker-Busy Synchronous Parallel

Impact of Network Topology on the Performance of DML: Theoretical Analysis and Practical Factors

Impact of Synchronization Topology on DML Performance: Both Logical Topology and Physical Topology

LBB: load-balanced batching for efficient distributed learning on heterogeneous GPU cluster

Congestion-aware Critical Gradient Scheduling for Distributed Machine Learning in Data Center Networks

Adaptive Partitioning and Efficient Scheduling for Distributed DNN Training in Heterogeneous IoT Environment

HiPS - Hierarchical Parameter Synchronization in Large-Scale Distributed Machine Learning

Communication-Efficient Training Workload Balancing for Decentralized Multi-Agent Learning

Fela: Incorporating Flexible Parallelism and Elastic Tuning to Accelerate Large-Scale DML

When Less is More: Achieving Faster Convergence in Distributed Edge Machine Learning

Joint Dynamic Grouping and Gradient Coding for Time-Critical Distributed Machine Learning in Heterogeneous Edge Networks

DBS: Dynamic Batch Size For Distributed Deep Neural Network Training

Boosting Distributed Machine Learning Training Through Loss-tolerant Transmission Protocol

AutoDDL: Automatic Distributed Deep Learning With Near-Optimal Bandwidth Cost

Cloudless-Training: A Framework to Improve Efficiency of Geo-Distributed ML Training

Aggregation in the Mirror Space (AIMS): Fast, Accurate Distributed Machine Learning in Military Settings

MLlib*: Fast Training of GLMs Using Spark MLlib

Semi-Dynamic Load Balancing: Efficient Distributed Learning in Non-Dedicated Environments

Hiding Communication Cost in Distributed LLM Training via Micro-batch Co-execution