Abstract: Mobile edge computing (MEC) is a computing paradigm that pushes computation and storage resources to the edge of the network. Interconnected edge servers form small-scale data centers, enabling MEC to provide low-latency network services for mobile users. Remote Direct Memory Access (RDMA) is now widely deployed in such data centers to reduce CPU overhead and network latency. Many congestion control (CC) mechanisms have been proposed for RDMA data centers, aiming to provide low-latency data delivery and high-throughput network services. However, our fine-grained experimental analysis reveals that existing congestion control mechanisms still have performance limitations due to inappropriate congestion notifications and a long congestion feedback cycle. In this paper, we propose Mercury, an accurate and fast congestion feedback mechanism. Mercury comprises two key components: (1) state-driven congestion detection and (2) window-based congestion notification. Specifically, state-driven congestion detection monitors the queue length and the number of packets received at the switch egress port when Priority-based Flow Control (PFC) is triggered. It determines the states of egress ports and identifies the flows that actually contribute to congestion. Window-based congestion notification then calculates window sizes for the congested flows and rapidly returns Congestion Notification Packets (CNPs) carrying the window information to the senders, which facilitates the rate adjustment of congested flows. Mercury is compatible with existing RDMA CC mechanisms and can be easily implemented in switches. We use real-world data sets and conduct both micro-benchmark and large-scale simulations to evaluate the performance of Mercury. The results indicate that, thanks to its accurate and fast congestion feedback, Mercury reduces the 99th-percentile flow completion time by up to 45.1%, 41.8%, 38.7%, 30.9%, and 37.9% compared with Timely, DCQCN, DCQCN+TCD, PACC, and HPCC, respectively.
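The abstract names the two components but gives neither the detection rules nor the window formula. The following Python sketch only illustrates the general idea and is not Mercury's actual algorithm. It assumes that a port counts as congested when its queue exceeds a threshold and it is not itself paused by downstream PFC, that a flow contributes to congestion when it sends more than the average packet count of the active flows, and that the notified window is an equal share of the bandwidth-delay product. All names (`PortState`, `EgressPort`, `classify`, `congested_flow_windows`) and all thresholds are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class PortState(Enum):
    IDLE = "idle"            # queue below the threshold
    CONGESTED = "congested"  # queue built up by flows arriving at this port
    PAUSED = "paused"        # queue built up because downstream PFC paused this port


@dataclass
class EgressPort:
    queue_len: int                 # bytes currently queued
    paused_by_pfc: bool            # True if a downstream PAUSE frame stopped this port
    pkts_per_flow: Dict[str, int]  # packets received per flow in the current interval


def classify(port: EgressPort, queue_threshold: int) -> PortState:
    """Assumed state rule: a long queue on a paused port is a victim, not a root."""
    if port.queue_len < queue_threshold:
        return PortState.IDLE
    return PortState.PAUSED if port.paused_by_pfc else PortState.CONGESTED


def congested_flow_windows(port: EgressPort, queue_threshold: int,
                           bdp_bytes: int) -> Dict[str, int]:
    """Return {flow_id: window_bytes} for the flows that would receive a CNP."""
    if classify(port, queue_threshold) is not PortState.CONGESTED:
        return {}  # idle or merely paused ports generate no notifications
    active = {f: n for f, n in port.pkts_per_flow.items() if n > 0}
    if not active:
        return {}
    average = sum(active.values()) / len(active)
    window = bdp_bytes // len(active)  # assumed: equal share of the BDP
    # Assumed contribution rule: notify only flows sending more than the average.
    return {f: window for f, n in active.items() if n > average}
```

Under these assumptions, a port that is merely paused by downstream PFC yields no notifications, so flows that only pass through it are not throttled; this is one way to read "flows that really contribute to congestion".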

Accurate and Fast Congestion Feedback in MEC-enabled RDMA Datacenters
