Abstract: Explicit Congestion Notification (ECN) is an extension to IP and TCP that plays a crucial role in congestion control in Data Center Networks (DCNs). Most DCNs use a single queue in each switch port. However, in production DCNs, the industry trend is moving toward more than one queue per port. Therefore, MQ-ECN, an ECN scheme for multi-service multi-queue data centers, was proposed to support this setting; DemePro later improved on MQ-ECN. Yet both schemes suffer from non-negligible overhead and imprecise measurement. In addition, MQ-ECN is designed for round-based schedulers, while DemePro can contribute to overflow problems. Moreover, ECN was designed for single-queue scenarios, and MQ-ECN is harmful at least for flow scheduling. To address these problems, we build on the shortcomings of MQ-ECN and propose DC-ECN (Data Center-Explicit Congestion Notification), a machine-learning-based dynamic threshold control scheme for ECN marking in DCNs and a first systematic solution to the problems mentioned above. The main idea of DC-ECN is to separate mice and elephant flows into dual coupled queues using machine learning, and then to place each flow in the appropriate queue, each with its own ECN marking threshold, to achieve low latency and high throughput. Also, by dynamically increasing and decreasing the ECN marking threshold of the elephant buffer, DC-ECN never marks mice flows and absorbs micro-bursts of mice traffic, achieving the lowest latency without sacrificing throughput. Our mathematical analysis and simulations demonstrate that, in steady state, DC-ECN achieves 21.8% and 16.5% lower flow completion time than MQ-ECN and DemePro, respectively.
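The queueing idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the classifier or the threshold update rule, so the fixed byte cutoff in `is_elephant` merely stands in for the machine-learning classifier, and the names `DualQueuePort`, `update_threshold`, `k_min`, `k_max`, `step` and `mice_high`, together with all numeric values, are hypothetical. The sketch only shows the two properties the abstract states: mice packets are never ECN-marked, and the elephant queue's marking threshold moves up and down over time.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Packet:
    flow_id: int
    flow_bytes_seen: int  # bytes the flow has sent so far (hypothetical feature)
    ce_marked: bool = False  # True once the packet carries an ECN congestion mark


def is_elephant(pkt: Packet, cutoff_bytes: int = 100_000) -> bool:
    # Stand-in for the learned classifier: a fixed byte cutoff (assumed value).
    return pkt.flow_bytes_seen >= cutoff_bytes


@dataclass
class DualQueuePort:
    k_min: int = 5       # lowest elephant marking threshold, in packets (assumed)
    k_max: int = 65      # highest elephant marking threshold, in packets (assumed)
    step: int = 5        # threshold change per update (assumed)
    mice_high: int = 10  # mice backlog treated as a micro-burst (assumed)
    k: int = 65          # current elephant marking threshold
    mice: deque = field(default_factory=deque)
    elephants: deque = field(default_factory=deque)

    def update_threshold(self) -> None:
        # Illustrative rule only: lower K while mice are bursting so that
        # elephants are marked earlier, and raise it again otherwise.
        if len(self.mice) >= self.mice_high:
            self.k = max(self.k_min, self.k - self.step)
        else:
            self.k = min(self.k_max, self.k + self.step)

    def enqueue(self, pkt: Packet) -> None:
        self.update_threshold()
        if is_elephant(pkt):
            # Only elephant packets can be marked, and only when the
            # elephant queue length has reached the current threshold K.
            if len(self.elephants) >= self.k:
                pkt.ce_marked = True
            self.elephants.append(pkt)
        else:
            # Mice packets are queued separately and are never marked.
            self.mice.append(pkt)
```

In this sketch a burst of mice packets pushes the elephant threshold toward `k_min`, so elephant flows receive congestion marks sooner while the mice queue itself stays unmarked. A real switch would also dequeue packets and measure occupancy in bytes, which is omitted here.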

RDMA Based Congestion Control Strategy for DML Training Optimization in Data Center Networks

Traffic Management for Distributed Machine Learning in RDMA-enabled Data Center Networks

Efficient Communication Scheduling for Parameter Synchronization of DML in Data Center Networks

Impact of Network Topology on the Performance of DML: Theoretical Analysis and Practical Factors

Congestion-aware Critical Gradient Scheduling for Distributed Machine Learning in Data Center Networks

An Efficient Distributed Machine Learning Framework in Wireless D2D Networks: Convergence Analysis and System Implementation

L3DML: Facilitating Geo-Distributed Machine Learning in Network Layer

MLTCP: Congestion Control for DNN Training

DSANA: A Distributed Machine Learning Acceleration Solution Based on Dynamic Scheduling and Network Acceleration

A Scalable, High-Performance, and Fault-Tolerant Network Architecture for Distributed Machine Learning

TSEngine: Enable Efficient Communication Overlay in Distributed Machine Learning in WANs

Poster: Chameleon: Automatic and Adaptive Tuning for DCQCN Parameters in RDMA Networks

Impact of Synchronization Topology on DML Performance: Both Logical Topology and Physical Topology

RECC: Joint Congestion Control Based on RTT and ECN for High-speed RDMA Networks

DCQCN+: Taming Large-Scale Incast Congestion in RDMA over Ethernet Networks

DC-ECN: A machine-learning based dynamic threshold control scheme for ECN marking in DCN

Job-aware Communication Scheduling for DML Training in Shared Cluster

Accelerating Model Synchronization for Distributed Machine Learning in an Optical Wide Area Network

Reliable and Efficient RAR-based Distributed Model Training in Computing Power Network

Joint Model Pruning and Topology Construction for Accelerating Decentralized Machine Learning

DLB: A Dynamic Load Balance Strategy for Distributed Training of Deep Neural Networks